[Libre-soc-bugs] [Bug 413] DIV "trial" blocks are too large

bugzilla-daemon at libre-soc.org bugzilla-daemon at libre-soc.org
Fri Jul 3 14:28:14 BST 2020


https://bugs.libre-soc.org/show_bug.cgi?id=413

--- Comment #14 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #12)
> I am planning on rewriting DivPipeCore to allow statically disabling some
> ops (sqrt, rsqrt),

i started on that last night, added an option to DivPipeCoreConfig to
select a subset of operations.  nothing sophisticated however it cut
the gates needed in half.

> it will also support looping the pipeline back around on
> itself to support 32-bit ops by running the op through the pipeline twice
> and 64-bit ops by running through the pipeline 4 times.

yes i like this idea.  i did actually create a pipeline "loop" example,
however i ran into difficulties synchronising the incoming and outgoing
data: it ran into a combinatorial loop, so i stopped work on it at the time.

it was designed for exactly these scenarios: feedback data for further
processing.

the issue is that you need not just the main pipeline, you need a fan-in
and a fan-out, where the fan-out will check the loop count and send back
if non-zero, and send forward if zero.

the loopback fan-in needs to have *absolute* top priority (blocking incoming
data), because otherwise the entire pipeline must stall... or throw away the
data.  both are bad options, so the loopback fan-in has to have top priority.
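
The fan-in / fan-out arrangement described above can be sketched as a small
cycle-by-cycle behavioural model. This is plain Python, not nmigen, and it is
not the abandoned "loop" example: the function name `simulate`, the `depth`
parameter and the `(name, passes)` tuples are all assumptions made up for
illustration. It only shows the arbitration rule: data looping back always
wins the pipeline entry slot, and new ops are held off for that cycle.

```python
from collections import deque

def simulate(ops, depth=2):
    """Hypothetical model. ops is a list of (name, passes_needed).

    Returns a list of (cycle, name) in the order results leave the loop.
    """
    incoming = deque(ops)
    stages = [None] * depth          # stages[0] = entry, stages[-1] = exit
    done = []
    cycle = 0
    while incoming or any(s is not None for s in stages):
        # fan-out: look at what is leaving the last stage this cycle
        out = stages[-1]
        loopback = None
        if out is not None:
            name, remaining = out
            if remaining == 0:
                done.append((cycle, name))   # loop count zero: send forward
            else:
                loopback = out               # non-zero: send back round
        # fan-in: loopback has absolute priority over incoming data
        if loopback is not None:
            entry = loopback
        elif incoming:
            entry = incoming.popleft()
        else:
            entry = None
        if entry is not None:
            entry = (entry[0], entry[1] - 1)  # one pass consumed on entry
        stages = [entry] + stages[:-1]        # advance the pipeline one clock
        cycle += 1
    return done

if __name__ == "__main__":
    # "a" needs two passes, "b" and "c" one each: "c" is delayed by one
    # cycle because "a" is looping back when "c" would otherwise issue.
    for cycle, name in simulate([("a", 2), ("b", 1), ("c", 1)]):
        print(cycle, name)
```

Nothing is ever stalled inside the pipeline or dropped in this model: the only
cost of a loopback is that issue of a new op is postponed by one cycle.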


> It will also support
> 8 and 16-bit ops run through once. That will make scheduling a little harder
> to implement, but should still be quite trivial, since all that's needed is
> to delay issuing ops for each cycle that something is looping back.

yes, it needs proper synchronisation: a 2-to-1 at the top and a 1-to-2 at
the bottom.

my thoughts are that this should be done using the ReservationStations
class, because it already exists and i know it works.

whereas the fan-in fan-out experiment failed to work properly, particularly
when mask (cancellation) is involved.

i will draw it out, give me a couple of hours.

basically you have *TWO* context IDs (ctx.mux_id - see FPPipeContext),
and double the number of ReservationStations (quadruple if you need
4x recycling).

the recycling (looping) is done by:

* connecting the CompUnit inputs to RS #1

* letting RS #1 schedule a slot to the pipeline, with "ctx.mux_id=1"

* RS #1 receives the 1st "processed" data and its output is connected
  to *RS #2* input

* RS #2 now has ctx.mux_id=2

* RS #2 puts the data into the pipeline a 2nd time, passes to RS #3
...
...
* RS #4 receives the data processed the 4th time and this is connected
  to the CompUnit result latches.


now, if we need to do 8x ReservationStations (for the large GPU) then,
with 4x recycling, this becomes *32* ReservationStations where the
connectivity chain is:


* src-CompUnit 0 -> RS 0 -> RS 8 -> RS 16 -> RS 24 -> dest-CompUnit 0
* src-CompUnit 1 -> RS 1 -> RS 9 -> RS 17 -> RS 25 -> dest-CompUnit 1
...
....
* src-CompUnit 7 -> RS 7 -> RS 15 ..... dest-CompUnit 7

so it's "striped" connectivity.
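
The striping pattern can be written down as a one-line index calculation.
The sketch below is only an illustration of the numbering: the function name
`striped_chain` and its parameters are hypothetical, and it does not touch
the real ReservationStations class. It assumes RS indices start at 0 and that
each extra pass through the pipeline moves the data on by a stride equal to
the number of CompUnits.

```python
def striped_chain(n_units, n_passes):
    """Per CompUnit, the ordered list of RS indices its data visits.

    Hypothetical helper: RS index = unit + pass_number * n_units.
    """
    return [[unit + p * n_units for p in range(n_passes)]
            for unit in range(n_units)]

if __name__ == "__main__":
    # 8 CompUnits, 4 passes each: 32 ReservationStations in total
    for unit, chain in enumerate(striped_chain(8, 4)):
        hops = " -> ".join(f"RS {rs}" for rs in chain)
        print(f"src-CompUnit {unit} -> {hops} -> dest-CompUnit {unit}")
```

Every RS index appears in exactly one chain, so no two CompUnits ever share
a ReservationStation.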

i can do a demo / example of this, based on one of the existing
ReservationStation unit tests, very quickly.


> I think I should probably also add support for SIMD partitioning (also
> statically disable-able), since it won't make it much more complex and would
> easily do 2x 32-bit div every 2 cycles or 4x 16-bit div every cycle or 8x
> 8-bit div every cycle. The SIMD functionality would be more complex to
> schedule, so would be statically disabled for the oct tapeout, though we
> would still want the ability to take half as many trips through the pipeline
> for 32-bit div.

we're incredibly short on time.  if it's not taking a few days to write,
we really have to rule it out.

that October 2020 deadline is a *hard* immovable deadline.


> If there isn't enough time, I can write a simple radix 4 or 8 fsm-based div
> in probably a day or two.

can you start on that straight away: you *should* - hypothetically -
be able to use the DivPipe combinatorial blocks without modifications,
by simply setting up a register "thing" that captures the ospec() and
feeds it to the ispec() on the next clock cycle.

this was the basis of the FSM support i added (except it turned out to
be too awkward to keep actively maintained).


ok, apart from passing in the shift amounts as actual Signals rather than
Consts.

which really might justify a from-scratch implementation, yeah.

this is very good, and really short:
https://github.com/antonblanchard/microwatt/blob/master/divider.vhdl

the FSM should easily fit into 100 lines of code
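
To give a feel for how small such an FSM is, here is a behavioural sketch in
plain Python. It is an assumption-laden illustration, not a transcription of
the microwatt divider and not nmigen code: it is unsigned only, it is radix-2
(one quotient bit per clock) rather than the radix 4 or 8 proposed above, and
the names `DivFSM`, `start`, `clock` and `divide` are made up for this
example.

```python
class DivFSM:
    """Hypothetical unsigned shift-subtract divider, one bit per clock."""

    def __init__(self, width=64):
        self.width = width
        self.mask = (1 << width) - 1
        self.busy = False
        self.done = False
        self.quo = 0
        self.rem = 0

    def start(self, dividend, divisor):
        if divisor == 0:
            raise ZeroDivisionError("divide-by-zero is not modelled here")
        self.rem = 0
        self.quo = dividend & self.mask   # dividend shifts out of quo
        self.divisor = divisor
        self.count = self.width           # one step per quotient bit
        self.busy = True
        self.done = False

    def clock(self):
        """Advance one clock cycle; does nothing when idle."""
        if not self.busy:
            return
        # shift the {rem, quo} register pair left by one bit
        msb = (self.quo >> (self.width - 1)) & 1
        self.rem = (self.rem << 1) | msb
        self.quo = (self.quo << 1) & self.mask
        # trial subtraction: keep it only if it does not go negative
        if self.rem >= self.divisor:
            self.rem -= self.divisor
            self.quo |= 1
        self.count -= 1
        if self.count == 0:
            self.busy = False
            self.done = True

def divide(dividend, divisor, width=64):
    """Run the FSM to completion and return (quotient, remainder)."""
    fsm = DivFSM(width)
    fsm.start(dividend, divisor)
    while not fsm.done:
        fsm.clock()
    return fsm.quo, fsm.rem

if __name__ == "__main__":
    print(divide(13, 3, width=4))
```

A higher-radix version would retire 2 or 3 quotient bits per clock by doing
several trial subtractions in parallel, at the cost of more gates per step.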


by the time you've got that done i should have an example "striping"
ReservationStation working.

which actually we can use for general "micro-coding" purposes (including MUL). 
i wasn't planning on doing one, but... *sigh*.


