[libre-riscv-dev] [Bug 99] IEEE754 *pipelined* FPDIV unit needed

bugzilla-daemon at libre-riscv.org
Sat Jun 29 07:10:58 BST 2019


http://bugs.libre-riscv.org/show_bug.cgi?id=99

--- Comment #19 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #15)

> > With 128 registers and likely something around 30 FUs we could be looking at
> > a quarter million gates just for the Dependency Matrices.
> with 128 registers, we really should binary encode the register portion of
> the dependency matrix, 

 deep breath: we can't.  the implications of moving to a binary encoding:
 transitive register dependency relationship expression, absolutely critical
 for simple multi-issue, would be destroyed.

 mitch alsup explained that in a multi-issue system, all that is needed is
 to *accumulate* the read/write dependency relationships from the previous
 instruction.

 this *requires* a unary register format because, rather than just one
 bit being set, there are now *multiple* bits set.
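
 a minimal sketch of that accumulation, in python. the function names
 and the three-instruction issue group are made up for illustration,
 and only write registers are tracked here. it is not the actual
 Dependency Matrix code.

```python
NUM_REGS = 128  # register count quoted in the discussion above


def unary(regnum):
    """unary (one-hot) encoding: bit N set means register N."""
    return 1 << regnum


def accumulate(prev_mask, regnums):
    """OR this instruction's registers into the mask inherited from
    the previous instruction in the same issue group."""
    mask = prev_mask
    for r in regnums:
        mask |= unary(r)
    return mask


def regs_in(mask):
    """list the register numbers whose bits are set in a unary mask."""
    return [r for r in range(NUM_REGS) if (mask >> r) & 1]


# hypothetical issue group: write registers of three instructions
write_regs = [[5], [9], [5, 17]]

masks = []
acc = 0
for regs in write_regs:
    acc = accumulate(acc, regs)
    masks.append(acc)
```

 after the second instruction the mask has bits 5 and 9 set at the same
 time. a 7-bit binary-encoded field can name only one of the 128
 registers, so it has no way to hold that accumulated set.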

 there is another way: a lookup table that maps between the "real"
 registers and the DM.  the only problem is that issue has to
 stall/freeze whenever the table is too small to hold a new entry.
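
 roughly what such a table might look like, as a python sketch. the
 class name, the method names and the "return None to mean stall"
 convention are all assumptions for illustration; nothing like this
 has been designed yet.

```python
class RegToDMTable:
    """hypothetical map from real register numbers to a smaller
    number of Dependency Matrix columns."""

    def __init__(self, num_dm_entries):
        self.free = list(range(num_dm_entries))  # unused DM columns
        self.map = {}  # real register number -> DM column

    def allocate(self, regnum):
        """return the DM column for regnum, or None meaning
        'no column free: issue must stall'."""
        if regnum in self.map:
            return self.map[regnum]
        if not self.free:
            return None
        col = self.free.pop(0)
        self.map[regnum] = col
        return col

    def release(self, regnum):
        """free the DM column once nothing depends on regnum."""
        col = self.map.pop(regnum)
        self.free.append(col)


# usage: a deliberately tiny table with only 2 DM columns
table = RegToDMTable(2)
first = table.allocate(100)    # gets column 0
second = table.allocate(7)     # gets column 1
again = table.allocate(100)    # already mapped, same column
stalled = table.allocate(55)   # table full: stall
table.release(100)
retry = table.allocate(55)     # succeeds once a column is freed
```

 the DM shrinks to the number of table entries rather than the number
 of registers, and the cost is the stall case shown by the fourth
 allocate() call.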


> > 
> > Getting that number down is really critical.
> > 
> > So the less stages in FPDIV FP32 the better.
> 
> it would probably be 11 or 12 stages, since a 32-bit integer division takes
> ceil(32/3) == 11 stages for radix 8 division.

 i cannot emphasise enough how insane it is to have 12 Function Units
 dedicated to one task (even if they're capable of doing more than one
 job).

 do we *really* need 32-bit integer DIV as a high priority? i.e. is 32-bit
 DIV used anywhere in the FPU code?

> 32-bit fp operations would be able to early exit 1-2 stages earlier, since
> they only need 27 bits of result (24-bit mantissa + guard/round/sticky)

 this is an optimisation.

 can we please take sharing of INT/FP pipeline capabilities off the table
 for this budget-limited task and move it to the (entirely separate)
 proposal, which will have a lot more money (and time) to consider
 serious optimisations?

> it might be possible to switch to radix 16 (allowing 8 or 9 stages), but I'd
> be worried about excessive area (more than the full 4x32 multiplier alu)
> and/or excessive per-stage delay, reducing max clock rate.

 it's why i suggested two combinatorially-linked (StageChained) radix-8
 blocks.
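
 the stage counts being discussed, as a quick python check. the
 3 and 4 bits-per-stage figures follow from radix 8 and radix 16; the
 6 bits-per-stage figure for two chained radix-8 blocks is my own
 arithmetic, assuming both blocks fit in one clock period.

```python
import math


def stages(result_bits, bits_per_stage):
    """pipeline stages needed when each stage retires a fixed
    number of result bits."""
    return math.ceil(result_bits / bits_per_stage)


RADIX8_BITS = 3           # log2(8)
RADIX16_BITS = 4          # log2(16)
CHAINED_RADIX8_BITS = 6   # two radix-8 blocks in one stage (assumption)

int32_radix8 = stages(32, RADIX8_BITS)
int32_radix16 = stages(32, RADIX16_BITS)
int32_chained = stages(32, CHAINED_RADIX8_BITS)

# fp32 needs 27 result bits: 24-bit mantissa + guard/round/sticky
fp32_radix8 = stages(27, RADIX8_BITS)
fp32_chained = stages(27, CHAINED_RADIX8_BITS)
```

 that gives 11 stages for 32-bit radix 8, 8 for radix 16, and 6 for the
 chained radix-8 pair, with fp32 needing 9 and 5 respectively. the
 chained version pays for the lower stage count with a longer
 combinatorial path per stage.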

> I'm thinking it'll be a good idea to allow the pipeline to compute 2x16-bit,
> 1x16-bit + 2x8-bit, and 4x8-bit operations per clock.

 if it's as complex and as time-consuming as the MUL to develop, please
 create a separate bugreport, mark it as "TODO later", and we can put
 it into the "FP optimisation and formal proof" proposal.

 it's really important to keep this *real* simple, get a "first version"
 done and *schedule* optimisations for later.

 planning ahead is fine, creating classes that can be morphed or used later,
 but not if doing so interferes drastically with getting this done *quickly*.


> I was also thinking widening the pipeline to 64-bit wide would allow 64-bit
> operations to be computed in 2 passes through the pipeline, 2x32-bit
> operations in 1 pass, 4x16-bit in 1 pass, 8x8-bit in 1 pass, and aligned
> combinations like the multiplier. Widening would double the area, but
> wouldn't take any more pipeline stages. For radix 8, I estimate 10k-15k
> gates for the 64-bit wide version, taking about the same area as the
> 4x32-bit multiplier. For radix 16, I estimate about 20k-30k gates.

 if the classes can be parameterised so that something like this can be
 constructed later, under a separate funding proposal, great.

-- 
You are receiving this mail because:
You are on the CC list for the bug.