[libre-riscv-dev] [Bug 99] IEEE754 *pipelined* FPDIV unit needed
bugzilla-daemon at libre-riscv.org
Sat Jun 29 07:10:58 BST 2019
http://bugs.libre-riscv.org/show_bug.cgi?id=99
--- Comment #19 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #15)
> > With 128 registers and likely something around 30 FUs we could be looking at
> > a quarter million gates just for the Dependency Matrices.
> with 128 registers, we really should binary encode the register portion of
> the dependency matrix,
deep breath: we can't. moving to a binary encoding would destroy the
expression of transitive register dependency relationships, which is
absolutely critical for simple multi-issue.
mitch alsup explained that in a multi-issue system, all that is needed is
to *accumulate* the read/write dependency relationships from the previous
instruction.
this *requires* a unary register format, because rather than just one
bit being set, *multiple* bits are now set.
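as a rough illustration of the accumulation described above (a minimal
python sketch, not the actual Dependency Matrix logic: `unary`,
`accumulate_deps` and the example batch are hypothetical names and values
made up for this example), each instruction in a multi-issue group carries
the OR of the register vectors of the instructions before it, so its
vector ends up with several bits set at once:

```python
NUM_REGS = 128  # register count taken from the discussion above


def unary(reg):
    """Unary (one-hot) encoding of a register number: exactly one bit set."""
    assert 0 <= reg < NUM_REGS
    return 1 << reg


def accumulate_deps(instrs):
    """Hypothetical helper: for a group of instructions issued together,
    give each one the OR of the read/write vectors of all earlier
    instructions in the group, plus its own registers.

    instrs is a list of (source_regs, dest_regs) tuples.
    Returns a list of (read_vector, write_vector) integers.
    """
    rd_acc = 0
    wr_acc = 0
    out = []
    for srcs, dests in instrs:
        for r in srcs:
            rd_acc |= unary(r)
        for r in dests:
            wr_acc |= unary(r)
        out.append((rd_acc, wr_acc))
    return out


# made-up two-instruction batch: r3 <- r1 op r2, then r5 <- r3 op r4
batch = [((1, 2), (3,)), ((3, 4), (5,))]
deps = accumulate_deps(batch)
for rd, wr in deps:
    print(f"read={rd:#010b} write={wr:#010b}")
```

a binary-encoded register field holds one register number, so it has no
way to hold the four-register read vector of the second instruction.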
there is another way: a lookup table that maps between the "real"
registers and the DM. the only problem is that a stall/freeze is needed
on issue if the table is too small.
> >
> > Getting that number down is really critical.
> >
> > So the less stages in FPDIV FP32 the better.
>
> it would probably be 11 or 12 stages, since a 32-bit integer division takes
> ceil(32/3) == 11 stages for radix 8 division.
i cannot emphasise enough how insane it is to have 12 Function Units
dedicated to one task (even if they're capable of doing more than one
job).
do we *really* need 32-bit integer DIV as a high priority? i.e. is 32-bit
DIV used anywhere in the FPU code?
> 32-bit fp operations would be able to early exit 1-2 stages earlier, since
> they only need 27 bits of result (24-bit mantissa + guard/round/sticky)
this is an optimisation.
can we please take sharing of INT/FP pipeline capabilities off the table
for this budget-limited task and move it to the (entirely separate)
proposal, which will have a lot more money (and time) to consider
serious optimisations.
> it might be possible to switch to radix 16 (allowing 8 or 9 stages), but I'd
> be worried about excessive area (more than the full 4x32 multiplier alu)
> and/or excessive per-stage delay, reducing max clock rate.
it's why i suggested two combinatorially-linked (StageChained) radix-8
blocks.
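the stage counts being discussed can be checked with a few lines of
python. this is only a sketch under stated assumptions: radix 8 is taken
to retire 3 result bits per stage and radix 16 to retire 4, and two
combinatorially-chained radix-8 blocks are *assumed* to retire 6 bits per
pipeline stage (that last figure is not stated in the thread).
`div_stages` is a hypothetical helper name.

```python
def div_stages(result_bits, bits_per_stage):
    """Number of pipeline stages needed to produce result_bits when each
    stage retires bits_per_stage bits (integer ceiling division)."""
    return -(-result_bits // bits_per_stage)


RADIX8_BITS = 3    # log2(8)
RADIX16_BITS = 4   # log2(16)
CHAINED_BITS = 6   # assumption: two radix-8 blocks chained in one stage

int32_radix8 = div_stages(32, RADIX8_BITS)     # 32-bit integer divide
fp32_radix8 = div_stages(27, RADIX8_BITS)      # 24-bit mantissa + guard/round/sticky
int32_radix16 = div_stages(32, RADIX16_BITS)
int32_chained = div_stages(32, CHAINED_BITS)

print(int32_radix8, fp32_radix8, int32_radix16, int32_chained)
```

under those assumptions the numbers come out as 11, 9, 8 and 6 stages
respectively, which matches the "11 stages" and "1-2 stages earlier"
figures quoted above.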
> I'm thinking it'll be a good idea to allow the pipeline to compute 2x16-bit,
> 1x16-bit + 2x8-bit, and 4x8-bit operations per clock.
if it's as complex and as time-consuming to develop as the MUL, please
create a separate bug report, mark it as "TODO later", and we can put
it into the "FP optimisation and formal proof" proposal.
it's really important to keep this *real* simple, get a "first version"
done and *schedule* optimisations for later.
planning ahead is fine, as is creating classes that can be morphed or used
later, but not if doing so interferes drastically with getting this done
*quickly*.
> I was also thinking widening the pipeline to 64-bit wide would allow 64-bit
> operations to be computed in 2 passes through the pipeline, 2x32-bit
> operations in 1 pass, 4x16-bit in 1 pass, 8x8-bit in 1 pass, and aligned
> combinations like the multiplier. Widening would double the area, but
> wouldn't take any more pipeline stages. For radix 8, I estimate 10k-15k
> gates for the 64-bit wide version, taking about the same area as the
> 4x32-bit multiplier. For radix 16, I estimate about 20k-30k gates.
if the classes can be parameterised so that something may be
constructed later, under a separate funding proposal, great.