[libre-riscv-dev] [Bug 99] IEEE754 *pipelined* FPDIV unit needed

bugzilla-daemon at libre-riscv.org
Fri Jun 28 22:43:48 BST 2019


http://bugs.libre-riscv.org/show_bug.cgi?id=99

--- Comment #15 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #13)
> Just realised that FPDIVStages is a single combinatorial stage, can't do
> that.
> 
> Tomorrow I will do pipeline.py pipe1,2,3 as pipesc, pipe1,2,3,4, pipepost
> 
> The first one pipe1 will need a tiny bit of conversion at the front.
> 
> The last one, pipe4, again conversion at the end.
> 
> All of pipe1 to 6 will need to be radix8 (three bits at a time) and 2
> StageChained radix8 to give 6 bits at a time.
> 
> That gives only 4 stages, and we stand a chance of the pipeline not being
> insanely long.
> 
> 6 stages is still one hell of a lot because it will need 6 FUs at the Matrix
> to keep 100% throughput.
> 
> With 128 registers and likely something around 30 FUs we could be looking at
> a quarter million gates just for the Dependency Matrices.
with 128 registers, we really should binary-encode the register portion of the
dependency matrix: if that takes fewer than 256 inverters (128 AND gates)
per FU, then we save gates, which allows us to have more FUs.
> 
> Getting that number down is really critical.
> 
> So the less stages in FPDIV FP32 the better.

it would probably be 11 or 12 stages, since a 32-bit integer division takes
ceil(32/3) == 11 stages for radix 8 division.
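
To make the stage count concrete, here is a minimal Python sketch of restoring
digit-recurrence division that retires three quotient bits per iteration, with
one loop iteration standing in for one pipeline stage. The name `radix_divide`
and its parameters are hypothetical, it handles unsigned integers only, and it
is a software model rather than the hardware pipeline discussed in this bug.

```python
def radix_divide(dividend, divisor, width=32, bits_per_stage=3):
    """Unsigned restoring division, one radix-2**bits_per_stage digit per stage.

    Returns (quotient, remainder, stages). Hypothetical helper for illustration.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    stages = -(-width // bits_per_stage)   # ceil(width / bits_per_stage)
    total_bits = stages * bits_per_stage   # dividend zero-padded to whole digits
    radix = 1 << bits_per_stage
    quotient = 0
    remainder = 0
    for stage in range(stages):
        # Shift in the next most-significant digit of the dividend.
        shift = total_bits - (stage + 1) * bits_per_stage
        digit_in = (dividend >> shift) & (radix - 1)
        remainder = (remainder << bits_per_stage) | digit_in
        # Pick the largest quotient digit q with q * divisor <= remainder.
        q = 0
        for candidate in range(radix - 1, 0, -1):
            if candidate * divisor <= remainder:
                q = candidate
                break
        remainder -= q * divisor
        quotient = (quotient << bits_per_stage) | q
    return quotient, remainder, stages


q, r, n = radix_divide(0xFFFFFFFF, 7)
print(q == 0xFFFFFFFF // 7, r == 0xFFFFFFFF % 7, n)
```

With `width=32` and `bits_per_stage=3` the loop runs 11 times, matching
ceil(32/3). In hardware the inner search for `q` would be a set of parallel
comparisons, which is where the per-stage area and delay come from.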

32-bit FP operations could exit 1-2 stages early, since they need only
27 bits of result (24-bit mantissa + guard/round/sticky).

it might be possible to switch to radix 16 (allowing 8 or 9 stages), but I'd be
worried about excessive area (more than the full 4x32-bit multiplier ALU) and/or
excessive per-stage delay, reducing the max clock rate.
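
The stage counts above follow from dividing the number of result bits by the
bits retired per stage. A small Python sketch of that arithmetic, assuming one
radix digit per stage and no extra conversion or rounding stages (the function
name `stage_count` is hypothetical):

```python
def stage_count(result_bits, radix):
    """Stages needed when each stage retires log2(radix) result bits."""
    bits_per_stage = radix.bit_length() - 1        # exact log2 for powers of two
    return -(-result_bits // bits_per_stage)       # integer ceil


for radix in (8, 16):
    int32 = stage_count(32, radix)   # full 32-bit integer quotient
    fp32 = stage_count(27, radix)    # 24-bit mantissa + guard/round/sticky
    print(f"radix {radix}: int32 {int32} stages, fp32 {fp32} stages")
# radix 8: int32 11 stages, fp32 9 stages
# radix 16: int32 8 stages, fp32 7 stages
```

Under these assumptions FP32 finishes 2 stages before a 32-bit integer divide
at radix 8, and only 1 stage before it at radix 16.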

I'm thinking it'll be a good idea to allow the pipeline to compute 2x16-bit,
1x16-bit + 2x8-bit, and 4x8-bit operations per clock.
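
As a rough illustration of which lane combinations that implies, the Python
sketch below enumerates the ways to split a 32-bit word into 8/16/32-bit lanes.
It assumes each lane must start at a multiple of its own size, which is my
reading of "aligned"; the helper name `aligned_partitions` is hypothetical.

```python
def aligned_partitions(total_bits=32, sizes=(8, 16, 32)):
    """List lane layouts where every lane starts at a multiple of its size."""
    results = []

    def extend(offset, lanes):
        if offset == total_bits:
            results.append(tuple(lanes))
            return
        for size in sizes:
            # Assumed alignment rule: a lane of `size` bits starts at k * size.
            if offset % size == 0 and offset + size <= total_bits:
                extend(offset + size, lanes + [size])

    extend(0, [])
    return results


for layout in aligned_partitions():
    print(layout)
```

This yields five layouts: 1x32, 2x16, 4x8, and 1x16 + 2x8 with the 16-bit lane
in either half. A layout such as 8, 16, 8 is rejected because the 16-bit lane
would be misaligned.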

I was also thinking that widening the pipeline to 64 bits would allow 64-bit
operations to be computed in 2 passes through the pipeline, 2x32-bit operations
in 1 pass, 4x16-bit in 1 pass, 8x8-bit in 1 pass, and aligned combinations, as in
the multiplier. Widening would double the area, but wouldn't take any more
pipeline stages. For radix 8, I estimate 10k-15k gates for the 64-bit wide
version, taking about the same area as the 4x32-bit multiplier. For radix 16, I
estimate about 20k-30k gates.
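
A quick Python sketch of the pass counts for the widened pipeline. It assumes
the pipeline keeps 11 radix-8 stages (enough for a 32-bit result) and that a
longer operation simply recirculates; the function name `schedule` and that
recirculation model are my assumptions, not a settled design.

```python
def schedule(op_bits, pipeline_width=64, stages=11, bits_per_stage=3):
    """Return (passes, parallel_ops) for one operand width.

    passes: trips through the pipeline needed to produce op_bits of result.
    parallel_ops: operations of this width that fit side by side.
    """
    stages_needed = -(-op_bits // bits_per_stage)   # integer ceil
    passes = -(-stages_needed // stages)            # integer ceil
    parallel_ops = pipeline_width // op_bits
    return passes, parallel_ops


for bits in (64, 32, 16, 8):
    passes, ops = schedule(bits)
    print(f"{ops}x{bits}-bit in {passes} pass(es)")
# 1x64-bit in 2 pass(es)
# 2x32-bit in 1 pass(es)
# 4x16-bit in 1 pass(es)
# 8x8-bit in 1 pass(es)
```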
