[libre-riscv-dev] [hw-dev] IEEE754FPU pipeline API and its use
lkcl
luke.leighton at gmail.com
Tue Aug 6 10:21:03 BST 2019
On Tuesday, August 6, 2019 at 9:29:11 AM UTC+1, Jacob Lifshay wrote:
>
> On Tue, Aug 6, 2019, 01:04 lkcl <luke.l... at gmail.com> wrote:
>
>> The features missing which are under development are tininess, rounding
>> modes, FP flags, FMAC (complicated), FSGN (trivial to do), and the addition
>> of multi stage multiply so that it is not necessary to have a full 53 bit
>> multiply unit producing a 106 bit result in a single cycle.
>>
> additionally, we do have a working pipelined dynamically-partitionable
> 8x8/16x4/32x2/64x1 SIMD integer multiplier that is intended to fit into the
> fp multiplier/mul-adder, it just hasn't been integrated into the fpu yet.
>
> https://salsa.debian.org/Kazan-team/simple-barrel-processor/blob/master/src/multiply.py
>
ah, missed that link: thanks jacob. do you happen to recall the paper that
contained the algorithm?
yeah, this is a key piece of the puzzle that will allow the same ALU to be
used for SIMD computation (or for multi-issue execution) *without*
requiring completely separate *actual* ALUs, as would be done in a more
standard design. there's obviously a gate count overhead for deploying
this trick (i can't recall the exact amount: jacob, do you remember the
figures?), however it's nowhere near as costly as having separate
multiplier ALUs.
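to make the partitioning idea concrete, here is a minimal behavioural sketch in python. it only models *what* a partitioned multiplier computes per lane, not the hardware algorithm in multiply.py. the function name, the 64-bit word size and the choice to keep only the low half of each lane product are assumptions made for illustration.

```python
def partitioned_mul(a, b, lane_bits):
    """behavioural model only (hypothetical helper, not the multiply.py API).

    treats a and b as 64-bit words split into lanes of lane_bits each and
    multiplies lane by lane, keeping the low lane_bits of every product.
    """
    assert lane_bits in (8, 16, 32, 64)
    mask = (1 << lane_bits) - 1
    out = 0
    for lane in range(64 // lane_bits):
        shift = lane * lane_bits
        prod = ((a >> shift) & mask) * ((b >> shift) & mask)
        out |= (prod & mask) << shift  # assumption: low half of product only
    return out
```

the same function covers all four modes (8-bit x 8 lanes, 16-bit x 4, 32-bit x 2, 64-bit x 1) simply by changing lane_bits, which is the point of sharing one datapath.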
a micro-code-like FSM that propagates alongside the data, through the
pipeline, will allow us to do larger-width MULs using smaller-width ALUs.
the data goes in once, through the whole pipeline, and feeds *back* to the
beginning (keeping the Reservation Station "locked" throughout this
process, until the last FSM step releases the final result to the RS).
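a rough sketch of the multi-pass idea, in python rather than HDL: a 64x64 multiply built from four passes through a 32x32 multiplier, with a step counter travelling alongside the data. the names (InFlight, STEPS, pipeline_pass) and the four-partial-product schedule are assumptions for illustration, not the actual design.

```python
from dataclasses import dataclass

MASK32 = (1 << 32) - 1

def mul32(x, y):
    """stand-in for the narrow pipeline: 32x32 -> 64-bit product."""
    assert 0 <= x <= MASK32 and 0 <= y <= MASK32
    return x * y

# hypothetical "micro-code" table: which 32-bit half of a and of b to
# multiply on each pass (0 = low half, 1 = high half)
STEPS = [(0, 0), (0, 1), (1, 0), (1, 1)]

@dataclass
class InFlight:
    a: int          # 64-bit operand
    b: int          # 64-bit operand
    step: int = 0   # FSM state that propagates alongside the data
    acc: int = 0    # partial result accumulated so far

def pipeline_pass(op):
    """one trip through the narrow pipeline; returns the updated state."""
    ia, ib = STEPS[op.step]
    pa = (op.a >> (32 * ia)) & MASK32
    pb = (op.b >> (32 * ib)) & MASK32
    acc = op.acc + (mul32(pa, pb) << (32 * (ia + ib)))
    return InFlight(op.a, op.b, op.step + 1, acc)

def mul64(a, b):
    op = InFlight(a, b)
    while op.step < len(STEPS):   # RS stays "locked" while this loops
        op = pipeline_pass(op)    # output feeds back to the beginning
    return op.acc                 # last FSM step releases the result
```

for example, mul64(x, y) equals x * y for any pair of unsigned 64-bit values, using only the 32-bit mul32 as the multiplier.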
the "feed-in/out" (loop back) will need to use those multi-in/out modules,
to ensure that other results going through the pipeline (looked after by
*other* Reservation Stations that share the same Concurrent Pipelined ALU)
are not overwritten / destroyed / lost.
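the loop-back arbitration can be sketched as a tiny cycle-by-cycle simulation. this is an assumed model, not the real multi-in/out modules: every item carries the id of its Reservation Station, looped-back items take priority over newly issued ones at the pipeline input, and the output is routed by id. the "work" done per pass is just +1, as a placeholder.

```python
from collections import deque

def run_shared_pipeline(jobs, depth=3):
    """jobs: {rs_id: (value, passes)} with passes >= 1.

    hypothetical model: each pass adds 1 to value as a stand-in for real
    work. returns {rs_id: final value} once every RS has its result.
    """
    pending = deque(jobs.items())      # RSs waiting to issue
    stages = deque([None] * depth)     # the pipeline registers
    results = {}
    while len(results) < len(jobs):
        out = stages.pop()             # item leaving the last stage
        feed = None
        if out is not None:
            rs_id, value, left = out
            if left == 0:
                results[rs_id] = value        # final step: release to its RS
            else:
                feed = (rs_id, value, left)   # loop back, takes priority
        if feed is None and pending:          # input free: accept a new issue
            rs_id, (value, passes) = pending.popleft()
            feed = (rs_id, value, passes)
        if feed is not None:
            rs_id, value, left = feed
            feed = (rs_id, value + 1, left - 1)   # one pass of "work"
        stages.appendleft(feed)
    return results
```

because a looped-back item always wins the input slot and results are routed by rs_id, a single-pass job and a multi-pass job can share the pipeline without either result being overwritten.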
this will be how, for example, we will do 64-bit FDIV using only a
32-bit-wide DIV pipeline (so that we don't have to have an absolutely
insanely long 64-bit FDIV pipeline: 24+ stages or something completely
mad). also, 64-bit FMUL could use the same trick, to save on gate count.
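one way to picture the two-pass divide, as a sketch under stated assumptions: "32-bit-wide" is read here as "produces 32 quotient bits per pass", and the divide itself is plain restoring (shift-and-subtract) division on integers. neither is confirmed as the actual design, and the function names are made up for illustration.

```python
def div_pass(rem, quo, divisor, nbits):
    """one trip through a short divide pipeline: nbits restoring steps."""
    for _ in range(nbits):
        rem <<= 1
        quo <<= 1
        if rem >= divisor:
            rem -= divisor
            quo |= 1
    return rem, quo

def frac_div(n, d, total_bits=64, bits_per_pass=32):
    """computes floor((n << total_bits) / d) for 0 <= n < d.

    the (rem, quo) state is what would loop back between passes.
    """
    assert 0 <= n < d
    assert total_bits % bits_per_pass == 0
    rem, quo = n, 0
    for _ in range(total_bits // bits_per_pass):
        rem, quo = div_pass(rem, quo, d, bits_per_pass)
    return quo
```

with the defaults, a 64-bit quotient takes two passes through a pipeline that is only 32 steps long, at the cost of roughly halving throughput for that operation.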
as this is primarily to target GPU uses, 32-bit FP performance is the
highest priority. 16-bit FP performance is secondary (ML, AI), and 64-bit
FP performance is the lowest priority, although "nice-to-have".
it results in some spectacularly weird performance numbers :) FP32 is 40
GFLOPs (GMACs) @ 800MHz, double that for FP16, and a quarter (i think) of
that for FP64, due to the multiple phases.
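spelling the arithmetic out, using only the figures quoted above (the FP64 quarter-rate is the tentative "i think" number, and the function name is just for illustration):

```python
CLOCK_HZ = 800_000_000      # 800MHz, from the figures above
FP32_GFLOPS = 40            # quoted FP32 rate

def estimated_gflops(fp32=FP32_GFLOPS):
    """relative rates as stated: FP16 is double, FP64 a quarter (tentative)."""
    return {"fp16": fp32 * 2, "fp32": fp32, "fp64": fp32 / 4}

# FP32 operations completed per clock cycle implied by those two numbers
fp32_per_clock = FP32_GFLOPS * 1_000_000_000 // CLOCK_HZ
```

so the quoted numbers work out to 80 / 40 / 10 GFLOPs for FP16 / FP32 / FP64, i.e. 50 FP32 operations per clock.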
l.