[libre-riscv-dev] [hw-dev] IEEE754FPU pipeline API and its use

Tue Aug 6 10:45:14 BST 2019

On Tue, Aug 6, 2019 at 2:21 AM lkcl <luke.leighton at gmail.com> wrote:
>
>
>
> On Tuesday, August 6, 2019 at 9:29:11 AM UTC+1, Jacob Lifshay wrote:
>>
>> On Tue, Aug 6, 2019, 01:04 lkcl <luke.l... at gmail.com> wrote:
>>>
>>> The features missing which are under development are tininess, rounding modes, FP flags, FMAC (complicated), FSGN (trivial to do), and the addition of multi stage multiply so that it is not necessary to have a full 53 bit multiply unit producing a 108 bit result in a single cycle.
>>
>> additionally, we do have a working pipelined dynamically-partitionable 8x8/16x4/32x2/64x1 SIMD integer multiplier that is intended to fit into the fp multiplier/mul-adder, it just hasn't been integrated into the fpu yet.
>> https://salsa.debian.org/Kazan-team/simple-barrel-processor/blob/master/src/multiply.py
>
>
> ah, missed that link: thanks jacob.  do you happen to recall the paper that contained the algorithm?
I didn't base the SIMD integer multiplier on a particular paper or
design. It's basically a Wallace tree multiplier, adjusted so that the
carry-save adder tree is shared between the different partitions.
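
As a rough behavioural illustration of sharing one summation between
partitions (this is not the code in multiply.py, and the real design may
split the adder tree differently): the sketch below forms the byte-level
partial products of a 64x64 multiply and drops every term whose two bytes
fall in different lanes. The function name, the byte granularity, and the
unsigned-only behaviour are assumptions made for the example.

```python
def partitioned_mul(a, b, lane_bits, total_bits=64):
    """Behavioural model of a lane-partitioned unsigned multiply.

    Each lane's full-width product (2 * lane_bits wide) lands at bit
    offset 2 * lane_bits * lane_index in the returned integer.
    """
    assert lane_bits % 8 == 0 and total_bits % lane_bits == 0
    nbytes = total_bits // 8
    lane_bytes = lane_bits // 8
    total = 0
    for i in range(nbytes):
        a_i = (a >> (8 * i)) & 0xFF
        for j in range(nbytes):
            # A partial product that crosses a lane boundary is gated off;
            # everything else goes into the same (shared) summation.
            if i // lane_bytes != j // lane_bytes:
                continue
            b_j = (b >> (8 * j)) & 0xFF
            total += (a_i * b_j) << (8 * (i + j))
    return total
```

With lane_bits=64 this reduces to an ordinary 64x64 multiply; with
lane_bits=8 it yields eight independent 8x8 products from the same loop.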

> yeah, this is a key piece of the puzzle that will allow the same ALU to be used for SIMD computation (or for multi-issue execution) *without* requiring completely separate *actual* ALUs, as would be done in a more standard design. i can't recall the exact amount (jacob, you remember the figures?), there's obviously a gate count overhead for deploying this trick, however it's nowhere near as costly as having separate multiplier ALUs.
>
> a micro-code-like FSM that propagates alongside the data, through the pipeline, will allow us to do larger-width MULs using smaller-width ALUs.  the data goes in once, through the whole pipeline, and feeds *back* to the beginning (keeping the Reservation Station "locked" throughout this process, until the last FSM step releases the final result to the RS).
I believe that was the design I proposed for the divide/sqrt/rsqrt
pipeline. The multiply/muladd pipeline should be able to do everything
in a single pass through the pipeline.
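
A minimal sketch of the loop-back idea for the divide case, assuming a
plain unsigned restoring divider in which each pipeline stage retires one
quotient bit and each trip through the pipeline retires bits_per_pass
bits. The function name, the one-bit-per-stage choice, and the integer
(rather than floating-point) operands are all assumptions for the
example, not the proposed hardware.

```python
def divide_multipass(dividend, divisor, width, bits_per_pass):
    """Unsigned restoring division, counted in trips through a pipeline.

    Returns (quotient, remainder, passes). The partial remainder and
    quotient play the role of the state that travels with the data.
    """
    assert divisor != 0 and width % bits_per_pass == 0
    remainder = 0
    quotient = 0
    passes = 0
    bits_left = width
    while bits_left > 0:                  # one iteration = one trip through the pipeline
        for _ in range(bits_per_pass):    # one inner step = one pipeline stage
            bits_left -= 1
            remainder = (remainder << 1) | ((dividend >> bits_left) & 1)
            quotient <<= 1
            if remainder >= divisor:
                remainder -= divisor
                quotient |= 1
        passes += 1                       # feed back to the input if bits remain
    return quotient, remainder, passes
```

For example, divide_multipass(1000, 7, 64, 32) needs two passes, while the
same call with width=32 finishes in one; the result is only released after
the final pass.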

> the "feed-in/out" (loop back) will need to use those multi-in/out modules, to ensure that other results going through the pipeline (looked after by *other* Reservation Stations that share the same Concurrent Pipelined ALU) are not overwritten / destroyed / lost.
>
> this will be how, for example, we will do 64-bit FDIV using only a 32-bit-wide DIV pipeline (so that we don't have to have an absolutely insanely long 64-bit FPDIV pipeline: 24+ stages or something completely mad).  also, FMUL 64-bit could use the same trick, to save on gate count.
>
> as this is primarily to target GPU uses, 32-bit FP performance is the highest priority.  16-bit FP performance is secondary (ML, AI), and 64-bit FP performance is the lowest priority, although "nice-to-have".
>
> it results in some spectacularly weird performance numbers :) FP32 is 40 GFLOPs (GMACs) @ 800MHz, double that for FP16, and a quarter (i think) of that for 64-bit, due to the multiple phases.
I recall it being 25.6 GFLOPs of fma (12.8G fma/s) and 12.8 GFLOPs of
rsqrt (6.4G rsqrt/s, counting rsqrt as 2 FLOPs) (assuming we widen
the div pipeline to support 64-bit operations and use it as SIMD
2x32-bit). FP16 is double the FP32 rate. FP64 is half the FP32 rate for
fma and 1/4 of the FP32 rate for rsqrt/div/sqrt.
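
The arithmetic behind those figures can be checked with a few lines of
Python. The operation rates and scaling factors are the recalled numbers
above; the 800 MHz clock is taken from the quoted message and is an
assumption here, as are the variable names.

```python
FP32_FMA_PER_SEC = 12.8e9     # recalled figure
FP32_RSQRT_PER_SEC = 6.4e9    # recalled figure
FLOPS_PER_FMA = 2             # one multiply plus one add
FLOPS_PER_RSQRT = 2           # counting convention used above
CLOCK_HZ = 800e6              # assumption: clock quoted earlier in the thread

fp32_fma_gflops = FP32_FMA_PER_SEC * FLOPS_PER_FMA / 1e9
fp32_rsqrt_gflops = FP32_RSQRT_PER_SEC * FLOPS_PER_RSQRT / 1e9

# Scaling relative to FP32: FP16 doubles both rates, FP64 halves fma
# and quarters rsqrt/div/sqrt.
gflops = {
    "fp16": {"fma": fp32_fma_gflops * 2, "rsqrt": fp32_rsqrt_gflops * 2},
    "fp32": {"fma": fp32_fma_gflops, "rsqrt": fp32_rsqrt_gflops},
    "fp64": {"fma": fp32_fma_gflops / 2, "rsqrt": fp32_rsqrt_gflops / 4},
}

# FP32 fma operations completed per clock cycle, if the clock assumption holds.
fp32_fma_per_cycle = FP32_FMA_PER_SEC / CLOCK_HZ
```

Under the 800 MHz assumption, 12.8G fma/s works out to 16 FP32 fma
operations per cycle.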

Jacob