[libre-riscv-dev] [hw-dev] IEEE754FPU pipeline API and its use

Tue Aug 6 10:21:03 BST 2019

On Tuesday, August 6, 2019 at 9:29:11 AM UTC+1, Jacob Lifshay wrote:
>
> On Tue, Aug 6, 2019, 01:04 lkcl <luke.l... at gmail.com> wrote:
>
>> The features missing which are under development are tininess, rounding 
>> modes, FP flags, FMAC (complicated), FSGN (trivial to do), and the addition 
>> of multi stage multiply so that it is not necessary to have a full 53 bit 
>> multiply unit producing a 108 bit result in a single cycle.
>>
> additionally, we do have a working pipelined dynamically-partitionable 
> 8x8/16x4/32x2/64x1 SIMD integer multiplier that is intended to fit into the 
> fp multiplier/mul-adder, it just hasn't been integrated into the fpu yet.
>
> https://salsa.debian.org/Kazan-team/simple-barrel-processor/blob/master/src/multiply.py
>

ah, missed that link: thanks jacob.  do you happen to recall the paper that 
contained the algorithm?

yeah, this is a key piece of the puzzle that will allow the same ALU to be 
used for SIMD computation (or for multi-issue execution) *without* 
requiring completely separate *actual* ALUs, as would be done in a more 
standard design.  there's obviously a gate-count overhead for deploying 
this trick (i can't recall the exact amount: jacob, do you remember the 
figures?), however it's nowhere near as costly as having separate 
multiplier ALUs.
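
to illustrate the sharing idea, here is a purely behavioural sketch in 
plain python (not nmigen, and not necessarily the algorithm used in 
multiply.py).  it assumes the usual approach of building one 8x8 grid of 
byte-wide partial products and gating off the terms that cross a lane 
boundary.  the name partitioned_mul is made up for this example.

```python
def partitioned_mul(a, b, lane_bits):
    """Multiply two 64-bit words as independent lanes of lane_bits each.

    Returns the full-width product of every lane, lowest lane first.
    The same 8x8 grid of byte products serves every partition mode.
    """
    assert lane_bits in (8, 16, 32, 64)
    bytes_per_lane = lane_bits // 8
    a_bytes = [(a >> (8 * i)) & 0xFF for i in range(8)]
    b_bytes = [(b >> (8 * i)) & 0xFF for i in range(8)]
    lanes = [0] * (8 // bytes_per_lane)
    for i in range(8):
        for j in range(8):
            if i // bytes_per_lane != j // bytes_per_lane:
                continue  # cross-lane term: gated off in this mode
            lane = i // bytes_per_lane
            # shift by the byte offsets measured from the start of the lane
            shift = 8 * ((i % bytes_per_lane) + (j % bytes_per_lane))
            lanes[lane] += (a_bytes[i] * b_bytes[j]) << shift
    return lanes
```

in 64x1 mode no term is gated off and the single lane is the ordinary 
64x64 product; in 8x8 mode only the diagonal of the grid survives.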

a micro-code-like FSM that propagates alongside the data, through the 
pipeline, will allow us to do larger-width MULs using smaller-width ALUs.  
the data goes in once, through the whole pipeline, and feeds *back* to the 
beginning (keeping the Reservation Station "locked" throughout this 
process, until the last FSM step releases the final result to the RS).
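
a minimal sketch of the multi-pass idea, again in plain python: a 64x64 
multiply done as four trips through a 32x32 multiplier.  the STEPS table 
stands in for the micro-code-like FSM, and mul32 / mul64_multipass are 
hypothetical names.  it also assumes the running sum travels alongside 
the data, which is one possible arrangement rather than a decided design.

```python
MASK32 = (1 << 32) - 1


def mul32(x, y):
    """Stand-in for the narrow pipeline: one 32x32 -> 64-bit multiply."""
    assert 0 <= x <= MASK32 and 0 <= y <= MASK32
    return x * y


# hypothetical micro-code table, one row per pass through the pipeline:
# (half of a to use, half of b to use, left shift applied to the product)
STEPS = [(0, 0, 0), (1, 0, 32), (0, 1, 32), (1, 1, 64)]


def mul64_multipass(a, b):
    """Return (a * b, number of passes) using only mul32."""
    halves_a = (a & MASK32, a >> 32)
    halves_b = (b & MASK32, b >> 32)
    acc = 0
    passes = 0
    for sel_a, sel_b, shift in STEPS:
        # each iteration models the data going through the pipeline once
        # and being fed back to the start with the updated accumulator
        acc += mul32(halves_a[sel_a], halves_b[sel_b]) << shift
        passes += 1
    return acc, passes
```

the Reservation Station would only see the result after the last row of 
the table has executed.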

the "feed-in/out" (loop back) will need to use those multi-in/out modules, 
to ensure that other results going through the pipeline (looked after by 
*other* Reservation Stations that share the same Concurrent Pipelined ALU) 
are not overwritten / destroyed / lost.
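
a toy cycle-by-cycle model of that loop-back, to show why the data has to 
carry a tag.  everything here is an assumption for illustration: 
run_pipeline is a made-up name, each item is tagged with (rs_id, step), 
and looped-back data is given priority over newly issued data at the 
input so that it is never dropped.

```python
from collections import deque


def run_pipeline(jobs, depth=3):
    """Simulate a shared pipeline with loop-back.

    jobs maps a Reservation Station id to the number of passes it needs.
    Returns a list of (cycle, rs_id) tuples in completion order.
    """
    pipe = deque([None] * depth)   # pipe[0] is the first stage
    waiting = deque(sorted(jobs))  # RS ids that have not issued yet
    done = []
    cycle = 0
    while len(done) < len(jobs):
        out = pipe.pop()           # item leaving the last stage
        feed = None
        if out is not None:
            rs_id, step = out
            if step + 1 == jobs[rs_id]:
                done.append((cycle, rs_id))  # last step: release to the RS
            else:
                feed = (rs_id, step + 1)     # loop back for another pass
        # multi-in: only issue new work when nothing is looping back
        if feed is None and waiting:
            feed = (waiting.popleft(), 0)
        pipe.appendleft(feed)
        cycle += 1
    return done
```

with jobs = {0: 2, 1: 1} and depth=3, RS 1 (one pass) completes before 
RS 0 (two passes) even though RS 0 issued first, and neither result is 
lost.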

this will be how, for example, we will do 64-bit FDIV using only a 
32-bit-wide DIV pipeline (so that we don't have to have an absolutely 
insanely long 64-bit FDIV pipeline: 24+ stages or something completely 
mad).  64-bit FMUL could use the same trick, to save on gate count.

as this is primarily to target GPU uses, 32-bit FP performance is the 
highest priority.  16-bit FP performance is secondary (ML, AI), and 64-bit 
FP performance is the lowest priority, although "nice-to-have".

it results in some spectacularly weird performance numbers :) FP32 is 40 
GFLOPs (GMACs) @ 800MHz, double that for FP16, and a quarter (i think) of 
that for FP64, due to the multiple phases.
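
spelling out the arithmetic behind those figures, taking the numbers 
above at face value (the 64-bit ratio is the "i think" one, so treat it 
as provisional; the variable names are just for this example):

```python
fp32_mflops = 40_000   # 40 GFLOPs, as quoted above
clock_mhz = 800        # 800 MHz

fp32_ops_per_cycle = fp32_mflops // clock_mhz  # operations per clock
fp16_mflops = fp32_mflops * 2                  # "double that for FP16"
fp64_mflops = fp32_mflops // 4                 # "a quarter" (provisional)

print(fp32_ops_per_cycle, fp16_mflops, fp64_mflops)
```

that works out to 50 FP32 operations per clock, 80 GFLOPs for FP16 and 
10 GFLOPs for 64-bit.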

l.