[libre-riscv-dev] [hw-dev] IEEE754FPU pipeline API and its use

Tue Aug 6 11:10:57 BST 2019

On Tuesday, August 6, 2019 at 10:45:28 AM UTC+1, Jacob Lifshay wrote:
>
> On Tue, Aug 6, 2019 at 2:21 AM lkcl <luke.l... at gmail.com> 
> wrote: 
> >> additionally, we do have a working pipelined dynamically-partitionable
> >> 8x8/16x4/32x2/64x1 SIMD integer multiplier that is intended to fit into the
> >> fp multiplier/mul-adder, it just hasn't been integrated into the fpu yet.
> >>
> >> https://salsa.debian.org/Kazan-team/simple-barrel-processor/blob/master/src/multiply.py
> >
> > ah, missed that link: thanks jacob.  do you happen to recall the paper
> > that contained the algorithm?
> I didn't base the SIMD integer multiplier off of a particular paper or 
> design, it's basically a wallace tree multiplier with adjustments to 
> share the carry-save adder tree between the different partitions. 
>

cool!
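
for anyone following along, the partitioning idea can be sketched behaviourally in plain Python. this is not the code in multiply.py and it does not model the wallace tree itself: it is a hypothetical illustration (the name partitioned_mul, the byte-granular gating and the output layout are all assumptions) that forms each 8x8 partial product once and drops the terms that cross a lane boundary, which is one plausible reading of "sharing the adder tree between partitions".

```python
def partitioned_mul(a, b, lane_bits, total_bits=64):
    """Behavioural model of a dynamically partitioned multiplier.

    The inputs are total_bits wide and split into lanes of lane_bits each.
    Every 8x8-bit partial product is considered once; terms whose two bytes
    sit in different lanes are gated off, so the lanes never mix.

    Returns a 2*total_bits-wide integer in which lane k's full
    (2*lane_bits-wide) product sits at bit offset 2*k*lane_bits.
    """
    if lane_bits % 8 or total_bits % lane_bits:
        raise ValueError("lane_bits must be a multiple of 8 dividing total_bits")
    lane_bytes = lane_bits // 8
    acc = 0
    for i in range(total_bits // 8):          # byte index into a
        for j in range(total_bits // 8):      # byte index into b
            if i // lane_bytes != j // lane_bytes:
                continue                      # cross-lane term: suppressed
            term = ((a >> (8 * i)) & 0xFF) * ((b >> (8 * j)) & 0xFF)
            acc += term << (8 * (i + j))      # a real tree would sum these in carry-save form
    return acc


def lane_product(result, lane, lane_bits):
    """Extract lane `lane`'s double-width product from partitioned_mul's result."""
    width = 2 * lane_bits
    return (result >> (lane * width)) & ((1 << width) - 1)
```

with lane_bits=64 the function degenerates to an ordinary 64x64 multiply, and with lane_bits=32 the two halves are multiplied independently, e.g. lane_product(partitioned_mul(a, b, 32), 1, 32) gives the product of the upper 32-bit halves.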

>
> > yeah, this is a key piece of the puzzle that will allow the same ALU to
> > be used for SIMD computation (or for multi-issue execution) *without*
> > requiring completely separate *actual* ALUs, as would be done in a more
> > standard design. i can't recall the exact amount (jacob, you remember the
> > figures?), there's obviously a gate count overhead for deploying this
> > trick, however it's nowhere near as costly as having separate multiplier
> > ALUs.
> >
> > a micro-code-like FSM that propagates alongside the data, through the
> > pipeline, will allow us to do larger-width MULs using smaller-width ALUs.
> > the data goes in once, through the whole pipeline, and feeds *back* to the
> > beginning (keeping the Reservation Station "locked" throughout this
> > process, until the last FSM step releases the final result to the RS).
> I believe that was the design I proposed for the divide/sqrt/rsqrt 
> pipeline. the multiply/muladd pipeline should be able to do everything 
> in a single pass through the pipeline. 
>

even better.
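
to make the loop-back scheme concrete, here is a small cycle-level toy model in plain Python. everything in it is an assumption made for illustration: the function name run_loopback, the fixed stage count, the "passes left" counter standing in for the FSM state, and the rule that a looped-back item takes priority over a newly issued one. it is a sketch of the idea only, not the FPU code.

```python
from collections import deque


def run_loopback(jobs, stages=3, max_cycles=100):
    """Toy model of a pipeline whose output can feed back to its input.

    jobs maps a Reservation Station id to the number of passes that
    operation needs. Returns [(cycle, rs_id), ...] in completion order.
    """
    pipe = deque([None] * stages)    # pipe[0] is the first stage
    waiting = deque(sorted(jobs))    # RS ids not yet issued
    locked = set()                   # RS ids with an operation in flight
    done = []
    for cycle in range(max_cycles):
        out = pipe.pop()             # item leaving the last stage this cycle
        feed = None
        if out is not None:
            rs_id, passes_left = out
            if passes_left > 1:
                feed = (rs_id, passes_left - 1)   # loop back, RS stays locked
            else:
                locked.discard(rs_id)             # last step: release to the RS
                done.append((cycle, rs_id))
        if feed is None and waiting:
            # input slot is free, so a new operation may enter
            rs_id = waiting.popleft()
            locked.add(rs_id)
            feed = (rs_id, jobs[rs_id])
        pipe.appendleft(feed)
        if not waiting and not locked:
            break
    return done
```

in this model run_loopback({0: 2, 1: 1}) lets the single-pass operation from RS 1 complete while RS 0 is still on its second trip round, which is the point of sharing one pipeline between several Reservation Stations.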

> > the "feed-in/out" (loop back) will need to use those multi-in/out
> > modules, to ensure that other results going through the pipeline (looked
> > after by *other* Reservation Stations that share the same Concurrent
> > Pipelined ALU) are not overwritten / destroyed / lost.
> >
> > this will be how, for example, we will do 64-bit FDIV using only a
> > 32-bit-wide DIV pipeline (so that we don't have to have an absolutely
> > insanely long 64-bit FPDIV pipeline: 24+ stages or something completely
> > mad).  also, FMUL 64-bit could use the same trick, to save on gate count.
> >
> > as this is primarily to target GPU uses, 32-bit FP performance is the
> > highest priority.  16-bit FP performance is secondary (ML, AI), and 64-bit
> > FP performance is the lowest priority, although "nice-to-have".
> >
> > it results in some spectacularly weird performance numbers :) FP32 is 40
> > GFLOPs (GMACs) @ 800mhz, double that for FP16, and a quarter (i think) of
> > that for 64-bit, due to the multiple phases.
> I recall it being 25.6 GFLOPs of fma (12.8G fma/s) and 12.8 GFLOPs of 
> rsqrt (6.4G rsqrt/s -- counting rsqrt as 2 FLOPs)

still completely mad - 4x over our original design target :)   ahh the 
40GFLOPs figure was if we speed-boosted to 1.2ghz, wasn't it...

> (assuming we widen 
> the div pipeline to support 64-bit operations and use it as SIMD 
> 2x32-bit). FP16 is double FP32. FP64 is half FP32 for fma and 1/4 FP32 
> for rsqrt/div/sqrt.
>

thx for clarifying.
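
as a sanity check, the figures above hang together arithmetically. the sketch below assumes jacob's 12.8G fma/s and 6.4G rsqrt/s are quoted at the 800 MHz clock mentioned earlier (that is an inference, not something stated in the thread), and the names in it are made up for illustration.

```python
CLOCK_HZ = 800_000_000        # assumed base clock for the quoted figures
BOOST_HZ = 1_200_000_000      # the speed-boosted clock
FMA_PER_S = 12_800_000_000    # FP32 fma/s as quoted
RSQRT_PER_S = 6_400_000_000   # FP32 rsqrt/s as quoted
FLOPS_PER_OP = 2              # fma and rsqrt are both counted as 2 FLOPs


def gflops(ops_per_cycle, clock_hz, scale=1.0):
    """GFLOPs for a given ops/cycle and clock, scaled relative to FP32."""
    return ops_per_cycle * FLOPS_PER_OP * clock_hz * scale / 1e9


# implied FP32 operations retired per clock (derived, not quoted)
fma_per_cycle = FMA_PER_S // CLOCK_HZ        # 16
rsqrt_per_cycle = RSQRT_PER_S // CLOCK_HZ    # 8

table = {
    "fma":   {"FP16": gflops(fma_per_cycle, CLOCK_HZ, 2.0),
              "FP32": gflops(fma_per_cycle, CLOCK_HZ),
              "FP64": gflops(fma_per_cycle, CLOCK_HZ, 0.5)},
    "rsqrt": {"FP16": gflops(rsqrt_per_cycle, CLOCK_HZ, 2.0),
              "FP32": gflops(rsqrt_per_cycle, CLOCK_HZ),
              "FP64": gflops(rsqrt_per_cycle, CLOCK_HZ, 0.25)},
}

boosted_fp32_fma = gflops(fma_per_cycle, BOOST_HZ)   # roughly the "40 GFLOPs" figure

if __name__ == "__main__":
    for op, row in table.items():
        print(op, row)
    print("FP32 fma at 1.2 GHz:", boosted_fp32_fma)
```

under those assumptions the quoted rates work out to 16 fma or 8 rsqrt per clock at FP32, and the same 16 fma per clock at 1.2 GHz comes to 38.4 GFLOPs, which matches the remembered "40 GFLOPs" number.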

l.