[libre-riscv-dev] [hw-dev] IEEE754FPU pipeline API and its use
lkcl
luke.leighton at gmail.com
Tue Aug 6 11:10:57 BST 2019
On Tuesday, August 6, 2019 at 10:45:28 AM UTC+1, Jacob Lifshay wrote:
>
> On Tue, Aug 6, 2019 at 2:21 AM lkcl <luke.l... at gmail.com> wrote:
> >> additionally, we do have a working pipelined dynamically-partitionable
> >> 8x8/16x4/32x2/64x1 SIMD integer multiplier that is intended to fit into
> >> the fp multiplier/mul-adder, it just hasn't been integrated into the
> >> fpu yet.
> >>
> >> https://salsa.debian.org/Kazan-team/simple-barrel-processor/blob/master/src/multiply.py
> >
> > ah, missed that link: thanks jacob. do you happen to recall the paper
> > that contained the algorithm?
> I didn't base the SIMD integer multiplier on a particular paper or
> design; it's basically a Wallace-tree multiplier with adjustments to
> share the carry-save adder tree between the different partitions.
>
cool!
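
for the archives, a quick pure-python model of what the dynamic
partitioning *means*, semantics only (the function name and the mul-lo
behaviour are mine, for illustration: the actual gate-level version,
with jacob's shared carry-save adder tree, is in multiply.py, linked
above):

def partitioned_mul(a, b, part_wid):
    """lane-by-lane unsigned multiply of two 64-bit operands.

    splits a and b into 64/part_wid lanes of part_wid bits each and
    keeps the low part_wid bits of each product (mul-lo semantics,
    assumed here purely for illustration).
    """
    mask = (1 << part_wid) - 1
    result = 0
    for shift in range(0, 64, part_wid):
        ai = (a >> shift) & mask
        bi = (b >> shift) & mask
        result |= ((ai * bi) & mask) << shift
    return result

# the four dynamic modes: 8-bit x8, 16-bit x4, 32-bit x2, 64-bit x1
for wid in (8, 16, 32, 64):
    print(wid, hex(partitioned_mul(0x0102030405060708,
                                   0x1112131415161718, wid)))

the hardware of course does all four modes with the single shared
adder tree, steered by partition-mask bits, not four separate loops.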
>
> > yeah, this is a key piece of the puzzle that will allow the same ALU to
> > be used for SIMD computation (or for multi-issue execution) *without*
> > requiring completely separate *actual* ALUs, as would be done in a more
> > standard design. i can't recall the exact figures (jacob, do you
> > remember?): there's obviously a gate-count overhead for deploying this
> > trick, however it's nowhere near as costly as having separate
> > multiplier ALUs.
> >
> > a micro-code-like FSM that propagates alongside the data, through the
> > pipeline, will allow us to do larger-width MULs using smaller-width ALUs.
> > the data goes in once, through the whole pipeline, and feeds *back* to
> > the beginning (keeping the Reservation Station "locked" throughout this
> > process, until the last FSM step releases the final result to the RS).
> I believe that was the design I proposed for the divide/sqrt/rsqrt
> pipeline. the multiply/muladd pipeline should be able to do everything
> in a single pass through the pipeline.
>
even better.
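
to illustrate the control-flow, a hedged sketch in plain python (all
names invented, deliberately nothing like the actual pipeline API):
the point being that the FSM state travels *with* the data, and the RS
stays locked until the final phase hands the result back:

from collections import namedtuple

# micro-op travelling through the pipe: FSM state rides with the data
MicroOp = namedtuple("MicroOp", "rs_id phase num_phases data")

def pipe_pass(op):
    """stand-in for one complete pass through the (narrow) pipeline.
    a real pass would compute a partial result; here we just record it."""
    return op._replace(phase=op.phase + 1,
                       data=op.data + ["partial%d" % op.phase])

def run_locked(op):
    # the Reservation Station rs_id stays "locked" for the whole loop
    while op.phase < op.num_phases:
        op = pipe_pass(op)   # result fed *back* to the pipeline input
    return op                # last FSM step releases the result to the RS

print(run_locked(MicroOp(rs_id=3, phase=0, num_phases=2, data=[])))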
> > the "feed-in/out" (loop back) will need to use those multi-in/out
> > modules, to ensure that other results going through the pipeline (looked
> > after by *other* Reservation Stations that share the same Concurrent
> > Pipelined ALU) are not overwritten / destroyed / lost.
> >
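continuing the sketch above: at the input end there has to be an
arbiter that priority-picks between fed-back in-flight ops and
newly-issued ones, and at the output end results get routed back by
rs_id. again illustrative only, not the actual multi-in/out module API:

def arbitrate(feedback_q, new_q):
    """input-end priority pick: in-flight (fed-back) ops always win
    over newly-issued ones, so nothing already in the loop is lost."""
    if feedback_q:
        return feedback_q.pop(0)
    return new_q.pop(0) if new_q else None

def route_result(op, rs_results):
    """output end: a completed result goes back to the RS that owns
    it, identified by the rs_id travelling with the data."""
    rs_results[op["rs_id"]] = op["data"]
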
> > this will be how, for example, we will do 64-bit FDIV using only a
> > 32-bit-wide DIV pipeline (so that we don't have to have an absolutely
> > insanely long 64-bit FPDIV pipeline: 24+ stages or something completely
> > mad). also, 64-bit FMUL could use the same trick, to save on gate count.
> >
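rough numbers, with the key assumption flagged: *if* a subtractive
divider retires 2 mantissa bits per stage (radix-4: an assumption, not
the actual design), the stage counts come out like this:

# back-of-envelope only: bits-per-stage is an assumption, not the design
mantissa = {32: 24, 64: 53}         # mantissa bits incl. hidden bit
bits_per_stage = 2                  # assumed radix-4 divider
for width, m in mantissa.items():
    print("FP%d: ~%d stages" % (width, -(-m // bits_per_stage)))
# FP32: ~12 stages. FP64: ~27 stages - the "24+ ... completely mad".
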
> > as this is primarily to target GPU uses, 32-bit FP performance is the
> > highest priority. 16-bit FP performance is secondary (ML, AI), and
> > 64-bit FP performance is the lowest priority, although "nice-to-have".
> >
> > it results in some spectacularly weird performance numbers :) FP32 is 40
> > GFLOPs (GMACs) @ 800MHz, double that for FP16, and a quarter (i think)
> > of that for 64-bit, due to the multiple phases.
> I recall it being 25.6 GFLOPs of fma (12.8G fma/s) and 12.8 GFLOPs of
> rsqrt (6.4G rsqrt/s -- counting rsqrt as 2 FLOPs)
still completely mad - 4x over our original design target :) ahh, the
40GFLOPs figure was if we speed-boosted to 1.2GHz, wasn't it...
> (assuming we widen
> the div pipeline to support 64-bit operations and use it as SIMD
> 2x32-bit). FP16 is double FP32. FP64 is half FP32 for fma and 1/4 FP32
> for rsqrt/div/sqrt.
>
thx for clarifying.
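
for the archives, a quick sanity-check of those figures. the
16-fma/clock lane count below is *inferred* by dividing jacob's 12.8G
fma/s by the 800MHz clock: neither of us actually stated it.

# lane count (16 fp32 fmas/clock) is inferred from the figures above
clock_hz = 800e6
fmas_per_clock = 16
fma_rate = clock_hz * fmas_per_clock       # 12.8e9 fma/s
print(fma_rate * 2 / 1e9)                  # 25.6 GFLOPs (fma = 2 FLOPs)
print(fma_rate * 2 * 2 / 1e9)              # fp16 (double): 51.2
print(fma_rate * 2 / 2 / 1e9)              # fp64 fma (half): 12.8
print(1.2e9 * fmas_per_clock * 2 / 1e9)    # at 1.2GHz: 38.4 ~ "40"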
l.