[libre-riscv-dev] gflops

Sun Jul 28 13:55:04 BST 2019

On Sun, Jul 28, 2019, 05:24 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Sun, Jul 28, 2019 at 1:13 PM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> >
> > I was calculating how many fp32 gflops our SoC would get if the div pipe
> > supported simd and was 64-bits wide (needed to process fp64/i64/u64, which
> > I think we should with the pass-through-the-pipeline-twice scheme).
>
>  what's the pipeline length, there?  (in the FPU, not anywhere else).
>
depends on what we pick. I think a reasonable value for a pipeline where
fp32 is once-through is 6 or 7 pipeline stages: ceil(32/3/2) = 6, plus an
optional extra stage as wiggle room for normalizing/denormalizing, with 2
radix-8 stages (3 bits each) per pipeline stage. any more than that and I
doubt we'd hit 800MHz due to gate delay.
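
as a rough sketch of the stage-count estimate above (the function and
parameter names are made up for illustration, and it just follows the
ceil(32/3/2) formula rather than modelling a real divider):

```python
def div_pipe_stages(bits, radix, compute_per_pipe_stage, extra_stages=0):
    """Pipeline stages for one pass: ceil(bits / bits_per_step / steps_per_stage)."""
    # a radix-2^k compute stage retires k bits: radix-8 -> 3, radix-16 -> 4
    bits_per_step = radix.bit_length() - 1
    bits_per_pipe_stage = bits_per_step * compute_per_pipe_stage
    # integer ceiling division, avoids float rounding
    return -(-bits // bits_per_pipe_stage) + extra_stages


# fp32 once-through, 2 radix-8 compute stages per pipeline stage
print(div_pipe_stages(32, radix=8, compute_per_pipe_stage=2))                  # 6
# same, with one extra stage for normalizing/denormalizing
print(div_pipe_stages(32, radix=8, compute_per_pipe_stage=2, extra_stages=1))  # 7
```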

>  bear in mind that the number of reservation stations *has* to be equal
> to or greater than the number of pipeline stages.
>
not actually: with fewer reservation stations the pipeline would just never
be fully utilized.

>  more is far better because more waiting data means branch speculation
> gets a chance to run ahead.

>  so if say the ALU pipeline length is 12 @ 64-bit, we will need at
> least 16 RS's which is beginning to get hair-raisingly large, as it in
> turn means a *32* way 32-bit Dependency Matrix.
>
>  that in turn means a 32 x 128 DM which in turn means a staggering
> quarter of a MILLION gates just on the FU-Regs Dependency Matrix alone
> (unless we go with a custom cell design, which will get it down to 25%
> of that).
>
>
> > if a frsqrt is counted as 2 flops (div + sqrt, like fma is mul + add), then
> > each core would get 12 flops/clock (2*2 for div pipe, 4*2 for mul add
> > pipe), giving 60 gflops(!) at an overclock of 1.25GHz and 38.4gflops at
> > 800MHz
>
>  holy s**t that's a lot.
>
> > fp16 would give 24 flops/clock/core (76.8gflops at 800MHz; 120gflops at
> > 1.25GHz) and fp64 would give 5 flops/clock/core (16gflops at 800MHz;
> > 25gflops at 1.25GHz).
> >
> > I think the gpu may end up with higher performance than initially planned
> > (assuming the memory system keeps up), which is good in my book.
>
>  :)  well, as long as the power consumption is reasonable, it works.
>
> > if we want to save area (which I think will probably not be necessary),

if having 2 compute stages per pipeline stage ends up killing our clock
frequency, we could go with 1 radix-16 stage per pipeline stage (8 or 9
pipeline stages), maybe combined with the option below to reduce the
pipeline length (4 or 5 stages, but half the div pipe throughput for
>= 32-bit).

> > we could shrink the div pipe stage count by doubling the number of times
> > fp32 and fp64 need to go through the pipeline to 2 and 4 times respectively:
> > fp16: 24flops/clock/core -- 76.8gflops at 800MHz
> > fp32: 10flops/clock/core -- 32gflops at 800MHz
> > fp64: 4.5flops/clock/core -- 14.4gflops at 800MHz
>
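
for anyone who wants to check the arithmetic, here is a small sketch that
reproduces the quoted figures. the core count (4) and the 128-bit mul-add
pipe width are not stated in the thread; they are assumptions inferred from
the numbers (12 flops/clock/core giving 38.4gflops at 800MHz, and "4*2 for
mul add pipe"). the function names are made up for illustration.

```python
CORES = 4               # assumption: implied by 12 flops/clock/core -> 38.4 gflops at 0.8 GHz
DIV_PIPE_BITS = 64      # from the thread: 64-bit wide simd div pipe
MULADD_PIPE_BITS = 128  # assumption: 4 fp32 lanes, implied by "4*2 for mul add pipe"


def flops_per_clock(fp_bits, div_passes):
    """Per-core flops/clock; div_passes is how often an op goes through the div pipe."""
    div_lanes = max(1, DIV_PIPE_BITS // fp_bits)
    div = 2 * div_lanes / div_passes            # frsqrt counted as 2 flops
    muladd = 2 * (MULADD_PIPE_BITS // fp_bits)  # fma counted as 2 flops
    return div + muladd


def gflops(fp_bits, div_passes, ghz):
    return flops_per_clock(fp_bits, div_passes) * CORES * ghz


# (fp_bits, div passes in the base scheme, div passes in the area-saving scheme)
for fp_bits, base, small in [(16, 1, 1), (32, 1, 2), (64, 2, 4)]:
    print(f"fp{fp_bits}: {gflops(fp_bits, base, 0.8):.1f} gflops base, "
          f"{gflops(fp_bits, small, 0.8):.1f} gflops area-saving, at 800MHz")
```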

one potential option is to have the div pipe normally use 2 compute stages
per pipeline stage, but with muxes (configured at boot time, or at least
requiring a pipeline flush to switch) that insert pipeline registers between
the compute stages to allow much higher frequencies (maybe 2GHz? not a low
power mode). we would still have the same number of reservation stations,
so the pipeline utilization would never reach 100%, but it seems like a
very simple addition that would eliminate the main culprit for clock rate
limitations.
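
a back-of-the-envelope sketch of that utilization trade-off, assuming each
reservation station holds exactly one in-flight operation until its result
comes out of the pipe (so at most one op per RS is in the pipeline at a
time). the RS count of 7 and the doubled depth of 14 are hypothetical values
picked for illustration, not a design decision:

```python
def max_utilization(reservation_stations, pipeline_depth):
    """Upper bound on the fraction of pipeline slots that can be kept busy."""
    return min(1.0, reservation_stations / pipeline_depth)


def peak_ops_per_ns(reservation_stations, pipeline_depth, ghz):
    """Peak issue rate per lane in ops/ns (numerically the same as G-ops/s)."""
    return max_utilization(reservation_stations, pipeline_depth) * ghz


RS = 7  # hypothetical: one reservation station per stage of a 7-stage pipe

# normal mode: 7 pipeline stages at 800MHz, pipe can be kept full
print(peak_ops_per_ns(RS, 7, 0.8))
# high-frequency mode: extra registers double the depth, same RS count, 2GHz
print(peak_ops_per_ns(RS, 14, 2.0))
```

under those assumptions the deeper configuration tops out at 50% utilization
but still comes out ahead on peak rate, because the clock more than doubles.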

Jacob Lifshay