[libre-riscv-dev] gflops

Sun Jul 28 13:24:05 BST 2019

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Sun, Jul 28, 2019 at 1:13 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> I was calculating how many fp32 gflops our SoC would get if the div pipe
> supported simd and was 64 bits wide (needed to process fp64/i64/u64, which
> I think we should do with the pass-through-the-pipeline-twice scheme).

 what's the pipeline length, there?  (in the FPU, not anywhere else).
bear in mind that the number of reservation stations *has* to be equal
to or greater than the number of pipeline stages.

 more is far better because more waiting data means branch speculation
gets a chance to run ahead.

 so if say the ALU pipeline length is 12 @ 64-bit, we will need at
least 16 RS's which is beginning to get hair-raisingly large, as it in
turn means a *32* way 32-bit Dependency Matrix.

 that in turn means a 32 x 128 DM which in turn means a staggering
quarter of a MILLION gates just on the FU-Regs Dependency Matrix alone
(unless we go with a custom cell design, which will get it down to 25%
of that).
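
 the sizing above can be sketched as follows. the row/column counts, the
quarter-million total and the 25% figure are taken from the paragraph above;
the gates-per-cell number is only back-derived from them and is not a design
figure. the variable names are illustrative.

```python
# sketch: size of the FU-Regs Dependency Matrix from the figures above.
ROWS, COLS = 32, 128          # the 32 x 128 DM quoted above
QUOTED_GATES = 250_000        # "quarter of a MILLION", an approximate figure
CUSTOM_CELL_FRACTION = 0.25   # custom cell design: "down to 25% of that"

cells = ROWS * COLS
# ASSUMPTION: back-derived average, only to show the scale per cell
gates_per_cell = QUOTED_GATES / cells
custom_gates = QUOTED_GATES * CUSTOM_CELL_FRACTION

print(f"cells: {cells}")
print(f"approx gates per cell: {gates_per_cell:.0f}")
print(f"gates with custom cells: {custom_gates:.0f}")
```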

> if a frsqrt is counted as 2 flops (div + sqrt, like fma is mul + add), then
> each core would get 12 flops/clock (2*2 for div pipe, 4*2 for mul add
> pipe), giving 60 gflops(!) at an overclock of 1.25GHz and 38.4gflops at
> 800MHz

 holy s**t that's a lot.
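
 the arithmetic behind those fp32 figures can be sketched as below. the
core count is not stated in this message: 4 is inferred from
60 gflops / (12 flops/clock x 1.25GHz), so treat it as an assumption.
reading "2*2" and "4*2" as lanes x flops-per-op is likewise an
interpretation, and the function names are illustrative only.

```python
# sketch: peak fp32 throughput from the per-pipe figures quoted above.
# ASSUMPTION: 4 cores, inferred from 60 gflops / (12 flops/clock * 1.25 GHz).
CORES = 4

def flops_per_clock(div_lanes=2, muladd_lanes=4, flops_per_op=2):
    # frsqrt counted as div + sqrt, fma counted as mul + add: 2 flops each.
    # ASSUMPTION: "2*2" and "4*2" are read as lanes * flops-per-op.
    return div_lanes * flops_per_op + muladd_lanes * flops_per_op

def peak_gflops(per_core, clock_ghz, cores=CORES):
    # flops/clock/core * cores * clock in GHz gives gflops
    return per_core * cores * clock_ghz

fp32_per_core = flops_per_clock()
for ghz in (0.8, 1.25):
    print(f"fp32 @ {ghz}GHz: {peak_gflops(fp32_per_core, ghz):.1f} gflops")
```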

> fp16 would give 24 flops/clock/core (76.8gflops at 800MHz; 120gflops at
> 1.25GHz) and fp64 would give 5 flops/clock/core (16gflops at 800MHz;
> 25gflops at 1.25GHz).
>
> I think the gpu may end up with higher performance than initially planned
> (assuming the memory system keeps up), which is good in my book.

 :)  well, as long as the power consumption is reasonable, it works.

> if we want to save area (which I think will probably not be necessary), we
> could shrink the div pipe stage count by doubling the number of times fp32
> and fp64 need to go through the pipeline to 2 and 4 times respectively:
> fp16: 24flops/clock/core -- 76.8gflops at 800MHz
> fp32: 10flops/clock/core -- 32gflops at 800MHz
> fp64: 4.5flops/clock/core -- 14.4gflops at 800MHz
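
 a rough comparison of the two div pipe options at 800MHz, as a sketch.
the per-core flops/clock figures are copied from the quoted text; the 4-core
count is an assumption inferred from the quoted totals (e.g. 32 gflops /
(10 flops/clock x 0.8GHz)), and the names are illustrative.

```python
# sketch: full-width div pipe vs. the shrunk variant, at 800MHz.
# per-core flops/clock figures are copied from the quoted message.
# ASSUMPTION: 4 cores (inferred from the quoted totals, not stated).
CORES = 4
CLOCK_GHZ = 0.8

FULL = {"fp16": 24, "fp32": 12, "fp64": 5}
SHRUNK = {"fp16": 24, "fp32": 10, "fp64": 4.5}

def gflops(per_core, clock_ghz=CLOCK_GHZ, cores=CORES):
    return per_core * cores * clock_ghz

def compare(full=FULL, shrunk=SHRUNK):
    # returns (precision, full gflops, shrunk gflops, percent lost) per row
    rows = []
    for prec in full:
        a, b = gflops(full[prec]), gflops(shrunk[prec])
        rows.append((prec, a, b, 100.0 * (a - b) / a))
    return rows

for prec, a, b, loss in compare():
    print(f"{prec}: {a:.1f} -> {b:.1f} gflops ({loss:.1f}% lower)")
```

 on these numbers the shrunk pipe costs nothing for fp16, and the fp32 loss
is larger than the fp64 loss.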