programmerjake at gmail.com
Sun Jul 28 13:55:04 BST 2019
On Sun, Jul 28, 2019, 05:24 Luke Kenneth Casson Leighton <lkcl at lkcl.net> wrote:
> crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
> On Sun, Jul 28, 2019 at 1:13 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
> > I was calculating how many fp32 gflops our SoC would get if the div pipe
> > supported simd and was 64-bits wide (needed to process fp64/i64/u64,
> > which I think we should do with the pass-through-the-pipeline-twice scheme).
> what's the pipeline length, there? (in the FPU, not anywhere else).
depends on what we pick; I think a reasonable value for the pipeline where
fp32 is once-through is 6 or 7 pipeline stages (ceil(32/3/2), plus an extra
stage for normalizing/denormalizing wiggle room) where we have 2 radix-8
stages per pipeline stage -- any more than that and I doubt we'd hit 800MHz
due to the extra logic depth per pipeline stage.
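the stage-count arithmetic above can be sketched as a tiny helper (a
back-of-envelope model only; the function name and parameters are
illustrative, not from the actual codebase):

```python
import math

def div_pipe_stages(mantissa_bits, bits_per_compute_stage,
                    compute_stages_per_pipe_stage, extra=1):
    """Pipeline stages for a once-through SRT-style divider.

    bits_per_compute_stage: 3 for radix-8, 4 for radix-16.
    compute_stages_per_pipe_stage: how many compute sub-stages are
    chained combinationally between pipeline registers.
    extra: wiggle-room stage(s) for normalizing/denormalizing.
    """
    return math.ceil(mantissa_bits / bits_per_compute_stage
                     / compute_stages_per_pipe_stage) + extra

# fp32 once-through, 2 radix-8 sub-stages per pipeline stage:
div_pipe_stages(32, 3, 2)            # -> 7 (6 without the extra stage)
# the radix-16, 1-sub-stage option mentioned below:
div_pipe_stages(32, 4, 1)            # -> 9 (8 without the extra stage)
```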
> bear in mind that the number of reservation stations *has* to be equal
> to or greater than the number of pipeline stages.
not actually -- the pipeline would just never be fully utilized with fewer.
> more is far better because more waiting data means branch speculation
> gets a chance to run ahead.
> so if say the ALU pipeline length is 12 @ 64-bit, we will need at
> least 16 RS's which is beginning to get hair-raisingly large, as it in
> turn means a *32* way 32-bit Dependency Matrix.
> that in turn means a 32 x 128 DM which in turn means a staggering
> quarter of a MILLION gates just on the FU-Regs Dependency Matrix alone
> (unless we go with a custom cell design, which will get it down to 25%
> of that).
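the quarter-million figure can be sanity-checked with quick arithmetic
(the gates-per-cell number here is reverse-engineered from the quoted
total, purely for illustration):

```python
# FU-Regs dependency matrix size, as quoted: 32 function units x 128 regs.
FUS = 32
REGS = 128
GATES_PER_CELL = 61          # assumed per-cell cost (latch + routing)

cells = FUS * REGS           # 4096 DM cells
std_cell_gates = cells * GATES_PER_CELL   # ~250,000 gates
custom_cell_gates = std_cell_gates // 4   # custom cell design: ~25% of that
```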
> > if a frsqrt is counted as 2 flops (div + sqrt, like fma is mul + add),
> > each core would get 12 flops/clock (2*2 for div pipe, 4*2 for mul add
> > pipe), giving 60 gflops(!) at a overclock of 1.25GHz and 38.4gflops at
> > 800MHz
> holy s**t that's a lot.
> > fp16 would give 24 flops/clock/core (76.8gflops at 800MHz; 120gflops at
> > 1.25GHz) and fp64 would give 5 flops/clock/core (16gflops at 800MHz;
> > 25gflops at 1.25GHz).
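the quoted flop counts can be reproduced with a small model (a sketch
only: the 2-FMA-pipe/1-div-pipe split, the twice-through fp64 div, and
the 4-core count are assumptions inferred from the figures above):

```python
CORES = 4  # assumed core count implied by the gflops totals

def flops_per_clock(width, div_passes):
    """Per-core flops/clock for one 64-bit-wide SIMD div pipe plus two
    64-bit-wide SIMD mul-add pipes; fma = mul + add = 2 flops, and
    frsqrt = div + sqrt = 2 flops."""
    lanes = 64 // width                  # SIMD lanes in a 64-bit pipe
    fma = lanes * 2 * 2                  # 2 FMA pipes x 2 flops each
    div = lanes * 2 / div_passes         # throughput drops per extra pass
    return fma + div

# once-through fp16/fp32, twice-through fp64:
for width, passes in ((16, 1), (32, 1), (64, 2)):
    f = flops_per_clock(width, passes)
    print(f"fp{width}: {f} flops/clock/core, "
          f"{f * CORES * 0.8} gflops at 800MHz, "
          f"{f * CORES * 1.25} gflops at 1.25GHz")
```

this reproduces the 24/12/5 flops/clock/core and the 76.8/38.4/16
gflops-at-800MHz figures quoted above.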
> > I think the gpu may end up with higher performance than initially planned
> > (assuming the memory system keeps up), which is good in my book.
> :) well, as long as the power consumption is reasonable, it works.
> > if we want to save area (which I think will probably not be necessary),
if the 2-stages-per-pipeline-stage scheme ends up killing our clock
frequency, we could go with 1 radix-16 stage per pipeline stage (8 or 9
stages) and maybe the below option to reduce the pipeline length (4 or 5
stages, but half the div pipe throughput for >= 32-bit)
> > we could shrink the div pipe stage count by doubling the number of times
> > fp32 and fp64 need to go through the pipeline to 2 and 4 times respectively:
> > fp16: 24flops/clock/core -- 76.8gflops at 800MHz
> > fp32: 10flops/clock/core -- 32gflops at 800MHz
> > fp64: 4.5flops/clock/core -- 14.4gflops at 800MHz
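the reduced-length-pipe numbers follow from the same model (again a
sketch: 2 assumed FMA pipes unaffected, only div throughput divided by
the pass count, 4 assumed cores at 800MHz):

```python
def reduced_flops_per_clock(width, div_passes):
    """Per-core flops/clock with the shorter div pipe: fp32 makes 2
    passes and fp64 makes 4, halving/quartering div throughput while
    the two mul-add pipes are unchanged."""
    lanes = 64 // width
    return lanes * 2 * 2 + lanes * 2 / div_passes   # FMA pipes + div pipe

for width, passes in ((16, 1), (32, 2), (64, 4)):
    f = reduced_flops_per_clock(width, passes)
    print(f"fp{width}: {f} flops/clock/core -- "
          f"{f * 4 * 0.8} gflops at 800MHz")
```

this matches the 24 / 10 / 4.5 flops/clock/core table quoted above.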
one potential option is to have the div pipe normally use 2 stages per
pipeline stage, but to have muxes (boot-time configured, or at least
requiring a pipeline flush to switch) that insert pipeline registers
between compute stages, allowing much higher frequencies (maybe 2GHz? --
not in low-power mode). we would still have the same number of reservation
stations, so the pipeline utilization would never reach 100%, but it seems
like a very simple addition that would eliminate the main culprit limiting
clock rate.
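the mux idea can be sketched as a behavioral model (illustrative only --
class and method names are made up here, and real hardware would express
this as a muxed register in the HDL, not Python):

```python
class BypassableStage:
    """Two compute sub-stages with an optional register muxed between.

    insert_reg=False: both sub-stages are combinational within one clock
    (short pipeline, lower max frequency).
    insert_reg=True: a register is muxed in between them, halving the
    per-stage logic depth (one extra cycle of latency, higher max
    frequency). The flag is boot-time static, matching the
    flush-to-switch restriction described above.
    """
    def __init__(self, f1, f2, insert_reg):
        self.f1, self.f2 = f1, f2
        self.insert_reg = insert_reg
        self._held = None            # the optional mid-stage register

    def clock(self, value):
        """Advance one clock; returns the stage output, or None while
        the inserted register is still filling."""
        if not self.insert_reg:
            return self.f2(self.f1(value))   # both sub-stages, one cycle
        out = None if self._held is None else self.f2(self._held)
        self._held = self.f1(value)          # register the mid result
        return out
```

with insert_reg=False a value flows through both sub-stages in one
clock; with insert_reg=True the same result appears one clock later.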
More information about the libre-riscv-dev mailing list