> > I was calculating how many fp32 gflops our SoC would get if the div pipe
> > supported simd and was 64-bits wide (needed to process fp64/i64/u64, which
> > I think we should with the pass-through-the-pipeline-twice scheme).
>  what's the pipeline length, there?  (in the FPU, not anywhere else).
depends on what we pick. I think a reasonable value for a pipeline where
fp32 is once-through is 6 or 7 pipeline stages (ceil(32/3/2) = 6, plus an
optional extra stage of wiggle room for normalizing/denormalizing), with 2
radix-8 compute stages per pipeline stage -- any more than that and I doubt
we'd hit 800MHz due to gate delay.

>  bear in mind that the number of reservation stations *has* to be equal
> to or greater than the number of pipeline stages.
not actually: with fewer reservation stations than pipeline stages, the
pipeline would just never be fully utilized.
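
a rough way to put a number on that, assuming a deliberately simple model
where each reservation station holds at most one in-flight operation and a
new operation can issue every clock (the function name and the 4-RS example
are made up for illustration, not taken from the design):

```python
def max_utilization(reservation_stations, pipeline_stages):
    """Upper bound on the fraction of pipeline slots that can be occupied."""
    # assumption: at most one in-flight operation per reservation station,
    # so the pipe can never hold more operations than there are stations
    in_flight = min(reservation_stations, pipeline_stages)
    return in_flight / pipeline_stages

# hypothetical example: a 7-stage div pipe fed by only 4 reservation stations
print(max_utilization(4, 7))
# with at least as many reservation stations as stages the bound is 100%
print(max_utilization(7, 7))
```

this is only an upper bound; real utilization also depends on how long a
reservation station stays occupied after its result comes out of the pipe.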

>  more is far better because more waiting data means branch speculation
> gets a chance to run ahead.

>  so if say the ALU pipeline length is 12 @ 64-bit, we will need at
> least 16 RS's which is beginning to get hair-raisingly large, as it in
> turn means a *32* way 32-bit Dependency Matrix.
>  that in turn means a 32 x 128 DM which in turn means a staggering
> quarter of a MILLION gates just on the FU-Regs Dependency Matrix alone
> (unless we go with a custom cell design, which will get it down to 25%
> of that).
> > if a frsqrt is counted as 2 flops (div + sqrt, like fma is mul + add), then
> > each core would get 12 flops/clock (2*2 for div pipe, 4*2 for mul add
> > pipe), giving 60 gflops(!) at an overclock of 1.25GHz and 38.4gflops at
> > 800MHz
>  holy s**t that's a lot.
> > fp16 would give 24 flops/clock/core (76.8gflops at 800MHz; 120gflops at
> > 1.25GHz) and fp64 would give 5 flops/clock/core (16gflops at 800MHz;
> > 25gflops at 1.25GHz).
> >
> > I think the gpu may end up with higher performance than initially planned
> > (assuming the memory system keeps up), which is good in my book.
>  :)  well, as long as the power consumption is reasonable, it works.
> > if we want to save area (which I think will probably not be necessary),

if 2 compute stages per pipeline stage ends up killing our clock
frequency, we could go with 1 radix-16 stage per pipeline stage (8 or 9
stages), possibly combined with the option quoted below to shorten the
pipeline further (4 or 5 stages, but half the div pipe throughput for >=
32-bit)
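
the stage counts for these options all come from the same arithmetic, so
here is a small sketch of it. the function and parameter names are made up
for illustration; it assumes 32 result bits for fp32 (as in the ceil(32/3/2)
figure), 3 bits per radix-8 compute stage, 4 bits per radix-16 compute
stage, and it leaves out the optional extra normalize/denormalize stage:

```python
from math import ceil

def div_pipe_stages(result_bits, bits_per_compute_stage,
                    compute_stages_per_pipe_stage, passes=1):
    """Pipeline stages needed, before any extra normalize/denormalize stage."""
    # each pass through the pipe only has to produce its share of the bits
    bits_per_pass = ceil(result_bits / passes)
    bits_per_pipe_stage = bits_per_compute_stage * compute_stages_per_pipe_stage
    return ceil(bits_per_pass / bits_per_pipe_stage)

# 2 radix-8 compute stages per pipeline stage, fp32 once-through
radix8_x2 = div_pipe_stages(32, 3, 2)                       # 6 (7 with extra stage)
# 1 radix-16 compute stage per pipeline stage, fp32 once-through
radix16_x1 = div_pipe_stages(32, 4, 1)                      # 8 (9 with extra stage)
# same, but fp32 goes through the pipe twice
radix16_x1_two_pass = div_pipe_stages(32, 4, 1, passes=2)   # 4 (5 with extra stage)
```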

> > we could shrink the div pipe stage count by doubling the number of times
> > fp32 and fp64 need to go through the pipeline to 2 and 4 times respectively:
> > fp16: 24flops/clock/core -- 76.8gflops at 800MHz
> > fp32: 10flops/clock/core -- 32gflops at 800MHz
> > fp64: 4.5flops/clock/core -- 14.4gflops at 800MHz
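
for anyone wanting to re-check the flops figures in this thread, a sketch
of the arithmetic. the 64-bit div pipe width is stated above; the 128-bit
mul-add pipe width and the 4 cores are not stated anywhere, they are
inferred from "4*2 for mul add pipe" at fp32 and from 38.4 gflops / (12
flops/clock * 0.8GHz), so treat them as assumptions. names are made up:

```python
DIV_PIPE_BITS = 64       # stated in the thread
MULADD_PIPE_BITS = 128   # assumption, inferred from "4*2 for mul add pipe"
CORES = 4                # assumption, inferred from 38.4 / (12 * 0.8)

def flops_per_clock(elem_bits, div_passes):
    """Per-core flops/clock, counting frsqrt and fma as 2 flops each."""
    # the div pipe is shared across passes, so throughput divides by div_passes
    div = (DIV_PIPE_BITS // elem_bits) * 2 / div_passes
    muladd = (MULADD_PIPE_BITS // elem_bits) * 2
    return div + muladd

def gflops(elem_bits, div_passes, ghz, cores=CORES):
    return flops_per_clock(elem_bits, div_passes) * cores * ghz

# original scheme: fp16 and fp32 once-through, fp64 twice
original = {bits: flops_per_clock(bits, p) for bits, p in [(16, 1), (32, 1), (64, 2)]}
# area-saving scheme: fp32 twice, fp64 four times
reduced = {bits: flops_per_clock(bits, p) for bits, p in [(16, 1), (32, 2), (64, 4)]}
```

with those assumptions `original` comes out as 24 / 12 / 5 flops/clock/core
and `reduced` as 24 / 10 / 4.5, matching the numbers quoted above.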

one potential option is to have the div pipe normally use 2 compute stages
per pipeline stage, but add muxes (configured at boot, or at least
requiring a pipeline flush to switch) that insert pipeline registers
between the compute stages to allow much higher frequencies (maybe 2GHz?
-- not in low power mode). we would still have the same number of
reservation stations,
so the pipeline utilization wouldn't ever reach 100%, but it seems like a
very simple addition that would eliminate the main culprit for clock rate

