[libre-riscv-dev] cache SRAM organisation

Fri Mar 27 09:44:30 GMT 2020

On Fri, Mar 27, 2020 at 9:25 AM Staf Verhaegen <staf at fibraservi.eu> wrote:

> My point is that you will have the same performance for the fixed 5-stage pipeline running @ 800MHz

no, it won't: it'll be half the clock speed.  the 5-stage pipeline
doesn't do double the number of computations per clock: it does the
exact same number (one result per clock).  therefore, half the clock
speed means half the number of computations per second, because the
*throughput per clock* is the same.

when you use "cpufreq-set" to change the clock rate, if the clock rate
is halved, the computer is twice as slow.

yes it's confusing :)

you may be thinking that the pipeline is split into *two halves*, each
half processing separate computations.  this is not the case.

> as for the 10-stage pipeline running @ 1600MHz. Why do you want to run @1600MHz ?

because you get double the performance by doing so (with an increase
in power consumption of course).

when bypass latches are closed:

* maximum speed: 800 MHz
* stages: 5
* latency (completion time): 1.25e-9 s times 5 = 6.25e-9 s
* throughput: 800 MOPS

when bypass latches are open:

* maximum speed: 1600 MHz (double - appx)
* stages: 10 (double)
* latency (completion time): 6.25e-10 s times *10* = 6.25e-9 s *EXACTLY THE SAME*
* throughput: 1600 MOPS *DOUBLE THE MOPS*

when bypass latches are open and we run at 800 MHz:

* actual speed: 800 MHz
* stages: 10 (double)
* latency (completion time): 1.25e-9 s times *10* = 1.25e-8 s *DOUBLE THE LATENCY*
* throughput: 800 MOPS *SAME*

so this should show why we want to do this: when running at 800 MHz,
closing the bypass latches gives you better latency (6.25e-9 s rather
than 1.25e-8 s), which in turn means that there is less pressure on us
to have a massive number of Function Units (particularly Branch
Prediction Units), to compensate for huge pipeline lengths.
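
the three cases above can be checked with a few lines of python.  this
is only a sketch: it assumes an ideal pipeline that completes one
result per clock with no stalls, and the function name
pipeline_figures is made up for illustration.

```python
def pipeline_figures(clock_hz, stages):
    """Return (latency in seconds, throughput in operations per second).

    Assumes an ideal pipeline: once full, one result completes per
    clock, with no stalls or flushes.
    """
    period = 1.0 / clock_hz          # time spent in each stage
    latency = period * stages        # time for one operation to complete
    throughput = clock_hz            # one result per clock
    return latency, throughput


# The three configurations discussed above.
configs = {
    "5 stages @ 800 MHz": (800e6, 5),
    "10 stages @ 1600 MHz": (1600e6, 10),
    "10 stages @ 800 MHz": (800e6, 10),
}

for name, (clock_hz, stages) in configs.items():
    latency, throughput = pipeline_figures(clock_hz, stages)
    print(f"{name}: latency {latency * 1e9:.2f} ns, "
          f"throughput {throughput / 1e6:.0f} MOPS")
```

the first two cases have the same completion time but the second has
double the throughput; the third has the same throughput as the first
but takes twice as long to complete each operation.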

> Actually the fixed 5-stage 800MHz capable pipeline will not be able to
> run @1600MHz when converted to configurable 5/10-stage pipeline
> due to the additional delay from the MUXes inserted in the path
> plus the fact that you likely can't split up each stage in two stages with
> each exact the half of the delay.

yes, very much agreed.  we will need to do some careful analysis (when
there is time!  this is very much for "revision 2") right down at the
gate level.
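
to put rough numbers on that point, here is a small sketch.  the
delays are invented purely for illustration (a 55/45 split of an
800 MHz stage and a 0.05 ns MUX); real figures would have to come from
the gate-level analysis.

```python
def max_clock_hz(stage_delays_s, mux_delay_s=0.0):
    """Maximum clock rate, set by the slowest stage plus any MUX delay."""
    critical_path = max(stage_delays_s) + mux_delay_s
    return 1.0 / critical_path


# Hypothetical numbers, not measurements.
full_stage = 1.25e-9                                    # one 800 MHz stage
ideal_halves = [full_stage / 2, full_stage / 2]         # perfect split, no MUX
uneven_halves = [0.55 * full_stage, 0.45 * full_stage]  # imperfect split
mux_delay = 0.05e-9                                     # assumed bypass MUX delay

ideal = max_clock_hz(ideal_halves)
realistic = max_clock_hz(uneven_halves, mux_delay)
print(f"ideal split:          {ideal / 1e6:.0f} MHz")
print(f"uneven split + MUX:   {realistic / 1e6:.0f} MHz")
```

with these made-up numbers the perfect split reaches 1600 MHz, while
the uneven split plus MUX delay lands noticeably below it.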

also, it may turn out that we simply can't run certain stages at that
kind of speed, or other factors which we haven't anticipated.

in the meantime, however, things like the MUL code (which is based on
a Wallace Multiplier and needs converting to Dadda) have been
*specifically* designed as easy-to-connect combinatorial blocks, with
the "joining" part done as nmigen OO classes that can be *dynamically
replaced* by us, as programmers.

and we are currently using that flexibility to combine *multiple
combinatorial blocks* in a chain... *and* to combine those chains
further into pipeline-latched blocks.

so if we find that one particular combinatorial chain is too long to
fit into a single 1600 MHz pipeline phase, then "oh dear, whoops", all
we do is find that code, tweak a couple of parameters in a few files,
and the combinatorial chain is now split into additional pipeline
stages.
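
the idea can be shown in plain python.  this is *not* the nmigen code
from the project: split_into_stages and the block delays are
hypothetical, and the greedy grouping is just one simple way to decide
where the pipeline latches would go.

```python
def split_into_stages(block_delays_s, clock_hz):
    """Greedily group consecutive combinatorial blocks into pipeline
    stages so that each stage's total delay fits in one clock period."""
    period = 1.0 / clock_hz
    stages, current, used = [], [], 0.0
    for delay in block_delays_s:
        if delay > period:
            raise ValueError("a single block is longer than the clock period")
        # Start a new stage (i.e. insert a latch) if this block won't fit.
        if current and used + delay > period:
            stages.append(current)
            current, used = [], 0.0
        current.append(delay)
        used += delay
    if current:
        stages.append(current)
    return stages


# Hypothetical chain of combinatorial block delays, in seconds.
chain = [0.3e-9, 0.25e-9, 0.2e-9, 0.35e-9, 0.1e-9]

print(len(split_into_stages(chain, 800e6)))    # whole chain fits in one stage
print(len(split_into_stages(chain, 1600e6)))   # same chain needs three stages
```

the chain itself is untouched: only the target clock (one parameter)
changes, and the number of pipeline stages follows from it.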

not, "argh we have a massive redesign to do".

it's a new idea: we just have to experiment, see what happens.
honestly, if we hit even 1200 MHz for a first ASIC (in 45 or 28 nm)
that would be absolutely fantastic.

l.