[libre-riscv-dev] cache SRAM organisation

Thu Mar 26 21:37:12 GMT 2020

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Thu, Mar 26, 2020 at 8:18 PM Staf Verhaegen <staf at fibraservi.eu> wrote:
>
> Luke Kenneth Casson Leighton wrote on Thu 26-03-2020 at 13:05 [+0000]:
> > On Thursday, March 26, 2020, Staf Verhaegen <staf at fibraservi.eu> wrote:
> > > I would like to make a separate side remark here. In ASICs, MUXes are relatively expensive gates with respect to delay and power. So if this principle is generally applied over the whole design, it will make it difficult to make a chip that is competitive in power/performance compared to ARM/x86 CPUs.
> >
> >
> > just the ALU pipeline registers.  we felt that the advantage of being able to drop to say 500mhz and halve the number of pipeline stages to say 5, and also be able to ramp up to 1.6ghz and double back up to 10 stages, was worth considering.
>
> What would be the advantage over running at 800Mhz with 5 pipeline stages ?

i assume you mean a fixed 5-stage pipeline.

the problem is, if you *want* to run at 1.6ghz and have complex
pipeline stages, you simply can't: each of the 5 stages is too long,
the gate propagation delay is too large.  the only way to get to 1.6ghz
is to split those 5 stages into 10 smaller stages.
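
a rough way to see that ceiling: the clock period has to cover the
slowest stage's logic delay plus the register overhead (setup time
plus clock-to-q).  the sketch below assumes, purely hypothetically,
5.5 ns of total ALU logic split evenly across the stages and 0.05 ns
of overhead per register; these are illustrative numbers, not
measurements from any real design.

```python
# Hypothetical figures for illustration only, not from any real design.
TOTAL_LOGIC_NS = 5.5    # total combinatorial delay of the ALU datapath
REG_OVERHEAD_NS = 0.05  # assumed setup time + clock-to-q of one pipeline register


def fmax_mhz(num_stages: int) -> float:
    """Highest clock (MHz) at which num_stages equal stages each fit in one cycle."""
    period_ns = TOTAL_LOGIC_NS / num_stages + REG_OVERHEAD_NS
    return 1000.0 / period_ns


for n in (5, 10):
    print(f"{n:2d} stages: f_max ~ {fmax_mhz(n):.0f} MHz")
```

with these assumed numbers, 5 stages top out at roughly 870 MHz while
10 stages reach roughly 1667 MHz.  the register overhead is why
doubling the stage count does not quite double the clock.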

the problem with _that_ is: if you then run those 10 stages at say
800mhz, or even 400mhz or 100mhz (because you are in power-saving
mode), you have *massively* increased the latency for completion of
any given operation.

so even though those 10 stages are so fast (because you are in 14nm)
that, at 100mhz, they complete in under 5% of a 100mhz clock period,
if you have a fixed 10-stage pipeline you are absolutely screwed: you
*have* to pay the latency penalty of all 10 pipeline stages.
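
the penalty is simply stage count times clock period.  a small
sketch, again assuming a hypothetical 5.5 ns of total logic delay,
comparing a fixed 10-stage and a fixed 5-stage pipeline:

```python
# Hypothetical figure for illustration: total ALU logic delay, independent of staging.
TOTAL_LOGIC_NS = 5.5


def latency_ns(num_stages: int, f_mhz: int) -> float:
    """Time for one operation to pass through a fixed pipeline: one clock period per stage."""
    period_ns = 1000.0 / f_mhz
    return num_stages * period_ns


for stages, f_mhz in ((10, 1600), (10, 100), (5, 100)):
    lat = latency_ns(stages, f_mhz)
    print(f"{stages:2d} stages @ {f_mhz:4d} MHz: {lat:6.2f} ns "
          f"({lat / TOTAL_LOGIC_NS:.1f}x the raw logic delay)")
```

at 100mhz the fixed 10-stage pipeline takes 100 ns per operation and
the fixed 5-stage one takes 50 ns, even though (under these
assumptions) the logic itself settles in about 5.5 ns.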

screwed 1: a 5-stage pipeline FORCES you to ONLY be able to run
BELOW (e.g.) 800mhz.

screwed 2: a 10-stage pipeline FORCES you to have massive instruction
completion latency below (e.g.) 800mhz.

solution: give every other pipeline stage's registers a "combinatorial bypass".

un-screwed 1: when the clock is above 800mhz, switch OFF the
combinatorial bypass: the pipeline becomes 10-stage.

un-screwed 2: when the clock is below 800mhz, switch ON the
combinatorial bypass: the extra latency of the 10-stage split
DISAPPEARS, because every pipeline is now only 5 stages deep, not 10.
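
a behavioural sketch of the idea (python rather than HDL, so only the
timing is modelled): each fine-grained stage has an output register,
and when `bypass` is set the register after every even-numbered stage
is made transparent (in hardware, a 2:1 mux per bit selecting the
register's input instead of its output), so stage pairs merge into one
clock cycle.  `PipelineConfig`, `configure_for` and the delay figures
are all hypothetical, chosen so the switch-over lands near 800mhz.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical per-stage logic delays for a 10-stage ALU pipeline (illustrative only).
STAGE_DELAYS_NS = [0.55] * 10
REG_OVERHEAD_NS = 0.05  # assumed setup + clock-to-q per clocked register


@dataclass
class PipelineConfig:
    stage_delays_ns: List[float]
    reg_overhead_ns: float
    bypass: bool  # True: every other stage register is combinatorially bypassed

    def clocked_stages(self) -> List[List[float]]:
        """Group fine-grained stages into the stages that actually take a clock edge."""
        d = self.stage_delays_ns
        if not self.bypass:
            return [[x] for x in d]
        # Registers after stages 0, 2, 4, ... are transparent, so pairs share one cycle.
        return [d[i:i + 2] for i in range(0, len(d), 2)]

    def min_period_ns(self) -> float:
        """Slowest clocked stage plus register overhead sets the shortest legal period."""
        return max(sum(g) for g in self.clocked_stages()) + self.reg_overhead_ns

    def latency_ns(self, f_mhz: int) -> float:
        """Completion latency of one operation: clocked stages times the clock period."""
        period_ns = 1000.0 / f_mhz
        if period_ns < self.min_period_ns():
            raise ValueError(f"{f_mhz} MHz is too fast for this configuration")
        return len(self.clocked_stages()) * period_ns


def configure_for(f_mhz: int) -> PipelineConfig:
    """Prefer the shallow (bypassed) pipeline whenever merged stage pairs still fit the period."""
    period_ns = 1000.0 / f_mhz
    shallow = PipelineConfig(STAGE_DELAYS_NS, REG_OVERHEAD_NS, bypass=True)
    if shallow.min_period_ns() <= period_ns:
        return shallow
    return PipelineConfig(STAGE_DELAYS_NS, REG_OVERHEAD_NS, bypass=False)


for f in (1600, 800, 100):
    cfg = configure_for(f)
    print(f"{f:4d} MHz: bypass={cfg.bypass}, "
          f"{len(cfg.clocked_stages())} stages, latency {cfg.latency_ns(f):.2f} ns")
```

with these numbers, 1600mhz picks the 10-stage mode (6.25 ns), while
800mhz and 100mhz pick the bypassed 5-stage mode (6.25 ns and 50 ns).
the unbypassed pipeline would also run at 100mhz, but at 100 ns per
operation; that difference is what the bypass recovers, paid for by
the per-bit mux on every other register.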

according to some people on comp.arch, the first time this appeared in
commercial processors was in the 1990s, done by IBM.

l.