[libre-riscv-dev] cache SRAM organisation

Staf Verhaegen staf at fibraservi.eu
Fri Mar 27 09:25:24 GMT 2020

Luke Kenneth Casson Leighton wrote on Thu 26-03-2020 at 21:37 [+0000]:
> On Thu, Mar 26, 2020 at 8:18 PM Staf Verhaegen <staf at fibraservi.eu> wrote:
> > Luke Kenneth Casson Leighton wrote on Thu 26-03-2020 at 13:05 [+0000]:
> > > On Thursday, March 26, 2020, Staf Verhaegen <staf at fibraservi.eu> wrote:
> > > > I would like to make a separate side remark here. In ASICs, MUXes are relatively expensive gates with respect to delay and power. So if this principle is generally applied over the whole design, it will make it difficult to make a chip that is competitive in power/performance compared to ARM/x86 CPUs.
> > > 
> > > 
> > > just the ALU pipeline registers.  we felt that the advantage of being able to drop to say 500mhz and halve the number of pipeline stages to say 5, and also be able to ramp up to 1.6ghz and double back up to 10 stages, was worth considering.
> > 
> > What would be the advantage over running at 800MHz with 5 pipeline stages?
> i assume you mean fixed 5-pipeline stages.
> the problem is, if you *want* to run at 1.6ghz and have complex pipeline stages, you simply can't: 5 stages are too long, the gate propagation delay is too large.  the only way to get to 1.6ghz is: split those 5 stages into 10 smaller stages.
> the problem with _that_ is: if you then run those 10 stages at say 800mhz, or say even 400mhz or 100mhz (because you are in power-saving mode), you just *massively* increased the latency for completion of any given operation.
> so even though those 10 stages are so fast (because you are in 14nm) that, at 100mhz, they complete in under 5% of a 100mhz clock period, if you have a fixed 10-stage pipeline you are absolutely screwed, you *have* to have the penalty of the 10-stage pipeline latency.
> screwed 1: 5-stage pipeline FORCES you to ONLY be able to run at BELOW (e.g.) 800mhz.
> screwed 2: 10-stage pipeline FORCES you to have massive instruction completion latency at below (e.g.) 800mhz.
> solution: give every other pipeline stage's registers a "combinatorial bypass".
> un-screwed 1: when speed is above 800mhz, switch off the combinatorial bypass, pipeline becomes 10-stage.
> un-screwed 2: when speed is below 800mhz, switch ON the combinatorial bypass, latency due to slower clock rate DISAPPEARS because all pipelines are now only 5-stage, not 10.
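
As a rough illustration of the latency argument quoted above, the sketch below assumes that an operation needs one clock period per pipeline stage, and that the bypass halves a 10-stage pipeline to 5 stages at or below an 800 MHz threshold. The function names, the behaviour at exactly 800 MHz, and the one-period-per-stage model are assumptions made for illustration; none of them come from the libre-riscv code.

```python
def completion_latency_ns(stages: int, clock_mhz: float) -> float:
    """Time for one operation to traverse the pipeline.

    Assumes one clock period per stage and ignores stalls.
    """
    period_ns = 1000.0 / clock_mhz  # 1 MHz corresponds to a 1000 ns period
    return stages * period_ns


def stages_in_use(clock_mhz: float,
                  threshold_mhz: float = 800.0,
                  full_stages: int = 10) -> int:
    """Number of active stages for a pipeline with a bypass on every other register.

    Assumption: the bypass is switched on at or below the threshold,
    which halves the number of stages.
    """
    if clock_mhz > threshold_mhz:
        return full_stages       # bypass off: all registers active
    return full_stages // 2      # bypass on: every other register skipped


if __name__ == "__main__":
    for mhz in (100.0, 400.0, 800.0, 1600.0):
        fixed = completion_latency_ns(10, mhz)
        configurable = completion_latency_ns(stages_in_use(mhz), mhz)
        print(f"{mhz:6.0f} MHz: fixed 10-stage {fixed:7.2f} ns, "
              f"configurable {configurable:7.2f} ns")
```

Under this model, a fixed 10-stage pipeline at 100 MHz takes 100 ns per operation, while the bypassed 5-stage configuration takes 50 ns at the same clock.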

My point is that you will have the same performance for the fixed 5-stage pipeline running @ 800MHz as for the 10-stage pipeline running @ 1600MHz. Why do you want to run @ 1600MHz?
Actually, the fixed 5-stage 800MHz-capable pipeline will not be able to run @ 1600MHz when converted to a configurable 5/10-stage pipeline, due to the additional delay from the MUXes inserted in the path, plus the fact that you likely can't split each stage into two stages that each have exactly half of the delay.
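
A minimal sketch of that last point, under a deliberately simplified timing model: the clock period in 10-stage mode is set by the larger half of a split stage plus the bypass MUX delay. The function name and the example numbers (a 55/45 split and a 0.05 ns MUX) are hypothetical, and register setup and clock-to-output overheads are ignored.

```python
def max_clock_mhz(stage_delay_ns: float,
                  larger_half_fraction: float = 0.5,
                  mux_delay_ns: float = 0.0) -> float:
    """Highest clock for a stage split in two with a bypass MUX in the path.

    stage_delay_ns:        delay of the original, unsplit stage
    larger_half_fraction:  share of that delay in the slower half (0.5 = perfect split)
    mux_delay_ns:          extra delay added by the bypass MUX (hypothetical value)
    """
    if not 0.5 <= larger_half_fraction <= 1.0:
        raise ValueError("larger_half_fraction must be between 0.5 and 1.0")
    critical_path_ns = stage_delay_ns * larger_half_fraction + mux_delay_ns
    return 1000.0 / critical_path_ns


if __name__ == "__main__":
    stage_ns = 1000.0 / 800.0  # 1.25 ns: a stage that just meets 800 MHz
    print(max_clock_mhz(stage_ns))              # ideal split, no MUX delay
    print(max_clock_mhz(stage_ns, 0.55, 0.05))  # uneven split plus MUX delay
```

Only the ideal case (an exact half split and a zero-delay MUX) reaches 1600 MHz in this model; any imbalance or MUX delay lowers the achievable clock.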
