[libre-riscv-dev] Clock Gating (was cache SRAM organisation)

Sat Mar 28 14:08:25 GMT 2020

Luke Kenneth Casson Leighton wrote on Fri 27-03-2020 at 10:59 [+0000]:
> On Fri, Mar 27, 2020 at 10:36 AM Staf Verhaegen <staf at fibraservi.eu> wrote:
> > Yes and no, it is the basic functionality of a pipeline :(
> 
> yes.
> > You have the same latency but can have double the number of operations in flight.
> 
> yes.  hence why it is so important to have, because double the number of operations means that we need double the number of Function Units in the Dependency Matrix in order to keep the entire out-of-order engine occupied.
> also, double the number of operations in flight means that we need double the number of Branch Prediction Units, and much more complex BPUs at that, just to deal with the (now very likely) scenario of having far more overlapping inner loops "in flight".
> all this from just extending the pipeline length(s) from 5 to 10.  so it's not just a "nice-to-have" feature, it's actually really important to keeping the overall size of the chip down.

There is an (IMO better) alternative for what you are doing with your pass-through registers and that is clock gating (wikipedia, allaboutcircuits).
The principle is that you save power by not clocking the parts of the circuit that don't have to do any computing. I think this could be a more general way to enable only the stages in your pipeline that are actually doing computation.
In the above example you would always use a 10-stage pipeline running at 1600MHz, but to mimic the 5-stage pipeline you only submit an operation every other clock cycle and alternately enable the odd and even stages in your pipeline. This way the MUXes are removed from the computation path.
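
A minimal plain-Python sketch of that alternating pattern, not taken from the original mail: it assumes stages are numbered 0 to 9 and that one operation is submitted on every even cycle. The function names are hypothetical. It checks that every stage holding an operation is also enabled, and that half the stages are clocked in any given cycle.

```python
N_STAGES = 10  # the 10-stage pipeline from the example (stage numbering is an assumption)

def alternating_enables(cycle, n_stages=N_STAGES):
    """Enable bits for one cycle: even stages on even cycles, odd stages on odd cycles."""
    return [(k % 2) == (cycle % 2) for k in range(n_stages)]

def occupied_stages(cycle, n_stages=N_STAGES):
    """Stages holding an operation when one op is submitted on every even cycle.

    Stage k holds the op submitted at cycle (cycle - k), which must be
    non-negative and even.
    """
    return [k <= cycle and (cycle - k) % 2 == 0 for k in range(n_stages)]

if __name__ == "__main__":
    for cycle in range(12):
        en = alternating_enables(cycle)
        occ = occupied_stages(cycle)
        # Every stage that holds an operation must be clocked this cycle.
        assert all(e for e, o in zip(en, occ) if o)
        print(cycle, "".join("1" if e else "." for e in en))
```

Once the pipeline has filled, the occupied stages and the enabled stages coincide, so only 5 of the 10 stages are clocked per cycle.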
Using a shift register it could be easily generalized to only enable the stages for which there is an operation going through the pipeline. When an operation is submitted you set the first bit in the shift register to enable the first stage in the pipeline. With each cycle you then shift this bit so the stage that is needed for the execution of that operation is active.
This is a general power optimization, because it means that if you are running a program that only uses integer operations your FPU and GPU will use almost no power.
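
The shift-register idea can be modelled in plain Python before touching any HDL. This is only a behavioural sketch under stated assumptions: the enable word is 10 bits wide, bit 0 belongs to the first stage, and the helper names (`step`, `run`, `active_stage_cycles`) are made up for illustration. Counting the set bits gives a rough feel for how many stage-cycles are actually clocked.

```python
N_STAGES = 10                  # assumed pipeline depth, as in the example
MASK = (1 << N_STAGES) - 1     # keep the enable word at N_STAGES bits

def step(stages_en, newop):
    """Shift the enable bits one stage along and insert newop at bit 0."""
    return ((stages_en << 1) | int(bool(newop))) & MASK

def run(newops):
    """Return the enable word after each cycle for a sequence of newop bits."""
    en = 0
    history = []
    for newop in newops:
        en = step(en, newop)
        history.append(en)
    return history

def active_stage_cycles(history):
    """Total number of (stage, cycle) pairs in which a stage was enabled."""
    return sum(bin(en).count("1") for en in history)

if __name__ == "__main__":
    # One operation followed by idle cycles: a single bit walks down the pipeline.
    hist = run([1] + [0] * 11)
    for cycle, en in enumerate(hist):
        print(cycle, format(en, "010b"))
    print("clocked stage-cycles:", active_stage_cycles(hist), "of", len(hist) * N_STAGES)
```

With a single operation in flight, only one stage is enabled per cycle and the enable word returns to zero once the operation has left the last stage.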

The way to implement it is with EnableInserter. Some untested code showing how I think it can be done:

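	from nmigen import Signal, Cat, EnableInserter

	# One enable bit per pipeline stage.
	stages_en = Signal(10)
	# Stage1, Stage2, ... stand for the pipeline stage modules.
	stage1 = EnableInserter(stages_en[0])(Stage1())
	stage2 = EnableInserter(stages_en[1])(Stage2())
	...

	# Shift the enable bits along one stage per cycle; newop enters at bit 0.
	m.d.sync += stages_en.eq(Cat(newop, stages_en[0:9]))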

That said, I think this feature does not fit in the MVP scope of the October prototype, so that chip should IMO use neither clock gating nor the pass-through register feature from the original discussion. The reason is that implementing it is easier said than done. Several things need to be done:
- You first need a clock gating cell. This is not available in nsxlib and is currently not planned to be implemented. I don't want to commit to something extra for the May test chip tape-out either.
- nmigen/yosys need to properly support clock gating for ASICs. This likely means work in yosys to insert the clock gates from if clauses in the RTL.
- Your P&R tool (e.g. Coriolis) needs to support the clock gates. It means your clock tree synthesis (CTS) needs to support more than just buffers in the clock tree. This is not a simple task and has to be discussed with Jean-Paul & co.

greets,
Staf.