[libre-riscv-dev] daily kan-ban update 18may2020

Tue May 19 13:30:23 BST 2020

Luke Kenneth Casson Leighton wrote on Mon 18-05-2020 at 12:45 [+0100]:
> * and had a fascinating conversation thanks to yehowshua and jeremy(also welcome!), which resulted in this(https://libre-soc.org/3d_gpu/architecture/tomasulo_transformation/).

If I understand this correctly, the big architectural difference between extended scoreboarding and Tomasulo is that in the former the register contents are stored in a central register file, while in the latter they are distributed over several 'reservation stations'. To scale to, for example, multi-issue, scoreboarding needs higher-order nRmW register files, whereas Tomasulo needs more reservation stations together with a more complex tracker for the register tagging/aliasing.
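
To make the 'distributed' part concrete, here is a minimal Python sketch of how a Tomasulo-style reservation station holds either an operand value or the tag of the unit that will produce it. It is only an illustration under textbook assumptions: the names (ReservationStation, read_operand, "MUL0", "ADD0") are hypothetical and are not taken from the Libre-SOC code.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ReservationStation:
    """Holds operand copies locally instead of reading a central regfile later."""
    name: str
    vj: Optional[int] = None  # first operand value, once known
    vk: Optional[int] = None  # second operand value, once known
    qj: Optional[str] = None  # tag of the station that will produce vj
    qk: Optional[str] = None  # tag of the station that will produce vk

    def ready(self) -> bool:
        # The operation may start once no operand is waiting on a tag.
        return self.qj is None and self.qk is None

    def capture(self, tag: str, value: int) -> None:
        # Snoop a result broadcast and grab any operand waiting on this tag.
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None


def read_operand(reg: str, regs: Dict[str, int],
                 reg_tag: Dict[str, str]) -> Tuple[Optional[int], Optional[str]]:
    """Return (value, tag); exactly one of the two is None."""
    tag = reg_tag.get(reg)
    if tag is not None:
        return None, tag      # value still in flight: remember the producer
    return regs[reg], None    # value available: copy it into the station


regs = {"r1": 10, "r2": 20, "r3": 0}
reg_tag = {"r3": "MUL0"}      # r3 is waiting for a result from station MUL0

add0 = ReservationStation("ADD0")
add0.vj, add0.qj = read_operand("r1", regs, reg_tag)
add0.vk, add0.qk = read_operand("r3", regs, reg_tag)
ready_before = add0.ready()   # still waiting on MUL0
add0.capture("MUL0", 42)      # MUL0 broadcasts its result
ready_after = add0.ready()
```

The register file is read only once, at issue; after that the station relies on the broadcast, which is why adding stations does not by itself add register file ports.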
So here are my 2 cents.
From a physical implementation point of view, the central high-order nRmW register file and scoreboard do worry me. Higher-order nRmW register files will become power- and area-hungry compared to multiple lower-order reservation stations.
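
A rough way to see the port-count concern is the sketch below. It assumes the common first-order rule of thumb that a multi-ported storage cell grows about linearly in both width and height with the number of ports, so area grows roughly with the square of n+m. That rule and the function name are assumptions for illustration; they are not numbers from this thread or from any specific process.

```python
def regfile_relative_area(bits: int, read_ports: int, write_ports: int) -> int:
    """First-order area estimate in arbitrary units: bits * (n + m)^2.

    Assumption: each extra port adds one wordline and one bitline per cell,
    so the cell grows in both dimensions with the total port count.
    """
    ports = read_ports + write_ports
    return bits * ports ** 2


BITS = 32 * 64  # hypothetical: 32 registers of 64 bits

small = regfile_relative_area(BITS, read_ports=2, write_ports=1)   # 2R1W
medium = regfile_relative_area(BITS, read_ports=4, write_ports=2)  # 4R2W
large = regfile_relative_area(BITS, read_ports=8, write_ports=4)   # 8R4W

# Under this model, doubling the port count quadruples the estimated area
# for the same number of stored bits.
ratio_medium = medium // small
ratio_large = large // small
```

Real register files deviate from this model (banking, port sharing, layout tricks), so it should be read only as an indication of the trend.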
I have seen numbers of a few tens of functional units in your design. I think it will also become a nightmare to connect and route all the inputs and outputs of all the functional units to the central register file and scoreboard. So at first sight, from a physical implementation point of view for smaller nodes, the Tomasulo algorithm seems more scalable than extended scoreboarding. I indicated before that in smaller nodes power consumption and delay are mainly determined by the length of the interconnects and not by the input load of the logic gates themselves; in 180nm it will be closer to fifty/fifty. As Jeremy indicated, this comes on top of the power consumption in the register files and cache, which scales with the total bit count of the block and the nRmW order of the block.
Also, the triviality of a big fan-in NOR or NAND gate may be deceptive: these gates are not feasible as single cells and will be synthesized to trees of NAND/NOR gates. In that respect a high fan-in NOR/NAND can have time/power consumption similar to that of a seemingly more complex if statement. To zeroth order, for a single-output block, delay and power are determined by the number of inputs, independent of the complexity of the RTL/HDL code. To first order one has to account for NAND/NOR logic being more efficient than XOR/XNOR logic, but for bigger trees this difference is less pronounced, as XOR/XNOR trees will be synthesized to more efficient trees using AOI (and-or-invert) cells.
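
The tree argument can be sketched with a few lines of Python. The function below counts how many levels and gates are needed to reduce N inputs when each cell accepts at most a given fan-in. The limit of 4 inputs per cell is an assumption for illustration and not a property of any particular cell library; alternating NAND/NOR polarity between levels is ignored.

```python
from typing import Tuple


def tree_stats(num_inputs: int, max_fan_in: int) -> Tuple[int, int]:
    """Return (levels, gates) for a reduction tree of bounded fan-in cells."""
    if num_inputs < 1 or max_fan_in < 2:
        raise ValueError("need at least 1 input and a fan-in of at least 2")
    levels = 0
    gates = 0
    signals = num_inputs
    while signals > 1:
        full, rem = divmod(signals, max_fan_in)
        # A leftover group of one signal is just a wire, not a gate.
        gates += full + (1 if rem >= 2 else 0)
        signals = full + (1 if rem else 0)
        levels += 1
    return levels, gates


# A "single" 64-input NOR in the RTL becomes three levels of 4-input cells.
wide_nor = tree_stats(64, 4)
odd_case = tree_stats(5, 4)
```

Any single-output function of 64 inputs needs at least the same number of levels with 4-input cells, which is the zeroth-order point: the depth follows from the input count, not from how the RTL is written.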

greets,
Staf.