[libre-riscv-dev] Tomasulo and Scoreboards

Tue May 19 14:47:36 BST 2020

On Tuesday, May 19, 2020, Staf Verhaegen <staf at fibraservi.eu> wrote:

> Luke Kenneth Casson Leighton schreef op ma 18-05-2020 om 12:45 [+0100]:
> > * and had a fascinating conversation thanks to yehowshua and jeremy (also
> > welcome!), which resulted in this
> > (https://libre-soc.org/3d_gpu/architecture/tomasulo_transformation/).
>
> If I understand this correct the big architectural difference between
> extended scoreboarding and Tomasulo is that in the former the register
> content is stored in a central register file and for the latter it is
> distributed over several 'reservation stations'.

not quite. this is the number one misunderstanding about Scoreboards, due
to the exclusive Academic focus on the Q Tables.  it is similar to
focussing on the ROB and concluding that Tomasulo equals ROB, and therefore
that Tomasulo is incompetent and incomplete, incapable of Out of Order
execution.

the reality is very different: there is actually no functional difference.

* both have central register files
* both have "in flight" distributed "storage" that hold results not yet
committed to the regfile.

the difference is:

* Tomasulo identified in-flight as "being a problem", gave a name to the
concept ("register renaming") and used CAMs in the form of Reservation
Stations to explicitly "house" this distributed in-flight data.

* Seymour Cray and James Thornton simply got on with it, coming up with an
in-flight solution where there *was* no need to give the in-flight data
"names".  they were just "Computation Unit operand latch/registers".

because they did not see it as an actual "problem", there was nothing to
"talk about".  it is only through retrospective analysis that we find that
they solved the "problem".

what i can tell you is:

* those latches in the Computation Unit *are* Reservation Station latches
* a CAM is *NOT* needed
* they *are* 1R1W and DFFs are perfectly adequate to cover them
* centralising them is NOT needed and it would be severely detrimental to
try to do so.
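
as a rough illustration of the points above, here is a minimal Python sketch of a Computation Unit whose operand and result latches are plain local storage.  the class name CompUnit, its methods and the list-based regfile are all hypothetical, made up for this example: this is not the actual libre-soc HDL.

```python
class CompUnit:
    """Toy Computation Unit: operands and result live in local latches."""

    def __init__(self, op):
        self.op = op                   # the function this unit computes
        self.operands = [None, None]   # operand latches (plain DFFs in hardware)
        self.result = None             # result latch, held until write-back

    def latch_operand(self, idx, value):
        # capture one operand from a regfile read port; no CAM lookup needed
        self.operands[idx] = value

    def ready(self):
        return all(v is not None for v in self.operands)

    def execute(self):
        if not self.ready():
            raise RuntimeError("operands not yet latched")
        self.result = self.op(*self.operands)

    def write_back(self, regfile, dest):
        # the in-flight result reaches the central regfile only here
        regfile[dest] = self.result
        self.operands = [None, None]
        self.result = None


regfile = [0] * 32
regfile[1], regfile[2] = 5, 7

cu = CompUnit(lambda a, b: a + b)
cu.latch_operand(0, regfile[1])
cu.latch_operand(1, regfile[2])
cu.execute()
in_flight = cu.result        # held locally, regfile[3] still unchanged
cu.write_back(regfile, 3)
```

between execute() and write_back() the value exists only inside the unit, which is the same role that Reservation Station storage plays in Tomasulo.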

> In order to scale to for example multi-issue you need to go to higher
> order nRmW register files for scoreboarding

this is necessary for Tomasulo as well.  i am referring to the INT Regfile,
not to the (misunderstood) inflight data latches local to RSes /
Computation Units.

>
> and for Tomasulo you increase the number of reservation stations

likewise you increase the number of Computation Units because doing so
increases inflight opportunity.

>
>  together with a more complex tracker of the register tagging/aliasing.

in Tomasulo, yes.  it quickly becomes a nightmare, at the RS CAMs, the CDB,
the ROB - in fact everything to do with the entire Tomasulo algorithm.

for Scoreboards, no.  the exact same logic, by encoding all Function Unit
and Register Numbers in UNARY not binary, is automatically and inherently
multi-issue capable BY DESIGN.

almost zero extra logic is required to make Unary (bitvector) Scoreboards
multi issue.
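
a small Python sketch of why unary encoding helps.  this is a toy model under stated assumptions, not the real Dependency Matrix logic: the function names, the 32-register size and the (dest, srcs) tuple format are invented for illustration.  each register number becomes one bit in a bitvector, so a hazard check is a bitwise AND, and handling several instructions in one batch is just a matter of OR-ing each instruction's write vector into the running total.

```python
NUM_REGS = 32


def unary(regnum):
    """One-hot (unary) encoding of a register number as a bitvector."""
    assert 0 <= regnum < NUM_REGS
    return 1 << regnum


def batch_hazards(pending_writes, batch):
    """Find read-after-write hazards for a batch of instructions.

    pending_writes: bitvector of registers with a write still in flight.
    batch: list of (dest, [src, ...]) tuples in program order.
    Returns (list of per-instruction hazard bitvectors, new pending_writes).
    """
    hazards = []
    for dest, srcs in batch:
        read_vec = 0
        for s in srcs:
            read_vec |= unary(s)
        # any overlap means this instruction must wait on those registers
        hazards.append(read_vec & pending_writes)
        # later instructions in the same batch see this write via the same OR
        pending_writes |= unary(dest)
    return hazards, pending_writes


# r3 already has a write in flight; three instructions arrive together
hazards, pending = batch_hazards(
    unary(3),
    [(5, [1, 2]),    # r5 = f(r1, r2): no hazard
     (6, [5, 3]),    # r6 = f(r5, r3): waits on r5 (same batch) and r3
     (7, [1, 4])],   # r7 = f(r1, r4): no hazard
)
```

the second instruction picks up its dependency on the first with no logic beyond the accumulating OR, which is the sense in which the single-issue and multi-issue cases share the same machinery.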

> So some 2 cents from me.
> From physical implementation point of view the central high order nRmW
> register file and scoreboard does worry me.

this comes from the (very common) misunderstanding.  given that there is no
centralised scoreboard regfile, and given that the inflight data you refer
to (and which all Academic literature misleadingly describes) is
distributed, not centralised, this concern is moot.

we will use standard DFFs to store the in-flight data, as an absolutely
standard "latch/register", at the CompUnit. this has been implemented over
a year ago and has been working for over a year.

> Higher order nRmW register files will become power and area hungry
> compared to multiple lower order reservation stations.
> I have seen numbers of a few tens of functional units in your design. I
> think it will become also a nightmare to connect and route all the input
> and outputs of all the functional units to the central register file and
> scoreboard.

i am going over this now.  the different regfiles (Condition, INT, SPR,
XER) are separate, and so from an SRAM porting perspective all need only 1W
(with the exception of LDST Update, which can be timesliced).

the maximum number of read ports for any one Regfile SRAM is 3.  4 would be
nice.

so we only need 4R1W which is doable.

when it comes to multi issue this is where the stratification ODD-EVEN
Regfile numbering comes into play, and we *still* only need 4R1W.
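
one possible reading of the ODD-EVEN idea, sketched in Python.  the assumptions here are mine and are not confirmed above: two banks selected by the low bit of the register number, each bank being its own 4R1W SRAM.  the names bank_of and fits_in_one_cycle are hypothetical.

```python
from collections import Counter

READ_PORTS = 4    # per bank, matching the 4R1W figure
WRITE_PORTS = 1   # per bank


def bank_of(regnum):
    """Assumed banking rule: even registers in bank 0, odd in bank 1."""
    return regnum & 1


def fits_in_one_cycle(reads, writes):
    """True if no bank is asked for more ports than it has."""
    r = Counter(bank_of(x) for x in reads)
    w = Counter(bank_of(x) for x in writes)
    return all(r[b] <= READ_PORTS and w[b] <= WRITE_PORTS for b in (0, 1))


# eight reads and two writes spread evenly: each bank sees 4R1W
balanced = fits_in_one_cycle([0, 1, 2, 3, 4, 5, 6, 7], [8, 9])
# five reads all landing on even registers: bank 0 runs out of read ports
clash = fits_in_one_cycle([0, 2, 4, 6, 8], [1])
```

under this assumption the total port count doubles while each individual SRAM stays at 4R1W, at the cost of stalls when accesses pile up on one bank.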

>  So at first sight, from physical implementation point for smaller nodes,
> the Tomasulo algorithm seems more scalable than extended scoreboarding.

it definitely isn't.

to get multi issue in Tomasulo you need multiple Common Data Buses,
otherwise the CDB becomes even more of a bottleneck than it already is.
assuming 4 issue, to achieve this the ROB needs to be made a multi ported
(4R4W) SRAM, and, worse, the ReservationStation CAMs now also need to be
4R4W.

plus some other nightmare aspects.

>  I indicated before that in smaller nodes power consumption and delay is
> mainly determined by the length of the interconnects and not by the input
> load of the logic gates itself; in 180nm it will be more fifty/fifty.

ah, appreciated the insight

with the Computation Unit latches being distributed, not centralised,
these latches capture the operands close to the point where they will be
fed into the pipelines.

i plan to have the pipelines "double back" on themselves, placing the
result *back* where the Computation Unit may latch the result, easily, then
wait for the Regfile Common Data Bus dedicated to writing to the regfile.

these Broadcast Buses are an unavoidable necessity and will need careful
design and layout, and buffers to ensure they can be driven at speed.

both Tomasulo and Scoreboards require these large fanout / fanin Broadcast
Regfile Buses.

except that in Scoreboards, READ is completely separate from WRITE.
therefore, READ is fanout ONLY and WRITE is fanin ONLY.

this makes a crucial difference because the Tomasulo CDB is a single-path
global resource, contended for READ *AND* WRITE: one path for delivery of
*ALL* inflight data.

>  As Jeremy indicated this is next to the power consumption in the register
> files and cache which scales with the total bit count of the block and the
> nRmW order of the block.

4R1W.

> Also the trivialness of a big fan-in NOR or NAND gate may be deceptive,
> these gates are not feasible and will be synthesized to trees of NAND/NOR
> gates.

perfect.  as expected.  diagrams in Mitch's book chapters show this being
done, especially on the 32 reg vectors.
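
to put a number on the tree depth, a quick Python sketch.  it is a simplification: the function name wide_nor and the max_fanin of 4 are assumptions for illustration, and real synthesis would alternate NAND/NOR stages rather than use the plain OR groups modelled here.

```python
def wide_nor(bits, max_fanin=4):
    """Evaluate a wide NOR as a tree of small gates.

    Returns (nor_value, levels), where levels is the number of
    gate stages needed when no gate has more than max_fanin inputs.
    """
    assert len(bits) > 0 and max_fanin >= 2
    level = [bool(b) for b in bits]
    levels = 0
    while len(level) > 1:
        # group the current signals into chunks of at most max_fanin
        level = [any(level[i:i + max_fanin])
                 for i in range(0, len(level), max_fanin)]
        levels += 1
    # final inversion turns the wide OR into a wide NOR
    return (not level[0]), levels


all_clear, depth4 = wide_nor([0] * 32)               # 32 -> 8 -> 2 -> 1
_, depth2 = wide_nor([0] * 32, max_fanin=2)          # 32 -> 16 -> ... -> 1
```

with 4-input gates a 32-bit vector needs three stages, and with 2-input gates five, so the delay grows with the logarithm of the vector width.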

>
>  In that respect a high fan-in NOR/NAND can have similar time/power
> consumption than a seemingly more complex case of if statement. In zero
> order, for single output block, delay and power is determined by the number
> of inputs independent of the complexity of the RTL/HDL code. In first order
> one has to account that NAND/NOR logic is more efficient than XOR/XNOR
> logic but for bigger trees this difference is less pronounced as XOR/XNOR
> trees will be synthesized to more efficient trees using AOI (and-or-invert)
> cells.

appreciate the insights, Staf.

to summarise, then:

* we have some persistent misunderstandings from the Academic literature
which people continue to believe, and which need to be stomped on whenever
they occur.  nicely.
* we have some big NOR/NAND gates (32 in) which create a cascade.  these
are expected.
* DFFs are used at the decentralised Computation Units to store
decentralised inflight data
* Centralised 4R1W Regfiles store centralised register data.
* Unidirectional Broadcast Buses transfer data between Comp Units and
Regfiles.

l.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68