[libre-riscv-dev] Scoreboard vs Tomasulo

Sat May 16 16:00:11 BST 2020

On Saturday, May 16, 2020, Yehowshua <yimmanuel3 at gatech.edu> wrote:

> This website gives an excellent comparison
>
> https://www.cs.umd.edu/~meesh/cmsc411/website/projects/dynamic/intro.html

ah good find.  particularly because, like all Academic literature on the
6600, it is partially correct but at the same time dangerously
misleading and in places factually wrong.  this for example is wrong:

  "However, the scoreboard is limited in that it does not handle WAR and WAW
  hazards very well."

the original 6600 handles WAR extremely well, only stalling on a WaW
condition, which did not matter greatly because the pipelines
were at most 2 stages long anyway (Mitch only noticed after rereading
last year that the FP ADD of the 6600 was 2-stage pipelined.  no academic
literature has acknowledged or noticed this).

you have to understand that the majority of designs at the time were around
10 to 12 clocks per instruction.  the CDC6600 got that down to FOUR and
consequently represented a 2 to 3 fold increase in performance.

> In general, it says that Tomasulo is a distributed scoreboard.

mmm no.  the ROB, which is the key component, is centralised, and the CDB
is a central resource.

> The biggest drawback of Tomasulo is the central data bus as the messaging
> scheme. One could use a mesh NOC between functional units.

ah, not quite the biggest: as you correctly say, all you need is multiple
CDBs and that problem is solved.

no, the biggest drawback of Tomasulo is that the ROB (reorder buffer)
numbering is in *binary*.  bear in mind that the ROB is a CAM: the larger
it gets, the more power it takes.

let us say we have a 32 entry ROB, let us say we have 5 bits for the ROB
CAM key.

this means that when trying to put a result into the CAM we have
5x32 XOR gates to fire - on every cycle - and if we want 2 Common Data
Buses we now need DOUBLE that...

... *and we need clash detection*.
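
as a rough illustration of that arithmetic, here is a toy python model (not
taken from any real design: the function names and the "XOR then check for
zero" view of a tag compare are assumptions made purely for illustration).
it counts the per-bit comparisons a binary-tagged CAM performs when every
entry compares its full key against every bus, on every cycle:

```python
# toy model only: names and structure are hypothetical, for illustration.

def binary_cam_xor_count(entries: int, tag_bits: int, buses: int = 1) -> int:
    """Per-cycle count of per-bit XOR comparisons: every entry compares
    its whole binary tag against every broadcast bus."""
    return entries * tag_bits * buses


def binary_cam_match(tags, broadcast_tag: int, tag_bits: int):
    """Return indices of entries whose binary tag equals the broadcast tag.
    Equality is modelled as 'XOR of the two tags is all zeros'."""
    mask = (1 << tag_bits) - 1
    return [i for i, tag in enumerate(tags)
            if ((tag ^ broadcast_tag) & mask) == 0]


single_bus = binary_cam_xor_count(entries=32, tag_bits=5)           # 160
dual_bus = binary_cam_xor_count(entries=32, tag_bits=5, buses=2)    # 320
```

the count scales with entries x key bits x buses, which is the point being
made: nothing here models the clash detection, which comes on top.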

by contrast, 6600 scoreboards have *unary* numbering and consequently the
CAM is degenerate and becomes a single AND gate activation.

not only that but if you wish to check or activate more than one register
simultaneously then that... is... just... more than one corresponding AND
gate activating at the CAM.

perfectly simple, very little power, perfect for dropping multi issue on
top.
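
a minimal python sketch of the unary case, assuming one wire (one bit) per
register and modelling the wires as bits of an integer.  the helper names
are hypothetical, this is not code from the project:

```python
# toy model only: one bit per register, names are hypothetical.

def one_hot(reg: int) -> int:
    """Unary (one-hot) encoding: register N is wire N."""
    return 1 << reg


def unary_match(waiting: int, broadcast: int) -> int:
    """Bitwise AND of the two wire sets: each set bit in the result
    corresponds to one AND gate activating."""
    return waiting & broadcast


hit = unary_match(one_hot(3), one_hot(3))    # wire 3 active
miss = unary_match(one_hot(3), one_hot(4))   # nothing active

# more than one register at once: just more than one wire active
multi = unary_match(one_hot(1) | one_hot(5),
                    one_hot(1) | one_hot(5) | one_hot(7))
```

in this model there is no per-entry key comparison at all: a match is simply
a non-zero result, and several matches in the same cycle cost nothing extra.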

> Given that we have well over 15 functional units, I would have thought a
> distributed scoreboard instead of a centralized one makes more sense.

the document, whilst useful as a starting point, has misled you on two
counts (so far)

the key is that binary numbering requires power sucking CAMs.  hence why
the choice is the 6600 scheme, augmented to be precise, and why even the
original 6600 Q Table numbering has been converted from binary to unary.

not Tomasulo.

actually.... in 6600, the DMs are still centralised, it is however the case
that the unary numbering results implicitly in opportunities for
parallelism (see below) and consequently could be termed "distributed".

also, the role of the Function Units, the GORD/REQ latching, is done
as part of the critical acknowledgement and communication *with* the
scoreboards, and is done very close to the pipelines, *not* in the DMs
themselves, so in that regard, yes it is "distributed".

Function Units are equivalent to "Reservation Station Rows" from Tomasulo
terminology.  having multiple rows per Tomasulo Reservation Station *also*
requires that those be CAMs!

yet more power-sucking!

given that Intel processors use Tomasulo, we start to see why Intel
processors suck so much power.

> Of course, I have no numbers to back this up. But these are just some
> thoughts.
>
>
> I know we’re using a ring/circular buffer for messaging at the moment.

yes for the future version.

for the simple 180nm version it will be simple direct regfile port
broadcast buses, connected one to one with the corresponding Function Unit
Operand input.

thus, FU operand 1 will be directly connected to Regfile Port1 Broadcast
Bus.

FU operand 2 - if there *is* an operand 2 - will connect to Regfile Port2
Broadcast Bus.

*if we have time* then we can drop in the cyclic buffers, and when data
comes out of Regfile Port2 it is cyclically shifted to Op... 1 or Op3 or
whatever is required.
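
a toy python sketch of the difference between the two schemes.  everything
in it is an assumption for illustration only: three ports, a single rotation
amount, and the function names are all hypothetical, and a real cyclic
buffer would be more involved than a list rotation:

```python
# toy model only: ports and operands are list positions (index 0 = Port1/Op1).

def direct_route(port_values):
    """180nm-style wiring: regfile port N feeds FU operand N, one to one."""
    return list(port_values)


def cyclic_route(port_values, shift: int):
    """Cyclic-buffer idea: data from port N is rotated to operand
    (N + shift) modulo the number of ports."""
    n = len(port_values)
    operands = [None] * n
    for port, value in enumerate(port_values):
        operands[(port + shift) % n] = value
    return operands


ports = ["port1_data", "port2_data", "port3_data"]
fixed = direct_route(ports)          # Port2 data stays on Op2
to_op3 = cyclic_route(ports, 1)      # Port2 data lands on Op3
to_op1 = cyclic_route(ports, 2)      # Port2 data lands on Op1
```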

> Given that we have hard deadlines and limited resources, we can stick with
> this. But these are questions I do wonder about the answer to. I wonder if
> there are any papers where this has been explored at sufficient depth to
> draw conclusions.

when i investigated Tomasulo i encountered, in an academic paper, a way to
do multi issue.

the method, which involved stratifying the Reorder Buffer into 4 separate
slices, was awful.

the binary numbering on the ROB caused massive headaches because it
required special queues of binary ROB indices to represent the multi issue
requests.

when converted to unary, multiple bits may be set to indicate REG1 REG5
REG7 *in one cycle* on the *same wires* because one wire is dedicated to
each reg.

this is precisely and exactly what you need for multi issue and it is so
laughably simple.
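
to make the REG1 REG5 REG7 example concrete, a small python sketch.  the
32-wire width and the helper names are assumptions for illustration, not
anything from the actual HDL:

```python
# toy model only: a unary request is one integer, one bit (wire) per register.

def regs_to_unary(regs, width: int = 32) -> int:
    """Set one wire per requested register, all in the same vector."""
    mask = 0
    for reg in regs:
        if not 0 <= reg < width:
            raise ValueError(f"register {reg} out of range")
        mask |= 1 << reg
    return mask


def unary_to_regs(mask: int):
    """Recover the register numbers from the set wires, lowest first."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


# three registers requested in ONE cycle on the SAME set of wires
request = regs_to_unary([1, 5, 7])     # 0b10100010

# the binary equivalent needs a separate index per request, hence a queue
binary_queue = [1, 5, 7]
```

in the unary form the three requests do not compete for anything, whereas
the binary form has three separate indices that have to be carried and
processed somewhere.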

l .

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68