[libre-riscv-dev] Scoreboard vs Tomasulo

Sat May 16 22:37:40 BST 2020

On Sat, May 16, 2020 at 9:16 PM Jeremy Singher <thejsingher at gmail.com> wrote:

> > However, the scoreboard is limited in that it does not handle WAR and WAW
> > hazards very well.
>
> > the original 6600 handles WAR extremely well, only stalling on WaW
> > condition, detection which did not matter greatly because the pipelines
> > were only at most 2 stages long anyway (Mitch only noticed after rereading
> > last year that the FP ADD of the 6600 was 2 stage pipelined.  no academic
> > literature has acknowledged or noticed this).
>
> I would argue that the CDC6600 handles WAR better than an in-order
> machine, but not better than a machine with full renaming.

Jeremy: 6600 designs *do* have register "renaming".  i said already,
a number of times.  and if you have studied and listened to the Academic
literature - and believe it - you are misinformed by that literature, and will
continue to be misinformed by that literature.

please understand: i have spent over nine months studying this, learning
from Mitch Alsup, one of the world's leading high-performance computing
architects, implementing this and it's not going to get chucked out to
implement something that takes six MONTHS to fully understand is a
better design, to be replaced by something that would take another six
months to implement *and we then could not do multi-issue on top of it*.

please therefore take the time to understand this design, particularly when
we are right in the middle of a time-critical deadline.

> In the
> CDC6600, the scoreboard would block the completion of the W inst until
> the R inst has read its operands. In Tomasulo, the W inst can proceed
> immediately before the R inst is ready, since the destination register
> is physically renamed.

Jeremy: the "renaming" carried out by the 6600 does exactly this same
job.

Academics unfortunately have *only* studied the *patent* - not Thornton's
book nor read Mitch Alsup's book chapters which he wrote based on 30+
years experience of implementing 6600-style execution in the 88100,
AMD K9 and AMD Opteron Series of CPUs.

the 6600 patent *ONLY* covers the Q-Tables.

the description of the Q-Tables does *NOT* include the Computation Units
where the operand latches are.

these latches *ARE* the "renamed registers" (actually, nameless registers).

> >  ROB is a CAM,
>
> What? This is not true. Instructions in flight track their index in
> the ROB, at completion, they write success bit according to the index
> they have. Not a CAM. Simple addressed write.

ah appreciated the correction.  i knew there were CAMs somewhere.
when i discovered that, i dropped further investigation.

i am however aware that if you wish to do multi-issue, the ROB will need
to be multi-ported SRAM.  without multi-porting the SRAM, you simply
cannot write multiple entries.

this is where the problems (read: nightmares) for Tomasulo start to manifest,
and progress from there through multiple Common Data Bus channels into
multi-ported (NxR, NxW) CAMs at the Reservation Station level.

i've been through this, ok, over 18 months ago.

by contrast, when converted to single-bit unary vectors, each entry in the
Matrices (covered by a single SR NAND latch) is *independent*.  multiple
of those may be set in a single cycle.

there has been a *massive* amount of thought gone into this, ok?

> > given that Intel processors use Tomasulo, we start to see why Intel
> > processors suck so much power.
>
> Power consumption on a processor is primarily from high switching
> rates on large register files, and SRAM leakage.

given that we're planning around 28 Function Units and have a 128-entry
register file, the calculations that i did for that would put the Dependency
Matrices around a whopping 50,000 gates, using SR NAND Latches.

(if we did it as DFFs it would be an unacceptable 250,000 gates)

> The scheduling logic is a miniscule portion.

not for the large-scale out-of-order systems that we are planning to
do, it's not,
due to the large sized GPU register file (128 entries.  actually 2x
128 - 128 FP,
128 INT).

we may have to *down*-scale via a regfile cache to cope with that.
discussion for another time.

> Intel cores are power-hungry because
> historically they have targeted the desktop/workstation market, and do
> not have as many decades of experience as other firms optimizing for
> power-constrained environments.

appreciated the insight.

> Also almost all modern OOO cores use Tomasulo because Tomasulo is
> better at avoiding false hazards compared to scoreboarding.

in the versions of scoreboards implemented by academics - because they
fail to comprehend them properly - because they only read the patent
on Q-Tables - yes.

in the version that Mitch Alsup - a world leading expert on Scoreboard
Mechanics - taught me:

no.

also, the IIT Madras RISE Group Shakti E-Class processor is *also* a
scoreboard design.  this was designed "from first principles" by Professor
Kamatoti, and it was only through learning from Mitch that i was able
to recognise what he had done as being identical to a 6600 scoreboard.

> Tomasulo
> also makes it easier to implement precise traps, something which is
> notoriously challenging  with just a scoreboard.

no it is not.  this is *yet again* complete factual misinformation and ignorance
from Academic literature.

can i ask: have you read Patterson's book?  this is the usual place where
most people get their misinformation from.  the release of this misinformation
was why Mitch Alsup - a very pissed-off and exasperated Mitch - wrote those
two chapters, back around... 2008 (i think).

precise exceptions, traps, interrupts, branch speculation and predication may
all be achieved through "shadowing".  see the latter part of the 2nd chapter of
Mitch's book.

see the section "Shadowing" in this page:
https://libre-soc.org/3d_gpu/architecture/6600scoreboard/

it's very simple: you hold write until such time as the opportunity for "damage"
is cleared.  Operand Forwarding allows instructions to run ahead, using results
that would otherwise have to be received only first by going through
the Regfile.

*this is already implemented* and the unit tests demonstrated it as
*fully functional*

> Lets be clear here. The CDC6600 AVOIDS WAR and WAW hazards by BLOCKING
> younger instructions (See pg3 in Mitch Alsup's text). Tomasulo's
> supports OOO execution PAST WAR and WAW hazards, enabling the core to
> exploit more ILP across its functional units.

this is only because the 1965 CDC6600 did not have an Operand Forwarding Bus
(instead it used the regfile as a same-clock forwarding bus, by reading on one
edge and writing on the other.  this unfortunately meant that without an OpFwd
Bus, stall had to occur).

this DOES NOT INHERENTLY MEAN that **ALL** Scoreboards are rubbish.

if you add an Operand Forwarding Bus (equivalent to the CDB in Tomasulo)
the exact same parity functionality is achieved (without the severe downsides).

> Luke, I agree with you that the reservation station rows in Tomasulo's
> can be equivalent to the operand latches in Scoreboarding. However,
> adding many operand-latches (by having many functional units) does not
> scale as nicely as just adding a few more rows to Tomasulo's
> reservation stations.

this is incorrect.  you may be under the mistaken impression that the pipelines
duplicated as well.  this clearly would be inadviseable because the FU's job
is to serve one and only one result.  therefore you know that this is a mistaken
understanding, and that consequently you have multiple FUs for the same
pipeline.  (termed "Concurrent Computation Unit").

therefore it's minimal.  excluding the latches themselves (the rows
have latches, the
CompUnits - aka FunctionUnits have latches, the total number of latches after
flattening is the same), the number of gates added by each CompUnit is
minimal: around... 80, maximum.

that's peanuts.

> Especially since designing the CAMs for
> Tomasulo's is a well studied, and well-understood problem

the key is the combination of those two:

A) well-understood... *and*
B) a problem.

have a look at how multi-issue is done using Tomasulo.  it's shockingly bad.

> > 3. Expand the number of RSes so that if you were to count the total number
> > of places operands are stored, they are the same.
>
> > (another way to put this is, "flatten all 2D RSes into 1D")
>
> This does not scale well. A 2D CAM is far more area efficient than a
> 1D flattened vector.

a 1D bitvector.  a single, 1D bitvector.  of 32 bits in length.

compared to 32 5-bit binary addresses.

just the latches alone, that math does not stack up.

now let's take it to 2D.

let's say that we have 10 FUs and 32  Regs.

in Tomasulo that is a 5-bit address.  10x5 = 50 DFF latches.  DFF
latches are 10 gates so that's 500 gates.

in 6600 that is 10x32 = 320 SR NAND Latches.  SR NAND latches are 2
gates.  320x2 is 640 gates.

not very different in terms of latch count, is it?

now let's do the address-checking.

to check a register as "active" - bear in mind that this is per
register and consequently
if doing multi-issue that's a batch of address-checking gates *per register*.

for Tomasulu, this requires 10x5 = 500 XOR gates (plus some
multi-input AND gates on each row)   XOR is 4 gates, so that's 2,000
gates *PER MULTI-ISSUE LANE*.

for the 6600, that's 10x32 **AND** gates.  320 AND gates.

and of those 320 AND gates, they are re-useable for multi-issue
WITHOUT REQUIRING ADDITIONAL GATES.

so that is 320 AND gates for single-issue
that is zero extra gates for dual-issue
that is zero extra gates for quad-issue
zero extra gates for octal-issue.

start to make sense now, why unary encoding as 1D vectors is less
gates and less power than binary CAMs?

if you want to check multiple unary bits (a critical prerequisite for
multi-issue), all you do is... set... multiple... bits.

what happens?  multiple AND gates fire.  in the same clock.   and
that's exactly what we want.

> > 7. rename RSes to "Function Units" (actually in Thornton's book the phrase
> > "Computation Units" is used)
>
> So a Tomasulo's design with 4 rows now becomes a "flattened" design
> with 4 functional units? The additional cost of the functional units
> is enormous.

no, it's 4 x 80 gates (approx).  that's not 4x duplication of ALU
pipelines *behind* those 4 Function Units.  see Mitch's 2nd Chapter,
"Concurrent Computation Unit"

https://libre-soc.org/3d_gpu/architecture/6600scoreboard/600x-concurrent_comp_unit.png

Jeremy: it's rather unfortunate timing.  six to twelve months ago, we
could have had this debate and it would have been fun, informative,
useful and productive.  it would also have been possible to bring you
in to the discussions that took place (Nov 2018 to around Apr-May
2019) between myself, Samuel Falvo, Mitch Alsup and the team either
here on this list, or on comp.arch.

however with the upcoming deadline, this kind of otherwise really
valuable discussion is, sadly, something that we *do not have time
for*.

retrospective recommendation that we reconsider and use Tomasulo would
not only result in us completely missing the upcoming Oct 2020
deadline, we would end up with an exceptionally poorly-performing,
limited and seriously problematic processor as well.  worse than that,
sadly i have to say that i am concerned that any retrospective
discussion would itself (aside from valuable corrections and
assistance in creating a comparative analysis) become a serious
distraction.

it's liitterallly taken 18 MONTHs to get to the point where we are now
with the code and the design.  we need to finish the design in a very
short timeframe (14 weeks), and as Yehowshua points out, due to the
complexity of this particular area of Computer Science, that in turn
implies that you'll need to trust me on the architectural analysis
(which took 5 months to complete, under Mitch's guidance).

so, my apologies for needing to be firm on this: the opportunity to
help exists within the time-frame and financial constraints that were
determined and set, over 18 months ago, before you joined.

therefore, if you'd like to help, then asking questions which are
related to *understanding* of the precise-augmented Scoreboard system,
targetted very specifically at allowing you to engage and assist us in
meeting that deadline, are very welcome.

anything other than that very specific goal-orientated focus, i am
really sorry, we really need to move it to one side, and for that, i
recommend comp.arch.

therefore, if you would like to discuss the merits of Tomasulo vs
Scoreboard (particularly those that pursue recommending Tomasulo
*over* Scoreboard) further, i recommend doing so on comp.arch, which
is where Mitch Alsup most regularly may be found.  with the right
interesting question, both he and many others will pitch in.

the one thing that you need to be aware of is that Mitch is both
hyper-intelligent, and second that he does not tolerate people who
cannot keep up with him, especially if they try to convince him of
something without having first demonstrated that they've understood
*him*.  he's well over 70, he's retired, he made enough "xxxx-you"
money on selling GPU patents to Samsung, so he doesn't have to put up
with anything that he doesn't want to :)

l.