[libre-riscv-dev] Scoreboard vs Tomasulo

Sun May 17 01:53:34 BST 2020

On Sat, May 16, 2020 at 11:45 PM Jeremy Singher <thejsingher at gmail.com> wrote:
>
> I have finally finished parsing Mitch's chapters, and now I see what
> the primary confusion is from. The scoreboard being developed here has
> been referred to in the mailing lists as a "CDC 6600-like" scoreboard,
> while in reality, the final design is far extended far beyond the
> original CDC 6600 implementation.

yes.  well...  far beyond it in *functionality*, but not that far in
terms of gate
count [caveat: if we were only doing the same 3 regfiles, A B and X.
as we are doing 32x INT, 32x FP, plus SPR regs *and* Condition Regs
all under the control of Dependency Matrices, a direct comparison is unfair]

oh.  ah.  one definite enhancement: in the diagrams you see in Mitch's
chapters, there is only one GO-READ and only one GO-WRITE.  one GO-READ
indicates a desire to request *multiple* Regfile Read ports.

because POWER9 has in some cases up to *five* incoming registers and
*three* outgoing registers (in different Regfiles - INT/FP, SPR, CR), we needed
to split those out into individual lines.  GO-READ1/READ-REQ1,
GO-READ2/READ-REQ2 etc.

currently i am struggling with this FSM which implements 3-operand in,
immediates, 2-operand out *and* manages the address, GO-ST, address
exceptions and deals with POWER9 LD/ST "Update" mode.

https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg

this i reaaaally need some help with.

> I would suggest not calling it a
> "CDC 6600-like scoreboard" in the future, instead a "extended
> scoreboard" is more accurate.

appreciated.  if you could consider helping change the wiki to reflect this (and
help us to maintain the advantages below) this would be fantastic.

i've used a range of terms - "augmented 6600 scoreboard", and others.
the historic
link and the effective revival of something long-forgotten which was
both elegant
and staggeringly gate-efficient (out of necessity): i don't want to use the word
"political" in any sentence however there is a considerable publicity
"wow factor"
to doing something as surprising as reviving a 55-year-old design.

since interacting with Mitch Alsup last year, you're the first additional person
to actually do any kind of significant analysis: consequently terminology has
not really been discussed... because there wasn't anyone to discuss it with!

> Let me try to compose my thoughts.
> In order for OOO execution to be possible, write data from
> speculatively executed instructions must be buffered, for in the case
> of misspeculation, they must not be allowed to commit into the
> architectural register file.

mis-speculation, LD/ST page-faults, predicated instructions where the operand
that contains the bitmask is in a register that has not been read yet, and even
simple external NMI interrupts: all of these need a means and method of
killing in-flight data, yes.

> In Tomasulo's, speculative write data are held in physically renamed registers.
> In the extended scoreboard scheme, speculative write data are held in
> compute unit latches.

these two effectively having a very low Levenshtein distance, but not quite
zero.

* in Tomasulo, the reason for the explicit renaming is down to the 2D
nature of RS's.
  hence the ROB row number needs to be stored in the RS

* in extended-scoreboard, the FU's index / column-number *is* that "name";
  with there only *being* a one-to-one path, there is no need for a
name *at all*,
  hence why i refer to the CU latches as "nameless registers".

  the ROB row number *is* the FU number which *is* the renamed-register-number
  and consequently by being one-to-one in every aspect, needs no name.

this aspect of 6600 scoreboards - the complete lack of naming and even the
lack of significance drawn to it even by James Thornton himself in his book,
and certainly in the patents, is why the 6600 is *believed* not to have register
renaming of *any* kind.

ah - one other really critical thing (i will add to the page):

* Tomasulo ROB Row numbers preserve instruction order through sequential
   numbering by the actual row number

* 6600 scoreboards preserve instruction order in the FU-FU Matrix with
  a NxN bit-level matrix of SR Latches implementing a "linked list" (the
  Directed Acyclic Graph) of ordered dependencies, using the *register*
  dependencies at the time the instruction is inserted into the *FU-REGS*
  Matrix to determine the FU-FU DAG.  this has more in common with a
  "linked list" data structure, to use software terminology.

this latter aspect is a crucial one that even James Thornton notes is really
*really* complex to explain, yet paradoxically requires an extraordinarily
small number of actual gates to implement.

however understanding the difference between the (binary) ROB Row
numbering (which preserves the DAG of instruction order by way
of binary-addressed instruction ROB Row N+1 being a greater index
than binary-addressed instruction ROB Row N) and this
"functionally-equivalent" linked-list system equally successfully preserving
a DAG but using unary SR Latches to do so, is key to recognising
and appreciating that Tomasulo and Scoreboards *really are* functionally
directly equivalent.

here it should become clear as to another aspect of why multi-issue is
problematic in Tomasulo.  where detection of "readiness to commit" in
a single-issue Tomasulo is a simple matter of checking *one* row
"is your result ready, if yes commit it and cyclically increment the pointer
to the ROB row to the next item, ready for test-or-commit on the *next*
cycle", this must be updated to testing of *multiple* row "readiness"
states, which clearly requires a multi-ported SRAM.

this is where "solutions" that i have seen actually stratify the ROB
into 4 (or N) interleaved banks.  crossbars are required... it's where
the nightmares begin.

> In Tomasulo's, branch dependency masks are used to drain misspeculated
> instructions.

are these unary-encoded or binary-encoded?  by masks i would expect them
to be unary-encoded and thus allow multiple bits to be set, allowing multiple
operations to be cancelled.

if not, and binary masks are used, then each unit must have an arithmetic
comparator, "is my binary index less than or equal to the branch dependency
mask, if less than, then this means i must die".

the Shakti 6600-like (extended) scoreboard system in their E-Class design
uses *binary* numbering, which means that they use that scheme.  effectively
the mask is not a mask per se, it is a shortened version of the instruction
sequential number.

BOOM uses unary bit-masks.  i think.  i had a brief conversation about it
with someone, somewhere, a few months back.

i greatly prefer the unary system.  it allows interleaving of multiple
simultaneous
cancellations to *automatically* sort themselves out, by ORing the cancellations
together to produce the mask!

this is an "accidental" side-effect of the Shadow Matrix concept,
described below.

a binary-indexed cancellation system requires a *central resource* to calculate
the lowest-numbered "index", followed by broadcasting that index across the
one (and only one) permitted cancellation datapath.

> In extended scoreboard, the misspeculated instructions drain
> themselves from the system.

ah no.  we'll be cancelling them explicitly, via an unary bit-mask,
that is global
and combinatorial in nature, extending right the way through the entirety
of the Function Unit, CompUnits *and* the associated pipelines.

the unary mask is basically a conversion of the *binary* indices of the
CompUnit number on any given Concurrent Computation Unit.  0-3
in this diagram here:
https://libre-soc.org/3d_gpu/architecture/6600scoreboard/600x-concurrent_comp_unit.png

thus in that case the unary cancellation mask would be 4-bit wide.

these are linked directly to - and are equal to - the "GO_DIE" signals in *this*
diagram:
https://libre-soc.org/3d_gpu/shadow.jpg

note in that diagram the existence per cell in the Dependency Matrices of *two*
Shadow Latches.

the first Shadow Latch is for the first Branch Unit to indicate "speculative
execution underway as controlled and directed by Branch Unit 1".

the second Shadow Latch is for the *second* Branch Unit to do likewise.

a third Shadow Latch exists for predicated execution.  one bit each of the
predicate register (once read) will be thrown at the "FAILED-3" line if that
bit is zero, and at the "SUCCESS-3" line if that bit is 1.  if "failed" is
asserted, then so is the "GO_DIE".  that "GO_DIE", being connected to
the cancellation masks, *INSTANTLY* kills not just the register hazard
allocation, it resets the Function Unit to idle *and* kills the partial result
generation actually in the pipeline itself.

if "success" is asserted, the shadow latches are cleared, and "WRITE_REQUEST"
is no longer prevented and prohibited from asserting.  the result (if it was
ready), is allowed to proceed "I now request a Regfile Write Port".
if the result
happened not to be ready, then when the result *is* ready it will not
be held up.

a fourth Shadow latch exists for cancellation of operations in the "shadow"
of a LD/ST Unit's page-fault / address exception.

as you can see this requires a trivial number of gates per cell, however
it is *each* exception or cancellation opportunity that  requires these Shadows,
on a *per FU basis*, across all *other* FUs.

consequently i designed a "Shadow Matrix" that takes care of that.

> Both schemes can be designed to achieve similar OOO behaviors.
> Thus, one cannot draw a performance conclusion just by seeing which
> scheme is being used.

no.  hence why we haven't done that kind of analysis: we're focussing
on implementation.

> Ideally the above text should be uncontroversial, please let me know
> if there is a large misunderstanding here.

nono looks great, really appreciate the insights raised.  the missing piece
is the true significance of the role of the Shadow Matrices as being *the*
provision of precise exception capability, speculation, predication, everything.

yes i am aware that without "predication prediction" yes those are two
different words we end up with exactly the same situation as for Branch
Prediction: predicated operations that are issued, cause FUs to become
full / allocated, that are cancelled moments later because the majority
of the predicate register turns out to be full of zeros.

this is *not* a high priority item right now :)

> My opinion is that Tomasulo's is a more elegant and well-proven scheme
> for achieving OOO execution, compared to an extended scoreboard. I
> also argue that the extended scoreboard scheme does not present
> significant power/area advantages compared to Tomasulo's.

Mitch Alsup is better equipped, from 40 years experience at gate-level
design (which he kept and *refined* when working for AMD, hence some
of the blistering withering words to less experienced, less-knowledgeable
people trying to tell him that HDLs were inherently better), to answer this
one.

> Given that
> estimates on power/area tend to be wildly inaccurate, I think we will
> have to wait for power analysis of LibreSoC to get a definitive answer
> to this debate :).

if we have to use DFFs rather than SR NAND, the argument is instantly
won by Tomasulo because we have quite literally a 5-fold increase in
gate count right across the board, in every single Matrix.  Shadow,
FU-Regs, FU-FU - the lot.  everywhere you see an SR-Latch diagram
(2 gates), instead add 10 gates for a DFF.

for the 180nm ASIC this would jump approximately.... mmm... 10,000 gates
up to 50,000.

we may have to do that.  it will suck.

> I deeply apologize if any of my thoughts could be interpreted as
> criticism. I fully support this kind of exploration in OOO design. I
> also apologize if I seem to suggest a change in design, as I
> understand that hardware design is not nearly as flexible as software
> design.

appreciated, Jeremy.  if this discussion was taking place 6 to 12 months ago
the "mode" we would be in was radically different.  now is "full-bore
time-critical implementation mode", and will be until the other side
of Oct 2020.

l.