[libre-riscv-dev] Scoreboard vs Tomasulo

Sat May 16 23:45:17 BST 2020

I have finally finished parsing Mitch's chapters, and now I see what
the primary confusion is from. The scoreboard being developed here has
been referred to in the mailing lists as a "CDC 6600-like" scoreboard,
while in reality, the final design is far extended far beyond the
original CDC 6600 implementation. I would suggest not calling it a
"CDC 6600-like scoreboard" in the future, instead a "extended
scoreboard" is more accurate.

Let me try to compose my thoughts.
In order for OOO execution to be possible, write data from
speculatively executed instructions must be buffered, for in the case
of misspeculation, they must not be allowed to commit into the
architectural register file.

In Tomasulo's, speculative write data are held in physically renamed registers.
In the extended scoreboard scheme, speculative write data are held in
compute unit latches.
In Tomasulo's, branch dependency masks are used to drain misspeculated
instructions.
In extended scoreboard, the misspeculated instructions drain
themselves from the system.
Both schemes can be designed to achieve similar OOO behaviors.
Thus, one cannot draw a performance conclusion just by seeing which
scheme is being used.

Ideally the above text should be uncontroversial, please let me know
if there is a large misunderstanding here.

My opinion is that Tomasulo's is a more elegant and well-proven scheme
for achieving OOO execution, compared to an extended scoreboard. I
also argue that the extended scoreboard scheme does not present
significant power/area advantages compared to Tomasulo's. Given that
estimates on power/area tend to be wildly inaccurate, I think we will
have to wait for power analysis of LibreSoC to get a definitive answer
to this debate :).

I deeply apologize if any of my thoughts could be interpreted as
criticism. I fully support this kind of exploration in OOO design. I
also apologize if I seem to suggest a change in design, as I
understand that hardware design is not nearly as flexible as software
design.

On Sat, May 16, 2020 at 2:38 PM Luke Kenneth Casson Leighton
<lkcl at lkcl.net> wrote:
>
> On Sat, May 16, 2020 at 9:16 PM Jeremy Singher <thejsingher at gmail.com> wrote:
>
> > > However, the scoreboard is limited in that it does not handle WAR and WAW
> > > hazards very well.
> >
> > > the original 6600 handles WAR extremely well, only stalling on WaW
> > > condition, detection which did not matter greatly because the pipelines
> > > were only at most 2 stages long anyway (Mitch only noticed after rereading
> > > last year that the FP ADD of the 6600 was 2 stage pipelined.  no academic
> > > literature has acknowledged or noticed this).
> >
> > I would argue that the CDC6600 handles WAR better than an in-order
> > machine, but not better than a machine with full renaming.
>
> Jeremy: 6600 designs *do* have register "renaming".  i said already,
> a number of times.  and if you have studied and listened to the Academic
> literature - and believe it - you are misinformed by that literature, and will
> continue to be misinformed by that literature.
>
> please understand: i have spent over nine months studying this, learning
> from Mitch Alsup, one of the world's leading high-performance computing
> architects, implementing this and it's not going to get chucked out to
> implement something that takes six MONTHS to fully understand is a
> better design, to be replaced by something that would take another six
> months to implement *and we then could not do multi-issue on top of it*.
>
> please therefore take the time to understand this design, particularly when
> we are right in the middle of a time-critical deadline.
>
> > In the
> > CDC6600, the scoreboard would block the completion of the W inst until
> > the R inst has read its operands. In Tomasulo, the W inst can proceed
> > immediately before the R inst is ready, since the destination register
> > is physically renamed.
>
> Jeremy: the "renaming" carried out by the 6600 does exactly this same
> job.
>
> Academics unfortunately have *only* studied the *patent* - not Thornton's
> book nor read Mitch Alsup's book chapters which he wrote based on 30+
> years experience of implementing 6600-style execution in the 88100,
> AMD K9 and AMD Opteron Series of CPUs.
>
> the 6600 patent *ONLY* covers the Q-Tables.
>
> the description of the Q-Tables does *NOT* include the Computation Units
> where the operand latches are.
>
> these latches *ARE* the "renamed registers" (actually, nameless registers).
>
> > >  ROB is a CAM,
> >
> > What? This is not true. Instructions in flight track their index in
> > the ROB, at completion, they write success bit according to the index
> > they have. Not a CAM. Simple addressed write.
>
> ah appreciated the correction.  i knew there were CAMs somewhere.
> when i discovered that, i dropped further investigation.
>
> i am however aware that if you wish to do multi-issue, the ROB will need
> to be multi-ported SRAM.  without multi-porting the SRAM, you simply
> cannot write multiple entries.
>
> this is where the problems (read: nightmares) for Tomasulo start to manifest,
> and progress from there through multiple Common Data Bus channels into
> multi-ported (NxR, NxW) CAMs at the Reservation Station level.
>
> i've been through this, ok, over 18 months ago.
>
> by contrast, when converted to single-bit unary vectors, each entry in the
> Matrices (covered by a single SR NAND latch) is *independent*.  multiple
> of those may be set in a single cycle.
>
> there has been a *massive* amount of thought gone into this, ok?
>
>
> > > given that Intel processors use Tomasulo, we start to see why Intel
> > > processors suck so much power.
> >
> > Power consumption on a processor is primarily from high switching
> > rates on large register files, and SRAM leakage.
>
> given that we're planning around 28 Function Units and have a 128-entry
> register file, the calculations that i did for that would put the Dependency
> Matrices around a whopping 50,000 gates, using SR NAND Latches.
>
> (if we did it as DFFs it would be an unacceptable 250,000 gates)
>
> > The scheduling logic is a miniscule portion.
>
> not for the large-scale out-of-order systems that we are planning to
> do, it's not,
> due to the large sized GPU register file (128 entries.  actually 2x
> 128 - 128 FP,
> 128 INT).
>
> we may have to *down*-scale via a regfile cache to cope with that.
> discussion for another time.
>
>
> > Intel cores are power-hungry because
> > historically they have targeted the desktop/workstation market, and do
> > not have as many decades of experience as other firms optimizing for
> > power-constrained environments.
>
> appreciated the insight.
>
> > Also almost all modern OOO cores use Tomasulo because Tomasulo is
> > better at avoiding false hazards compared to scoreboarding.
>
> in the versions of scoreboards implemented by academics - because they
> fail to comprehend them properly - because they only read the patent
> on Q-Tables - yes.
>
> in the version that Mitch Alsup - a world leading expert on Scoreboard
> Mechanics - taught me:
>
> no.
>
> also, the IIT Madras RISE Group Shakti E-Class processor is *also* a
> scoreboard design.  this was designed "from first principles" by Professor
> Kamatoti, and it was only through learning from Mitch that i was able
> to recognise what he had done as being identical to a 6600 scoreboard.
>
> > Tomasulo
> > also makes it easier to implement precise traps, something which is
> > notoriously challenging  with just a scoreboard.
>
> no it is not.  this is *yet again* complete factual misinformation and ignorance
> from Academic literature.
>
> can i ask: have you read Patterson's book?  this is the usual place where
> most people get their misinformation from.  the release of this misinformation
> was why Mitch Alsup - a very pissed-off and exasperated Mitch - wrote those
> two chapters, back around... 2008 (i think).
>
> precise exceptions, traps, interrupts, branch speculation and predication may
> all be achieved through "shadowing".  see the latter part of the 2nd chapter of
> Mitch's book.
>
> see the section "Shadowing" in this page:
> https://libre-soc.org/3d_gpu/architecture/6600scoreboard/
>
> it's very simple: you hold write until such time as the opportunity for "damage"
> is cleared.  Operand Forwarding allows instructions to run ahead, using results
> that would otherwise have to be received only first by going through
> the Regfile.
>
> *this is already implemented* and the unit tests demonstrated it as
> *fully functional*
>
>
> > Lets be clear here. The CDC6600 AVOIDS WAR and WAW hazards by BLOCKING
> > younger instructions (See pg3 in Mitch Alsup's text). Tomasulo's
> > supports OOO execution PAST WAR and WAW hazards, enabling the core to
> > exploit more ILP across its functional units.
>
> this is only because the 1965 CDC6600 did not have an Operand Forwarding Bus
> (instead it used the regfile as a same-clock forwarding bus, by reading on one
> edge and writing on the other.  this unfortunately meant that without an OpFwd
> Bus, stall had to occur).
>
> this DOES NOT INHERENTLY MEAN that **ALL** Scoreboards are rubbish.
>
> if you add an Operand Forwarding Bus (equivalent to the CDB in Tomasulo)
> the exact same parity functionality is achieved (without the severe downsides).
>
>
> > Luke, I agree with you that the reservation station rows in Tomasulo's
> > can be equivalent to the operand latches in Scoreboarding. However,
> > adding many operand-latches (by having many functional units) does not
> > scale as nicely as just adding a few more rows to Tomasulo's
> > reservation stations.
>
> this is incorrect.  you may be under the mistaken impression that the pipelines
> duplicated as well.  this clearly would be inadviseable because the FU's job
> is to serve one and only one result.  therefore you know that this is a mistaken
> understanding, and that consequently you have multiple FUs for the same
> pipeline.  (termed "Concurrent Computation Unit").
>
> therefore it's minimal.  excluding the latches themselves (the rows
> have latches, the
> CompUnits - aka FunctionUnits have latches, the total number of latches after
> flattening is the same), the number of gates added by each CompUnit is
> minimal: around... 80, maximum.
>
> that's peanuts.
>
> > Especially since designing the CAMs for
> > Tomasulo's is a well studied, and well-understood problem
>
> the key is the combination of those two:
>
> A) well-understood... *and*
> B) a problem.
>
> have a look at how multi-issue is done using Tomasulo.  it's shockingly bad.
>
>
> > > 3. Expand the number of RSes so that if you were to count the total number
> > > of places operands are stored, they are the same.
> >
> > > (another way to put this is, "flatten all 2D RSes into 1D")
> >
> > This does not scale well. A 2D CAM is far more area efficient than a
> > 1D flattened vector.
>
> a 1D bitvector.  a single, 1D bitvector.  of 32 bits in length.
>
> compared to 32 5-bit binary addresses.
>
> just the latches alone, that math does not stack up.
>
> now let's take it to 2D.
>
> let's say that we have 10 FUs and 32  Regs.
>
> in Tomasulo that is a 5-bit address.  10x5 = 50 DFF latches.  DFF
> latches are 10 gates so that's 500 gates.
>
> in 6600 that is 10x32 = 320 SR NAND Latches.  SR NAND latches are 2
> gates.  320x2 is 640 gates.
>
> not very different in terms of latch count, is it?
>
>
> now let's do the address-checking.
>
> to check a register as "active" - bear in mind that this is per
> register and consequently
> if doing multi-issue that's a batch of address-checking gates *per register*.
>
> for Tomasulu, this requires 10x5 = 500 XOR gates (plus some
> multi-input AND gates on each row)   XOR is 4 gates, so that's 2,000
> gates *PER MULTI-ISSUE LANE*.
>
> for the 6600, that's 10x32 **AND** gates.  320 AND gates.
>
> and of those 320 AND gates, they are re-useable for multi-issue
> WITHOUT REQUIRING ADDITIONAL GATES.
>
> so that is 320 AND gates for single-issue
> that is zero extra gates for dual-issue
> that is zero extra gates for quad-issue
> zero extra gates for octal-issue.
>
> start to make sense now, why unary encoding as 1D vectors is less
> gates and less power than binary CAMs?
>
> if you want to check multiple unary bits (a critical prerequisite for
> multi-issue), all you do is... set... multiple... bits.
>
> what happens?  multiple AND gates fire.  in the same clock.   and
> that's exactly what we want.
>
>
> > > 7. rename RSes to "Function Units" (actually in Thornton's book the phrase
> > > "Computation Units" is used)
> >
> > So a Tomasulo's design with 4 rows now becomes a "flattened" design
> > with 4 functional units? The additional cost of the functional units
> > is enormous.
>
> no, it's 4 x 80 gates (approx).  that's not 4x duplication of ALU
> pipelines *behind* those 4 Function Units.  see Mitch's 2nd Chapter,
> "Concurrent Computation Unit"
>
> https://libre-soc.org/3d_gpu/architecture/6600scoreboard/600x-concurrent_comp_unit.png
>
> Jeremy: it's rather unfortunate timing.  six to twelve months ago, we
> could have had this debate and it would have been fun, informative,
> useful and productive.  it would also have been possible to bring you
> in to the discussions that took place (Nov 2018 to around Apr-May
> 2019) between myself, Samuel Falvo, Mitch Alsup and the team either
> here on this list, or on comp.arch.
>
> however with the upcoming deadline, this kind of otherwise really
> valuable discussion is, sadly, something that we *do not have time
> for*.
>
> retrospective recommendation that we reconsider and use Tomasulo would
> not only result in us completely missing the upcoming Oct 2020
> deadline, we would end up with an exceptionally poorly-performing,
> limited and seriously problematic processor as well.  worse than that,
> sadly i have to say that i am concerned that any retrospective
> discussion would itself (aside from valuable corrections and
> assistance in creating a comparative analysis) become a serious
> distraction.
>
> it's liitterallly taken 18 MONTHs to get to the point where we are now
> with the code and the design.  we need to finish the design in a very
> short timeframe (14 weeks), and as Yehowshua points out, due to the
> complexity of this particular area of Computer Science, that in turn
> implies that you'll need to trust me on the architectural analysis
> (which took 5 months to complete, under Mitch's guidance).
>
> so, my apologies for needing to be firm on this: the opportunity to
> help exists within the time-frame and financial constraints that were
> determined and set, over 18 months ago, before you joined.
>
> therefore, if you'd like to help, then asking questions which are
> related to *understanding* of the precise-augmented Scoreboard system,
> targetted very specifically at allowing you to engage and assist us in
> meeting that deadline, are very welcome.
>
> anything other than that very specific goal-orientated focus, i am
> really sorry, we really need to move it to one side, and for that, i
> recommend comp.arch.
>
> therefore, if you would like to discuss the merits of Tomasulo vs
> Scoreboard (particularly those that pursue recommending Tomasulo
> *over* Scoreboard) further, i recommend doing so on comp.arch, which
> is where Mitch Alsup most regularly may be found.  with the right
> interesting question, both he and many others will pitch in.
>
> the one thing that you need to be aware of is that Mitch is both
> hyper-intelligent, and second that he does not tolerate people who
> cannot keep up with him, especially if they try to convince him of
> something without having first demonstrated that they've understood
> *him*.  he's well over 70, he's retired, he made enough "xxxx-you"
> money on selling GPU patents to Samsung, so he doesn't have to put up
> with anything that he doesn't want to :)
>
> l.
>
> _______________________________________________
> libre-riscv-dev mailing list
> libre-riscv-dev at lists.libre-riscv.org
> http://lists.libre-riscv.org/mailman/listinfo/libre-riscv-dev