[libre-riscv-dev] Scoreboards

Luke Kenneth Casson Leighton lkcl at lkcl.net
Mon May 13 19:10:23 BST 2019


On Mon, May 13, 2019 at 6:20 PM Samuel Falvo II <sam.falvo at gmail.com> wrote:

> I've been reading up on implementing scoreboards from various sources,
> especially how things were done in the CDC 6600.  It has gotten me to
> think of a few things, some plus some minus.
>
> 1) One of the big concerns with the KCP53000(B) is the size of the
> finished design, and there seems to be two reasons for it: the CSRU,
> which I can't do much of anything about, and the ALU, which I
> originally built 6502/Z80-style, where you have one monolithic chunk
> of logic that does everything.  It implements a barrel shifter and
> whatever adder circuit the Verilog synthesis tool deems appropriate.
> I'm wondering if there's value in removing the shifter and making it a
> slower computational unit (narrower shifts per cycle, but using
> multiple cycles to complete a full 63-bit shift).  Since the iCE40HX8K
> uses ripple-carry for its adder logic, I'm wondering if there's value
> in isolating _that_ as well (e.g., using a 16-bit adder circuit over
> four cycles to accomplish a 64-bit add), and perhaps making some
> future 530x0 design variant which is concurrent to support more rapid
> clock speeds but still hiding enough latency to simultaneously allow
> me a smaller design while giving the illusion of (almost?) consistent
> and rapid throughputs (whatever that may be).  Basically, if I split
> the single IXU into three or four FUs (you figure one for load/store
> as well), will this reorganization of the logic combined with the
> scoreboard needed to drive it result in a smaller overall design than
> just having one monolithic IXU?


preeetty much, yeah.  firstly: it turns out that [correct] implementations
of a scoreboard - the entire scoreboard - are a lot less in gates than any
one FU.

secondly: remember that the 6600 was very specifically from an era where
the number of gates (transistors) really mattered.  a huge amount of effort
went into getting the highest bang-per-buck ratio that they could get.
using SR-NOR Latches instead of DFFs for example.

thirdly: as i mentioned before, no matter what you do in a standard
single-issue pipeline design, you still have these hard choices to make
where you are forced to add in pipeline stalling (to deal with cases where
a register is being written to and the next instruction needs that answer),
you're forced to mask interrupts or cause pipeline stalls during LD/ST
operations, and all sorts of other awful "messes", *all of which go away
and are dealt with in one place* in a 6600-like design.

fourthly: the additional awfulness - the hoops that have to be jumped
through - associated with adding in FSM-based ALUs into a single-issue
pipelined design - that too goes away, because a FU-style front-end on the
FSM-based ALU is treated no differently from a pipelined ALU as far as the
scoreboard is concerned.

This is an open question that I don't
> know the answer to.
>
> 2) I've read up on Q matrices at least five times now.  I'm still
> *utterly* confused.  Continuous scoreboards make 1000% more sense to
> me.
>

 :)

 thornton notes that whilst the actual amount of logic involved is far
smaller than any of the ALUs, the timing-critical nature of a scoreboard
is... well, as i'm finding out on hw-dev, extremely mind-bending.

there's at least four separate loops that i've been able to identify:

* instruction issue generates individual Function-Unit "issue" signals,
which are gated (disabled) by "busy" signals that come back from each FU
(on a clock delay so that you don't get a combinatorial loop).

* Function Units themselves are blocked by *both* detection of whether the
operands are ready (this is the "Q" table btw) *and* whether the result
register is safe to overwrite (because another FU needs the contents that
haven't yet been written) - this *again* on *another* loop because the FUs
need to know if *other* FUs have reservations outstanding...

* Even if FUs have results ready, you can't just blithely smash all the
results onto the output bus because the Register File may only have 1 write
port, so there is *another* signal (request release) that goes into a
priority picker (to choose one and only one FU at a time to gain the right
to write to the Regfile), and that is *also* a loop.... oh and there's
obviously a priority-picker for reading as well

now, the thing you do have to bear in mind is as follows:

* mitch's analysis of Thornton's work, he took the original 6600 design
(Thornton P127 through to 135), and then, rather than store the Function
Unit numbers in *binary* (see right-hand-side of Thornton Figure 76),
decided to store them as *UNARY* (one bit only allowed to be set in a 2^N
array of gates).

* this you can see in the diagram of mitch's book 10.5, at the top of page
23.  Thornton has those rows with a binary *NUMBER* to represent the
Function Unit Number (one for the destination, one for src1, one for src2),
and those numbers are therefore in a *ONE DIMENSIONAL* array of triple
(binary) numbers, where mitch translated that to a TWO dimensional array of
triple-bits (three latches).

* you can see the cell on the next page (p24), containing the triple
latches, one for dest, one for src1, one for src2.

* later, in section 10.7, mitch explains why he did this: it saves gates...
*under certain design conditions*.  when you have the Function Unit numbers
in binary, you have to encode and decode them (re-translate them from
binary to unary) to perform those big "OR" gates.  Mitch's design *avoids*
the need for the encoder-decoders, however it does so at the expense of
turning an O(N) problem into an O(N^2) problem, and at some point,
Gate_Count(Decoders) + Gate_Count(O(N)) vs O(N^2) swings the other way.

most modern scoreboard designs, based on academic literature (as opposed to
the original 6600 analysis) have not done this kind of in-depth analysis,
and just go with the *original* 6600 design, using a 1D array of binary
Function Unit #s.

however, most modern scoreboard designs also do not have the "precise
exception" capability that Mitch added in (towards the tail end of the 2nd
chapter, see "Branches").


> 3) I'm surprised at how my independent design fairly closely matches
> Thornton's FU design.  The "three cycles" of my design are, literally,
> operand-fetch, execute, register-writeback, corresponding roughly to
> issue, go-read, and go-write, in that order.  I can now understand why
> Luke wanted to bring Thornton's work to my attention.  It doesn't see
> like *that* big of a leap from my current plans to a superscalar
> design.
>

it really isn't.  it's worth mentioning again that a degenerate 6600-style
system with just a single pipelined ALU and multiple FU front-ends is
basically a standard "single-issue" pipelined processor.

i'm going to keep bashing away at the timing issues until i have something
that actually works, and can deal with one instruction per clock without
completely keeling over.

at that point i can start to make an analysis of whether mitch's simplified
design is workable, and also whether the design would be better off with
unary or binary register numbers.

if registers are to be done with standard memory cells (2R1W), particularly
on FPGAs, binary numbers may be the better choice, even if it would cost
more gates.

one of the downsides of going with an unary 2D matrix is: if you want to
use standard (addressable) SRAM, you have to *RE-ENCODF* the damn unary
register number *BACK* into to a binary address.

l.


More information about the libre-riscv-dev mailing list