[libre-riscv-dev] Scoreboards

Tue May 14 07:03:35 BST 2019

On Mon, May 13, 2019 at 11:00 PM Samuel Falvo II <sam.falvo at gmail.com> wrote:
>
> On Mon, May 13, 2019 at 11:11 AM Luke Kenneth Casson Leighton
> <lkcl at lkcl.net> wrote:
> > On Mon, May 13, 2019 at 6:20 PM Samuel Falvo II <sam.falvo at gmail.com> wrote:
> > preeetty much, yeah.  firstly: it turns out that [correct] implementations
> > of a scoreboard - the entire scoreboard - are a lot less in gates than any
> > one FU.
>
> I tried my hand at conceiving a FU (simple binary adder) design on
> paper (assuming an FPGA-friendly design, so no latches[1]; includes
> both function unit and computation unit functionality, since for the
> purposes of exploration/experimentation, it makes more sense to keep
> them together), and what I ended up with, I think, was basically a
> single reservation station Tomasulo FU.

 perfect!  that's pretty much exactly what a 6600 Function Unit is.
what advocates of the Tomasulo algorithm miss is:

 * the multi-row nature of TS RS's has a consequential (indirect)
knock-on effect that makes the use of a CAM (indirectly) mandatory

 * the ROB Dest Reg# has to be a CAM because of not one but *TWO* things:

(a) the multi-row (one-to-many) nature of TS RS rows and
(b) the ROB instruction issue ordering by using a row *number*.

the 6600 instead can have the instruction order preserved as a
bit-wise linked list of write-dependencies, overloading the FU-FU
bit-matrix to do so.  i.e. if each instruction is given an (apparently
"unnecessary") write-dependency, no instruction may COMMIT
out-of-order, yet it may produce RESULTs out-of-order.

so in the Tomasulo algorithm, by making the TS RS's multi-row, the
opportunity to remove the CAM and save gates *and significantly reduce
power* was entirely missed.

> Two of the three RSFFs in my design correspond to the Vj and Vk
> (valid) bits for Qj and Qk, respectively, so I recognize that, at
> least.  These flags are set by 5-bit comparators (e.g., when busy & Rj
> == the register ID on the writeback bus, then set the Vj flag and
> register the writeback data).

 so i deduce that you chose to stick with binary register numbers
instead of converting them to unary?

> The remaining RSFF is the "instruction
> loaded/busy" flag.  The timing chain kicks off only when busy & Vj &
> Vk are true.  When RETIRE_FU is asserted, the three RS latches are
> reset, bringing the unit back to a quiescent state.

 cool.

> The FU had the following control signals:
>
> 1) ISSUE_FU - Loads the current instruction word into the FU's local
> instruction register, and sets the BUSY_FU flag in the next cycle
> (RSFF) that goes back to the scoreboard and gates other FU activities.
> 2) BUSY_FU - True if the FU is in use.
> 3) WBD - Write-Back Data bus.
> 4) WBS - Write-Back Select (which register it's writing back to).
> 5) WBVALID - True if data on the write-back bus is valid (basically, a
> logical OR of all RETIRE_FU signals).
> 6) RETIRE_REQUEST_FU - Asserted some time after both operands have
> been filled (somehow).
> 7) RETIRE_FU - Resets all the RSFFs, and drives this FU's results on on WBD/WBS.
> 8) RETIRE_RS - Register select (to appear on WBS when the time is right).
> 9) RETIRE_D - Register writeback data (to appear on WBD when the time is right).

hmmm if the names were those that are used in mitch's book chapters
i'd have an easier time understanding.  also, mitch himself could
comment.

> I'm assuming that the register selects captured in the local
> instruction register when the ISSUE_FU signal was true will also feed
> back into the scoreboard /somehow/, so that it knows which registers
> this FU is reserving.

 yes.  the *individual* register selects - which i assume you are
doing in binary - need to be converted to UNARY, and then the UNARY
versions are ORed together.

 this for both read reservations and write reservations.

 those both then go each through a priority-picker (one each for read
and write, separately), and that's how you know to pick one AND ONLY
one Function Unit to be allowed to read (access to the READ ports of
the regfile), and one AND ONLY one to be allowed to write (access to
the WRITE ports of the regfile).

[right now, the code that i'm working on, FU #1 has somehow been
granted permanent access to the regfile read port: readable is set
stuck high for FU #1.  *sigh*]

btw obviously, for writing, you clearly don't want to write to the
regfile when the result isn't actually ready!  so, you must AND the
FU's "writable" signal with the Computation Unit's "Done" signal
(called "Request Release"), and *then* pass a batch of those into a
priority picker.

hypothetically.... and i am speaking only hypothetically, you _could_
keep that all in binary.  if you were prepared to perform some sort of
multi-input monstrosity:

* src1 read register: MIN(FU1_src1, MIN(FU2_src1, MIN(FU3_src1, FU4_src1)))

something like that... except... yeah, yuk.  however, it illustrates
the point, that you want one *and only* one Function Unit at a time to
get access (separately and independently!) to the regfile read port
and regfile write port.

> This would work for register/register operations where the FU is
> blocked waiting for the source operands to be committed; anything
> involving immediates or register operands which are not blocked would
> need additional elaboration of the design (and probably more
> issue-time signaling), but I think it's more or less easily expanded
> to support that case.

 yes, see diagram from 10.4.7 (p19), pretty clear-cut.  just pass in
the immediate in place of the operand, the latches take care of
keeping it there on behalf of the CU, and the scoreboard obviously
doesn't need a reservation for a register which isn't used.

 however, take care with timing, obviously, and notice how, when
compared to "non-immediate-using" CUs, the *opcode* is further decoded
(assuming that the opcode says "part of the instruction is an
immediate") and HOLDS the latch containing the immediate in a
DIFFERENT way from the operand-that-is-not-an-immediate.

so there's 4 SR-latches needed rather than 3, because operand 1
(src1/imm) is treated differently, timing-wise, from operand 2 (src2
only, no immediate)

> True, but I'd argue that the FPGA I'm working with is much tighter
> than what Cray/Thornton had to work with.  I'm limited to the
> equivalent of about 5000 2, 3, or 4-input gates.  I do have the
> advantage that each "gate" has a corresponding DFF associated with it.

 see attached SR-Latch design (one sync, one async), it's not exactly
an SR-Latch, because S takes priority over R.  i'd like to do better
than that.

> ECP5K would ease this limitation significantly, of course.

 i'd argue that the cost of the ECP5K dev boards are well within the
target budget of someone wanting to do a home computer simulation.

> ________
> 1.  As convenient as I find working with FPGAs to be, I really miss
> the freedom of being able to use latches and level-sensitive logic.
> Example: exploiting NOR-type RS latch behavior and gate delays to /in
> effect/ function identically to an edge-triggered DFF is just #@#^ing
> brilliant.  A lot of what made the 6502 work the way it did was
> extensive use of latches and level-sensitivity.  Come to think of it,
> I'm not aware of anything inside the 6502 that was edge-sensitive,
> which is what I think set it apart from the 6800, which you might
> recall, required an input clock 4x the desired bus clock to function
> at all.

 whoops, i didn't know that.

> With edge-triggered designs, the best you can do to simulate
> it is to drive your clock at (at least) twice the desired rate and
> alternate explicitly between two different phases.

 or go through some preeetty tough shenanigens, setting some
tough-to-follow timing rules, that data and a "data valid" signal must
be "obeyed", and that if you don't *want* to process some data, you
had to say so IN THE PREVIOUS CYCLE.

 dan gisselquist describes this well in "buffered handshake"
https://zipcpu.com/blog/2017/08/14/strategies-for-pipelining.html

 however, we found, after an extensive examination, that, actually, a
well-written SyncFIFO (such as the one in nmigen) follows and respects
these required interaction rules, precisely.

 which is quite fascinating.

> Perhaps no
> processor is worse at this than the original Intel 8051 family, which
> required a whopping 12 clocks per machine cycle!  Bleh!  But I
> digress.

 :)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2019-05-14_06-29.jpg
Type: image/jpeg
Size: 21614 bytes
Desc: not available
URL: <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/attachments/20190514/8be4e721/attachment-0002.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2019-05-14_06-44.png
Type: image/png
Size: 14642 bytes
Desc: not available
URL: <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/attachments/20190514/8be4e721/attachment-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2019-05-14_06-57.jpg
Type: image/jpeg
Size: 17837 bytes
Desc: not available
URL: <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/attachments/20190514/8be4e721/attachment-0003.jpg>