[libre-riscv-dev] Scoreboards

Tue May 14 18:53:53 BST 2019


On Tue, May 14, 2019 at 6:05 PM Samuel Falvo II <sam.falvo at gmail.com> wrote:
>
> On Mon, May 13, 2019 at 11:04 PM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
> > the 6600 instead can have the instruction order preserved as a
> > bit-wise linked list of write-dependencies, overloading the FU-FU
> > bit-matrix to do so.  i.e. if each instruction is given an (apparently
>
> I've been playing around with some thought experiments.  It looks like one can
> give the 6502 a 50% to 100% performance boost by adopting a scoreboard and
> no fewer than 3 identical FUs to execute instructions with.  However, as
> I've been diagramming on the whiteboard, I've never had a need to record
> write dependencies or use an FU-FU matrix.

 if you have only combinatorial ALUs, that complete in one cycle, and
you have absolutely no need or desire to have overlapping read /
execute / write phases, instead making them completely and utterly
separate and running via an FSM, the answer is no, you have absolutely
no need of an FU-Regs / FU-FU Matrix pair (remember, you need both),
and yes, you will NEVER encounter read or write register file hazards,
because all operations occur sequentially.  read execute write read
execute write read execute write.

 performance will be dog slow, however that's not the point :)
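
 as a minimal sketch of that fully-sequential case (python, purely
illustrative: the `Phase` names, the `run` function and the add-only
8-bit ALU are assumptions, not taken from the 6502 or from mitch's
chapters): each instruction finishes its WRITE before the next one
starts its READ, so there is nothing for a dependency matrix to track.

```python
from enum import Enum


class Phase(Enum):
    READ = 0
    EXECUTE = 1
    WRITE = 2


def run(program, regs):
    """Toy sequential machine (hypothetical).

    program: list of (dest, src1, src2) register numbers; every op is an add.
    regs:    list of register values, modified in place.
    Returns the sequence of phases that were stepped through.
    """
    trace = []
    for dest, src1, src2 in program:
        # READ: both operands are fetched before anything else happens
        trace.append(Phase.READ)
        a, b = regs[src1], regs[src2]
        # EXECUTE: combinatorial ALU, result available in one step
        trace.append(Phase.EXECUTE)
        result = (a + b) & 0xFF
        # WRITE: committed before the next instruction's READ begins
        trace.append(Phase.WRITE)
        regs[dest] = result
    return trace
```

 the second instruction of any pair always sees the first one's
result, simply because the phases never overlap.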

> This is one of the things about the Q matrix I just "don't get."  Why is it
> needed?  It seems like even that is superfluous.  I've read and re-read
> numerous times Mitch's chapters, and I just get lost every single time.
> Can you help elucidate what value it provides?

 it's best answered by asking the question "what decisions have to be
made - what are the hazard points that could cause data corruption on
reg reads, reg writes, memory loads, memory stores, atomic operations,
interrupts and exceptions if not dealt with correctly - in a "simple"
pipelined *single* issue architecture?"

 write all of those out (including the design strategy for dealing
with RAW, WAR and WAW on the register file), and we'll go over them
one by one.
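
 for reference, the three register hazards can be stated in a few
lines of python.  this is only a sketch of the standard definitions:
the `(dest, srcs)` tuple encoding and the `hazards` function name are
made up for illustration.

```python
def hazards(first, second):
    """Return the register hazards present when `second` follows `first`
    in program order.  Each instruction is (dest, srcs): a destination
    register number and a tuple of source register numbers."""
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 in s2:
        found.add("RAW")  # second reads what first writes
    if d2 in s1:
        found.add("WAR")  # second overwrites what first still has to read
    if d1 == d2:
        found.add("WAW")  # both write the same register
    return found
```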

> >  so i deduce that you chose to stick with binary register numbers
> > instead of converting them to unary?
>
> At this point, I'm just drawing boxes and connecting them with arrows, so
> it's all still symbolic.  I consider binary vs unary an implementation
> detail at this level of abstraction.  If my understanding is correct (and
> it may not be), gate count notwithstanding, they both should yield
> identical results.

 yes.
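
 a quick sketch of why (python, hypothetical helper names, 32
registers assumed): a unary register number is just a one-hot bit
vector, so "same register?" becomes a bitwise AND instead of an
equality comparison, and the answer is the same either way.

```python
def to_unary(regnum, nregs=32):
    """Convert a binary register number to a one-hot (unary) bit vector."""
    if not 0 <= regnum < nregs:
        raise ValueError("register number out of range")
    return 1 << regnum


def same_reg_binary(a, b):
    # binary encoding: needs a full comparator
    return a == b


def same_reg_unary(a, b):
    # unary encoding: a single AND per bit, then an OR-reduce
    return (to_unary(a) & to_unary(b)) != 0
```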

> > hmmm if the names were those that are used in mitch's book chapters
> > i'd have an easier time understanding.  also, mitch himself could
> > comment.
>
> After re-reading both Thornton's book and Mitch Alsup's chapters, I decided
> to synthesize what I've known before with what I think I just learned.  I
> derived my design from my understanding of first principles, using terms
> I'm familiar with from the contemporary popular press, which even Mitch's
> enhanced vocabulary doesn't match (regrettably).  For example, when I read
> GO_WRITE, my brain registers that as, "OK, now it's time to drive the bus
> to write to the register file."  But, that's not actually what happens;
> when GO_WRITE is asserted, it *appears* to mean that the results have
> *already* been written to the register file and it's now OK for the FU to
> become idle.

 it's not quite like that (i don't believe), and it may not be helping
that the register file is twin-clocked (rising and falling edge).  i
can't remember which way round it is: rising is write, falling is
read, or something, and it's how write-through is achieved without
destroying data.

> It's deeply counter-intuitive to me.  I wanted my names to
> more accurately reflect what was happening as *I* understood things; since
> I'm most familiar with synchronous, edge-triggered designs found in FPGAs,
> signals indicate what /will/ happen, not what /has/ happened.  I figured
> once I had that, I could extrapolate and relabel signals with greater
> understanding later on.
>
> The closest (but apparently not quite perfect) analogies between my signals
> and Mitch's are:
>
> | Mitch's Signal                                  | My Signal                                                                               |
> |-------------------------------------------------+-----------------------------------------------------------------------------------------|
> | (Hinted at schematically; but left unspecified) | WBD (register writeback data bus)                                                       |
> | (Hinted at schematically; but left unspecified) | WBS (register writeback register select)                                                |
> | (Not specified at all)                          | WBVALID (register writeback bus has valid data)                                         |
> | BUSY                                            | BUSY_FU                                                                                 |
> | GO_READ                                         | (generated internally based on BUSY_FU and what's driven on the register writeback bus) |
> | GO_WRITE                                        | RETIRE_FU                                                                               |
> | ISSUE                                           | ISSUE_FU                                                                                |
> | RELEASE_REQUEST                                 | RETIRE_REQUEST_FU                                                                       |

 ok cool.  close enough.

> >  those both then go each through a priority-picker (one each for read
> > and write, separately), and that's how you know to pick one AND ONLY
> > one Function Unit to be allowed to read (access to the READ ports of
> > the regfile), and one AND ONLY one to be allowed to write (access to
> > the WRITE ports of the regfile).
>
> I can see this kind of logic for protecting the write port; but, you could
> have 2n read ports on the register file (where n = # of FUs), though, yes?

 ... and have such an insane number of ports that the design of the
associated SRAM is utterly overwhelmed with gates.  multi-ported SRAM
is *massively* expensive.
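
 with only a few ports available, something has to choose which FU
gets them, and that is the priority-picker's job.  a minimal sketch in
python, assuming (this is an assumption, not something stated in the
chapters) that the lowest-numbered requesting FU wins; `priority_pick`
is a made-up name.

```python
def priority_pick(requests):
    """requests: bit vector, bit i set means FU i wants the port.
    Returns a bit vector with at most one bit set: the granted FU.
    Assumption: lowest-numbered requester has highest priority."""
    if requests < 0:
        raise ValueError("requests must be a non-negative bit vector")
    # x & -x isolates the lowest set bit (and gives 0 when x == 0)
    return requests & -requests
```

 one such picker would sit on the read requests and a second,
independent one on the write requests.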

> That seems like it'd save a few cycles (at the expense of more wires coming
> from the register file and/or more block RAMs used to implement the file in
> an FPGA).

 FPGAs typically max out at 2R1W.

 in the Libre RISCV SoC the plan is to break the register file into
HI-32 / LO-32 and then into odd reg# / even reg# to give 4 banks in
total.

 there will then be *four* completely separate sets of FUs (one for
each bank), with crossbars on the READ side NOT on the WRITE side.

 in this way, we can do up to 4-issue of 32-bit elements and 2-issue of
64-bit standard RISC-V operations... yet still only require 3R1W or
2R1W ported SRAM to do it.

l.