[libre-riscv-dev] Scoreboards

Samuel Falvo II sam.falvo at gmail.com
Mon May 13 23:07:12 BST 2019


On Mon, May 13, 2019 at 11:11 AM Luke Kenneth Casson Leighton
<lkcl at lkcl.net> wrote:
> On Mon, May 13, 2019 at 6:20 PM Samuel Falvo II <sam.falvo at gmail.com> wrote:
> preeetty much, yeah.  firstly: it turns out that [correct] implementations
> of a scoreboard - the entire scoreboard - are a lot less in gates than any
> one FU.

I tried my hand at conceiving an FU (simple binary adder) design on
paper (assuming an FPGA-friendly design, so no latches[1]; includes
both function unit and computation unit functionality, since for the
purposes of exploration/experimentation, it makes more sense to keep
them together), and what I ended up with, I think, was basically a
single reservation station Tomasulo FU.

Two of the three RSFFs in my design correspond to the Vj and Vk
(valid) bits for Qj and Qk, respectively, so I recognize that, at
least.  These flags are set by 5-bit comparators (e.g., when busy & Rj
== the register ID on the writeback bus, then set the Vj flag and
register the writeback data).  The remaining RSFF is the "instruction
loaded/busy" flag.  The timing chain kicks off only when busy & Vj &
Vk are true.  When RETIRE_FU is asserted, the three RSFFs are
reset, bringing the unit back to a quiescent state.
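
A minimal behavioural sketch of the three RSFFs and the operand
capture described above, in Python rather than HDL.  The class name,
method names, the 32-bit data width and the wbvalid qualifier on the
comparators are assumptions made for illustration; they are not taken
from the paper design.  The timing chain itself is not modelled, only
the condition that starts it.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF  # assumed 32-bit datapath (not stated in the post)


@dataclass
class AdderFU:
    """Hypothetical model of a single-reservation-station adder FU."""
    busy: bool = False  # "instruction loaded/busy" RSFF
    vj: bool = False    # operand j valid RSFF
    vk: bool = False    # operand k valid RSFF
    rd: int = 0         # destination register ID
    rj: int = 0         # source register IDs (5-bit)
    rk: int = 0
    dj: int = 0         # registered operand data
    dk: int = 0

    def issue(self, rd: int, rj: int, rk: int) -> None:
        """ISSUE_FU: load the register selects and set the busy flag."""
        self.rd, self.rj, self.rk = rd & 31, rj & 31, rk & 31
        self.busy, self.vj, self.vk = True, False, False

    def snoop(self, wbvalid: bool, wbs: int, wbd: int) -> None:
        """One clock of write-back bus snooping by the two comparators."""
        if self.busy and wbvalid:
            if wbs == self.rj:          # busy & Rj == register on the bus
                self.vj, self.dj = True, wbd
            if wbs == self.rk:          # busy & Rk == register on the bus
                self.vk, self.dk = True, wbd

    def ready(self) -> bool:
        """Condition that kicks off the timing chain."""
        return self.busy and self.vj and self.vk

    def retire(self) -> tuple[int, int]:
        """RETIRE_FU: return (register select, data) and reset all RSFFs."""
        out = (self.rd, (self.dj + self.dk) & MASK32)
        self.busy = self.vj = self.vk = False
        return out
```

For example, issuing r3 = r1 + r2 and then presenting r1 and r2 on the
write-back bus in successive cycles makes ready() true only after the
second write-back, and retire() returns the unit to its quiescent
state.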

The FU had the following control signals:

1) ISSUE_FU - Loads the current instruction word into the FU's local
instruction register, and sets the BUSY_FU flag in the next cycle
(RSFF) that goes back to the scoreboard and gates other FU activities.
2) BUSY_FU - True if the FU is in use.
3) WBD - Write-Back Data bus.
4) WBS - Write-Back Select (which register it's writing back to).
5) WBVALID - True if data on the write-back bus is valid (basically, a
logical OR of all RETIRE_FU signals).
6) RETIRE_REQUEST_FU - Asserted some time after both operands have
been filled (somehow).
7) RETIRE_FU - Resets all the RSFFs, and drives this FU's results on WBD/WBS.
8) RETIRE_RS - Register select (to appear on WBS when the time is right).
9) RETIRE_D - Register writeback data (to appear on WBD when the time is right).
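
As a rough sketch of how the retire-side signals in this list could
combine, the Python function below forms WBVALID as the OR of the
RETIRE_FU grants and steers the granted FU's RETIRE_RS/RETIRE_D onto
WBS/WBD.  The fixed-priority grant (lowest-numbered requesting FU
wins) and the function name are assumptions; the list above does not
say how the scoreboard chooses among simultaneous RETIRE_REQUEST_FU
signals.

```python
def writeback_bus(retire_request, retire_rs, retire_d):
    """Return (retire_fu, wbvalid, wbs, wbd) for one cycle.

    retire_request: per-FU RETIRE_REQUEST_FU flags
    retire_rs:      per-FU RETIRE_RS register selects
    retire_d:       per-FU RETIRE_D write-back data
    """
    # Assumed arbitration: grant RETIRE_FU to the first requester only.
    retire_fu = [False] * len(retire_request)
    for i, req in enumerate(retire_request):
        if req:
            retire_fu[i] = True
            break

    # WBVALID is the logical OR of all RETIRE_FU signals.
    wbvalid = any(retire_fu)

    # AND-OR mux: only the granted FU contributes to the shared bus.
    wbs = 0
    wbd = 0
    for i, grant in enumerate(retire_fu):
        if grant:
            wbs |= retire_rs[i]
            wbd |= retire_d[i]
    return retire_fu, wbvalid, wbs, wbd
```

With three FUs where the second and third both request retirement,
only the second is granted in that cycle; the third keeps its request
asserted and is granted on a later cycle.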

I'm assuming that the register selects captured in the local
instruction register when the ISSUE_FU signal was true will also feed
back into the scoreboard /somehow/, so that it knows which registers
this FU is reserving.

This would work for register/register operations where the FU is
blocked waiting for the source operands to be committed; anything
involving immediates or register operands which are not blocked would
need additional elaboration of the design (and probably more
issue-time signaling), but I think it's more or less easily expanded
to support that case.

> secondly: remember that the 6600 was very specifically from an era where
> the number of gates (transistors) really mattered.  a huge amount of effort
> went into getting the highest bang-per-buck ratio that they could get.
> using SR-NOR Latches instead of DFFs for example.

True, but I'd argue that the FPGA I'm working with is much tighter
than what Cray/Thornton had to work with.  I'm limited to the
equivalent of about 5000 2-, 3-, or 4-input gates.  I do have the
advantage that each "gate" has a corresponding DFF associated with it.
ECP5K would ease this limitation significantly, of course.

________
1.  As convenient as I find working with FPGAs to be, I really miss
the freedom of being able to use latches and level-sensitive logic.
Example: exploiting NOR-type RS latch behavior and gate delays to /in
effect/ function identically to an edge-triggered DFF is just #@#^ing
brilliant.  A lot of what made the 6502 work the way it did was
extensive use of latches and level-sensitivity.  Come to think of it,
I'm not aware of anything inside the 6502 that was edge-sensitive,
which is what I think set it apart from the 6800, which you might
recall, required an input clock 4x the desired bus clock to function
at all.  With edge-triggered designs, the best you can do to simulate
it is to drive your clock at (at least) twice the desired rate and
alternate explicitly between two different phases.  Perhaps no
processor is worse at this than the original Intel 8051 family, which
required a whopping 12 clocks per machine cycle!  Bleh!  But I
digress.

-- 
Samuel A. Falvo II


