[libre-riscv-dev] Scoreboard and LDST questions

Sun Jun 21 03:46:47 BST 2020

On Sunday, June 21, 2020, Yehowshua <yimmanuel3 at gatech.edu> wrote:

> Hello Luke,
>         Spent the past few hours poring over the LDST architecture as
> well as the men-architecture.
>
> Some comments/questions:
> https://libre-soc.org/3d_gpu/architecture/6600scoreboard/
> > The SR Latches create a forward-progressing Finite State Machine, with
> three possible paths:
>
> From your description, only two of issue, go_rd, and go_wr can be active
> at any given time.

yes.  with SR Latches, this prevents all three of them from ever
reaching an "unknown" state.  you can see this clearly by examining
Mitch's CU drawing: the three SR latches are cyclically linked.

you have to walk *really* carefully through it, to spot it.  takes about
10-15 minutes.

however for our CUs we are not using SR Latches: we do not have time
available to verify their correctness.  and the number needed is small
(unlike in the DMs).

so we get away with DFFs.

>  That means that a maximum of 2 instructions can execute every 3 cycles?

this is incorrect on one count and slightly misleading on another.  three
things.  let me go through them.

a CompUnit equals a Reservation Station.  in both 6600 and Tomasulo
terminology, the CU (aka RS) MUST handle one and only one instruction
at a time.

this is the *definition* of a CU aka RS.

one instruction and one instruction ONLY.

this holds both by definition and by logical reasoning: if it tries to
handle two instructions at once, it must throw one result into the wind as
far as dependency tracking is concerned, with associated catastrophic data
corruption (and nobody designs processors that corrupt data).

OR

it must be designed *to* track two instructions, have two connections to
the Dependency Matrices, have two sets of register ports.

the easiest way by far to do that? have... two... CompUnits.   go figure.

so both by definition of what a CompUnit (aka RS) is, and by logical
reasoning, we can deduce that the statement that it handles two
instructions every 3 cycles is false.

it must - by definition - be only one instruction, and that is one every 3
cycles (excluding pipeline latency time).

first cycle: read opcode
second cycle: read operands from regfile
third cycle: process result through pipeline
fourth cycle: write result to regfile.

actually the interaction with the pipeline is combinatorial such that this
_can_ be 3 cycles not 4.

so that's the 2nd thing.  the *communication* with the *pipeline* requires
3 cycles.  if the pipeline is 5 stages, that's 8 cycles.
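
the cycle arithmetic above can be written down as a tiny sketch.  this is
illustration only: the function names are made up, and it assumes the
simple model described here (3 cycles of CU overhead, plus one cycle per
registered pipeline stage, with a combinatorial pipeline adding nothing).

```python
# hypothetical helper names; figures taken from the text above
CU_OVERHEAD_CYCLES = 3  # issue, operand read, result write


def cu_occupancy_cycles(extra_pipeline_cycles=0, overhead=CU_OVERHEAD_CYCLES):
    """Cycles for which one CU is tied up by one instruction.

    extra_pipeline_cycles is 0 for a combinatorial pipeline,
    or the number of registered stages for a multi stage one.
    """
    return overhead + extra_pipeline_cycles


def single_cu_ipc(extra_pipeline_cycles=0):
    """Best-case instructions per clock achievable by ONE CU on its own."""
    return 1.0 / cu_occupancy_cycles(extra_pipeline_cycles)


if __name__ == "__main__":
    for stages in (0, 5):
        print(stages, cu_occupancy_cycles(stages), single_cu_ipc(stages))
```

under those assumptions a single CU never gets near 1 IPC by itself, which
is the point being made here.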

sounds like shit performance, doesn't it? total waste of time and effort to
even bother with, making something that only handles one instruction every
3 cycles, right?

and we said that we are doing a multi issue OoO design, which means 2 or 4
IPC, not 0.2 IPC.

so there must be something extra going on, something missing, yes?

and it's very simply this: you lay down multiple CompUnits (multiple RSes).

now there are several options, here:

1. single stage simple pipelines (add, or xor etc)

for these, if you want 1 IPC single issue, you must lay down a minimum of 3,
because each CompUnit has a 3 cycle overhead.

four is better so that there is one outstanding in-flight operation.

2. multi stage pipelines.

for these you need Concurrent CompUnits (see Mitch's book chapters for the
definition)

these funnel MULTIPLE CompUnit frontends into *ONE* multi stage pipeline.

there is an Arbiter at the front, selecting one CU per clock, pushing one
and ONLY one set of operands from one and ONLY one CU.

there is an index which allows it to be reassociated with its CU when the
result pops out the end.

see test_inout_mux_pipe.py in nmutil for a simple example.

if the pipeline is 5 stages, add the 3 cycle overhead: you need *eight* CU
frontends to ensure no blocking at the instruction issue phase.

3. Blocking Finite State Machines (typically div units)

these, if allowed to block, would completely lock up the issue engine.

therefore they must also have multiple CompUnit (aka RSes) frontends.

this allows several inflight instructions to proceed to issue *without*
blocking at issue phase, waiting for the FSM to complete.

however the difference from (2) is that they allow *only* one result to be
computed at a time (because it's an FSM), whereas a multi stage pipeline
obviously has multiple partial results inside it.
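
as a rough sketch of option (2), here is a plain python model of several
CU frontends funnelled into one multi stage pipeline.  it is NOT the nmutil
code: run_muxed_pipeline and its arguments are hypothetical names, the
arbiter is assumed to be a simple lowest-index-first picker, and the result
is computed on entry and merely carried through the stages to keep it short.

```python
from collections import deque


def run_muxed_pipeline(operands_per_cu, stages, op=lambda a, b: a + b):
    """operands_per_cu: one (a, b) pair per CU frontend, all waiting at cycle 0.

    Returns (results, cycles), where results[i] belongs to CU i.
    """
    num_cus = len(operands_per_cu)
    waiting = list(range(num_cus))      # CUs with operands not yet accepted
    pipe = deque([None] * stages)       # each slot: None or (cu_index, value)
    results = [None] * num_cus
    cycles = 0
    while any(r is None for r in results):
        # arbiter: accept operands from one and ONLY one CU this clock
        entry = None
        if waiting:
            idx = waiting.pop(0)
            a, b = operands_per_cu[idx]
            entry = (idx, op(a, b))     # the index travels with the data
        pipe.appendleft(entry)
        # whatever pops out the far end is re-associated with its CU by index
        done = pipe.pop()
        if done is not None:
            idx, value = done
            results[idx] = value
        cycles += 1
    return results, cycles


if __name__ == "__main__":
    ops = [(i, 10 * i) for i in range(8)]
    print(run_muxed_pipeline(ops, stages=5))
```

the thing to notice is that the pipeline itself knows nothing about
dependencies: it just carries the index along so each result finds its way
back to the CU that is tracking that instruction.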

everything - all three possible arrangements - is *specifically* allocated
so that there are AT THE VERY MINIMUM at LEAST three Computational Units
running and in DIFFERENT phases.

* one CU will be in op issue phase.
* one CU will be in src read phase.
* one CU will be in dest write phase.

by having these THREE CUs (aka RSes) active, one of them *will* become free
in any clock cycle: that free CU may therefore accept the current
instruction and therefore there will be no stalling.

therefore we achieve 1 IPC.
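
a quick way to convince yourself of this is a toy issue model.  the sketch
below is an assumption-laden simplification (hypothetical function name,
hazard-free instruction stream, at most one issue per clock, every CU tied
up for a fixed number of cycles), not a model of the real scoreboard.

```python
def simulate_issue(num_cus, occupancy=3, cycles=300):
    """Return achieved IPC when each issued instruction ties up one CU
    for `occupancy` cycles and at most one instruction issues per clock."""
    free_at = [0] * num_cus   # cycle at which each CU can accept a new op
    issued = 0
    for cycle in range(cycles):
        for cu in range(num_cus):
            if free_at[cu] <= cycle:
                free_at[cu] = cycle + occupancy
                issued += 1
                break         # single issue: only one instruction per clock
        # if no CU was free, issue stalls for this cycle
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, simulate_issue(n))
```

in this model one or two CUs leave the issue stage stalled part of the
time, while three CUs with a 3 cycle occupancy always have one coming free.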

that is of course assuming that there are no hazards.

if hazards exist those 3 CUs are inadequate because one of them will block,
waiting for the output from one other CU.

therefore we put lots more CUs down (at least twice as many).
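
to see why hazards eat CUs, here is a deliberately crude sketch.  all of it
is assumption: the function name is hypothetical, issue is in-order and
single issue, every dep_every-th instruction reads the result of the one
before it, and a dependent instruction cannot start its read phase until
that result has been written.  it only shows the direction of the effect,
not how many CUs a real design needs.

```python
def simulate_with_hazards(num_cus, cycles=300, dep_every=2):
    """Return achieved IPC for a stream with periodic read-after-write hazards."""
    free_at = [0] * num_cus
    prev_avail = 0   # cycle at which the previous instruction's result is usable
    issued = 0
    for cycle in range(cycles):
        cu = next((i for i in range(num_cus) if free_at[i] <= cycle), None)
        if cu is None:
            continue                      # every CU busy: issue stalls
        read = cycle + 1                  # read phase follows issue
        if issued % dep_every == dep_every - 1:
            read = max(read, prev_avail)  # blocked waiting on the producer
        prev_avail = read + 2             # write phase, then result is usable
        free_at[cu] = prev_avail          # CU is held for the whole wait
        issued += 1
    return issued / cycles


if __name__ == "__main__":
    for n in (3, 4, 6):
        print(n, simulate_with_hazards(n))
```

the blocked CU stays occupied while it waits, so the same three CUs that
were enough for a hazard-free stream now cause issue stalls.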

l.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68