[libre-riscv-dev] Fwd: store computation unit

Wed Jun 19 12:49:27 BST 2019

Resend as the new config upgrades may have caused this not to get sent,
can't see it in the archives.
L.

---------- Forwarded message ----------
From: *Luke Kenneth Casson Leighton* <lkcl at lkcl.net>
Date: Wednesday, June 19, 2019
Subject: Re: store computation unit
To: Mitchalsup <mitchalsup at aol.com>, Libre-RISCV General Development <
libre-riscv-dev at lists.libre-riscv.org>

On Wed, Jun 19, 2019 at 3:09 AM Mitchalsup <mitchalsup at aol.com> wrote:

> Luke,
>
> I have a clever Idea for you. I have been thinking about forwarding
> operands from the result stack at the end of
> calculation units back to operand stack at the front of calculation units.
> I have figured out a way that take almost
> no gates of logic and adds no delay to the scoreboard!
>

niiice.

> I am attaching a 1-operand FU of the same drawing style
> as you are already proficient; for your amusement.
>

appreciated.

>
> [image: 1-Operand.jpg]
>

took the liberty of putting it here:
https://libre-riscv.org/3d_gpu/forwarding-1-operand.jpg

> I will work on 2-Operand and maybe 3-Operand tomorrow.
> But notice how little logic is required!
>

so little i am having difficulty identifying where it is :)

>
> So the thought train is that:: one feeds Executable into the pickers and
> when picked the Readable and Forwardable
> signals tell you whether to read the RF or to forward from the result
> stack.
>

have been thinking about this on and off as well, and i believe you have
the piece of the puzzle that allows forwarding opportunities to be
identified.

my understanding is: one picker is needed per register file port,
whether it be a forwarding port or a read-or-write-to-regfile port.

thinking out loud: the two key differences i perceive (there may be more)
between pick-and-go-rd/write for register file read/write and forwarding
are:

(1) you need to identify the locations (and corresponding matching
registers) when one (or more!) FUs have read-pending, and *another* FU has
request_release pending.  if i am reading it correctly, i think this is the
subject of the diagram that you designed (side-note: this is an opportunity
for *broadcast* of the FU result to *multiple* recipient FUs)

(2) even if the FU result has been forwarded, it is critically important
not to drop that result (as would be done with a regfile write), because it
could still be necessary to write it to the regfile.

identifying the circumstances under which it is possible to drop the result
was the subject of that "nameless" discussion we had back in
november-december, and it involved detection of shadowing (any outstanding
potential cancellation opportunities, whether they are branch shadows,
exceptions, or WaWs) as well as detection of remaining read hazards.

for any FU that is in the "forwardable" condition (that has not yet been
given an opportunity to write to the regfile), it is safe to simply drop
the FU result on the floor *only* when both the last shadow *and* the last
read hazard on that FU have been dropped.
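
as a minimal sketch of the drop rule above (not taken from any actual
scoreboard code: the function and argument names are hypothetical, and the
outstanding shadows and read hazards are assumed to be plain bit-vectors
held in integers):

```python
def safe_to_drop(shadows: int, read_hazards: int) -> bool:
    """True only when every shadow bit AND every read-hazard bit is clear.

    shadows:      hypothetical bit-vector of outstanding cancellation
                  opportunities (branch shadows, exceptions, WaWs)
    read_hazards: hypothetical bit-vector of FUs that still need to read
                  this FU's result
    """
    return shadows == 0 and read_hazards == 0


# a forwarded result with one read hazard still outstanding must be kept
print(safe_to_drop(shadows=0b000, read_hazards=0b010))  # False
# last shadow and last read hazard both gone: the result may be dropped
print(safe_to_drop(shadows=0b000, read_hazards=0b000))  # True
```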

if forwarding is never added, the above situation (2) does not even enter
into the equation.  everything goes via the regfile as the arbitrator.

as for creating the multi-level priority picker: whilst i appreciate it is
a perfect candidate for a 1st level engineering exam question, i cannot
think of a way to create one without it being recursive in nature.

a single priority-picker is needed to protect a single resource (regfile
port, whether read or write)

the multi-level priority picker is needed to be able to allow *multiple*
ports, whether they be forwarding or actual-regfile.

* the "Highest Priority Level Picker", clearly, if activated, must *stop*
the 2nd level picker from selecting the exact same readable (or writable)
signal.
* the 2nd level priority picker, clearly, if activated, must stop the *3rd*
level priority picker from picking either the 1st *or* 2nd level picker
readable (or writable) signal
* and so on.

this to give multi-port reading and writing... *and operand forwarding
opportunities*.
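
the cascade described in the list above can be sketched in software as
follows.  this is only an illustration under stated assumptions: requests
are a bit-vector in an integer, bit 0 is assumed to be the highest
priority, and the function names are made up.  each level masks out
whatever the levels above it granted, which is the rippling (recursive in
nature) structure unrolled into a loop.

```python
def priority_pick(requests: int) -> int:
    """single priority picker: one-hot grant of the lowest set bit.

    bit 0 being highest priority is an assumption for this sketch.
    returns 0 when nothing is requesting.
    """
    return requests & -requests


def multi_level_pick(requests: int, levels: int) -> list:
    """one grant per port; level N never sees what levels 0..N-1 picked."""
    grants = []
    remaining = requests
    for _ in range(levels):
        grant = priority_pick(remaining)
        grants.append(grant)
        remaining &= ~grant  # stop lower levels picking the same signal
    return grants


# five FUs, three of them readable, three ports
print([bin(g) for g in multi_level_pick(0b10110, 3)])
# ['0b10', '0b100', '0b10000']
```

with more ports than requests, the spare levels simply return 0 (no
grant).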

the only thing is:

(3) whilst operand forwarding is a great opportunity for broadcasting of FU
result to pending CU input latches, a design that makes it necessary to
have *TWO* (or more) register reads ready before the forwarding (or regfile
reading) can proceed... this is a *lost opportunity*.

it would i feel be far better to have Go_Read_Operand_1, Go_Read_Operand_2
and to have corresponding completely separate Priority Pickers (more to the
point: multiple multi-level priority pickers) for each separate and
distinct operand.

then, two things happen which differ from the 6600:

(a) the regfile read bus (each read port) can become a *broadcast* bus,
with each operand (each Computation Unit latch) being independently ready
to receive operands if their read is pending.

(b) errr... i forgot :)  too focussed on writing (a).  something to do with
forwarding, which i probably covered above already.
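
a rough sketch of (a), assuming one read port per operand position and a
hypothetical table mapping each FU to the register that its operand latch
is still waiting for.  all names here are illustrative only.  one picked
read is delivered to *every* latch waiting on the same register, rather
than just to the FU that won the pick.

```python
def broadcast_read(pending_op: dict, regfile: dict, picked_fu: str) -> dict:
    """perform one regfile read for picked_fu and broadcast it.

    pending_op: hypothetical map of FU name -> register number that this
                operand position of the FU is still waiting to read
    regfile:    register number -> value
    returns a map of FU name -> latched value, and clears the read-pending
    entry of every FU that received the value.
    """
    reg = pending_op[picked_fu]
    value = regfile[reg]
    receivers = [fu for fu, r in pending_op.items() if r == reg]
    for fu in receivers:
        del pending_op[fu]  # this operand's read is no longer pending
    return {fu: value for fu in receivers}


# operand 1 of three FUs: two of them are waiting on r5, one on r7
pending_op1 = {"fu0": 5, "fu1": 5, "fu2": 7}
regs = {5: 42, 7: 9}
print(broadcast_read(pending_op1, regs, "fu0"))  # {'fu0': 42, 'fu1': 42}
print(pending_op1)                               # {'fu2': 7}
```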

all that having been said (as context), i believe you are onto something
with this design.  i had envisaged that, perhaps, it might be necessary to
identify the precise match of FUs that are in the "request_release" state,
and match those exactly with those that have a read pending.

this in a 2D (resource-eating) matrix.

the insight that you provide is that this is *not* necessary (however doing
so is a potential optimisation).  the global read pending vector is
perfectly sufficient to determine that there is *at least one* FU in a
"ready to receive operands" state, and that is enough to know that it would
be useful to forward the result from the FU.

the potential optimisation is in identifying *how many* other FUs are ready
to receive each result, and in prioritising for forwarding those results
with more FUs waiting over those with a lesser number.
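
to make the difference concrete, here is a small sketch under assumptions:
the global read pending vector is an integer bit-vector, and the 2D matrix
is modelled as a hypothetical map from producing FU to a bit-vector of the
FUs waiting on it.  the cheap test needs only the global vector; the
optimisation needs the per-producer rows.

```python
def popcount(bits: int) -> int:
    """number of set bits in a non-negative bit-vector."""
    return bin(bits).count("1")


def forwarding_worthwhile(read_pending: int) -> bool:
    """cheap test: is *at least one* FU ready to receive operands?"""
    return read_pending != 0


def pick_producer(waiters: dict):
    """optimisation: prefer the result with the most FUs waiting on it.

    waiters: hypothetical map of producer FU index -> bit-vector of FUs
             waiting for that producer's result (one row of the 2D matrix)
    ties go to the lowest producer index; returns None if nobody waits.
    """
    best = None
    for producer, row in sorted(waiters.items()):
        if row and (best is None or popcount(row) > popcount(waiters[best])):
            best = producer
    return best


rows = {0: 0b0100, 1: 0b1011, 2: 0b0000}
print(forwarding_worthwhile(0b1111))  # True
print(pick_producer(rows))            # 1 (three FUs are waiting on FU 1)
```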

this of course assuming that the FU Forwarding bus is, like the Tomasulo
Algorithm, more of a "Broadcast" Bus, hence the need to establish context,
above.

if the forwarding bus is not a broadcast bus, being instead a one-to-one
data route (like the regfile read is in the original 6600), then there is
no need - at all - to consider such a "are there more FUs pending this
result" optimisation because you have to spend multiple clock cycles
forwarding the FU's result individually to other FUs (CUs) ready to receive
it.
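
as a back-of-envelope illustration of that difference (assuming, purely
for the sake of the sketch, that a one-to-one route moves exactly one
result to one FU per clock cycle; the function name is made up):

```python
def forwarding_cycles(num_receivers: int, broadcast: bool) -> int:
    """clock cycles needed to deliver one FU result to all its receivers."""
    if num_receivers == 0:
        return 0
    # a broadcast bus reaches every waiting latch at once; a one-to-one
    # route has to be used once per receiving FU
    return 1 if broadcast else num_receivers


print(forwarding_cycles(3, broadcast=True))   # 1
print(forwarding_cycles(3, broadcast=False))  # 3
```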

l.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68