[libre-riscv-dev] [Bug 296] idea: cyclic buffer between FUs and register file

Fri May 1 19:54:22 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=296

--- Comment #6 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #2)
> Seems like a good idea, however out-of-order (and in-order) processors
> depend on single-cycle forwarding between the results of one operation and
> the input of the next for most of their performance, the forwarding network
> would need to keep that property (haven't fully thought through if this idea
> keeps that property).

luckily, in a Dependency-Matrix-based system, nothing is timing-dependent.
the DMs (the combination of the FU-Regs and FU-FU matrices) preserve a
Directed Acyclic Graph of the relationships between all in-flight operations,
based on the register numbers, *not* based on the time taken to completion
*of* each operation, in any way.  FSMs, pipelines, async FUs,
variable-pipelines, it's all the same to the 6600-DMs.
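
to make that concrete, here is a minimal sketch (not the actual libre-soc
code): the `Op` record, `build_deps` and `finish_times` are hypothetical names
made up for illustration, and only read-after-write hazards are modelled,
which is a simplification.  the dependency graph is built purely from register
numbers, so changing the latencies changes *when* things finish but never
*what* depends on what.

```python
from collections import namedtuple

# hypothetical record: name, destination register, source registers, latency
Op = namedtuple("Op", "name dest srcs latency")

def build_deps(ops):
    """Map each op index to the earlier op indices whose results it reads.

    Only register numbers are consulted; latency is never looked at.
    """
    last_writer = {}   # register number -> index of most recent op writing it
    deps = {}
    for i, op in enumerate(ops):
        deps[i] = {last_writer[r] for r in op.srcs if r in last_writer}
        last_writer[op.dest] = i
    return deps

def finish_times(ops, deps):
    """Cycle at which each op completes, if it starts once its producers finish."""
    finish = {}
    for i, op in enumerate(ops):
        start = max((finish[d] for d in deps[i]), default=0)
        finish[i] = start + op.latency
    return finish

# same instruction stream, two very different sets of latencies
slow = [Op("ld", 1, (2,), 20), Op("add", 3, (1, 4), 1), Op("mul", 5, (6, 7), 3)]
fast = [op._replace(latency=1) for op in slow]

print(build_deps(slow) == build_deps(fast))   # the graph is identical
print(finish_times(slow, build_deps(slow)))
print(finish_times(fast, build_deps(fast)))
```

the "add" waits for the "ld" in both cases, however long the "ld" takes, and
the independent "mul" is free to complete whenever it likes.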

therefore (the point of explaining that is:) it doesn't _actually_ matter if
the forwarding between result latches at the output side of the Function Units
is delayed even 5-20 cycles in reaching the input latches of the FUs that need
that result.

btw in the original 6600, regfile read was done on the leading edge, regfile
write was done on the falling edge.  this had the fascinating property that the
data coming in on a write could be available on the *same* clock cycle as a
read, effectively making the register file its own "Forwarding Bus".
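
a rough behavioural sketch of that property, assuming nothing about the real
edge timing: the two classes below are hypothetical and only model what a
reader observes in a given cycle.  in the "write-through" one, a value written
this cycle is visible to a read of the same register this cycle; in the other,
it only shows up one cycle later.

```python
class WriteThroughRegfile:
    """Same-cycle write is visible to a same-cycle read (6600-style property)."""

    def __init__(self, nregs):
        self.regs = [0] * nregs

    def cycle(self, write=None, read=None):
        # apply the write first, so the read sees the new data
        if write is not None:
            regnum, value = write
            self.regs[regnum] = value
        return self.regs[read] if read is not None else None


class LatchedRegfile:
    """For contrast: a read sees only what was written in earlier cycles."""

    def __init__(self, nregs):
        self.regs = [0] * nregs

    def cycle(self, write=None, read=None):
        # sample the read first, then commit the write
        result = self.regs[read] if read is not None else None
        if write is not None:
            regnum, value = write
            self.regs[regnum] = value
        return result


wt = WriteThroughRegfile(8)
la = LatchedRegfile(8)
print(wt.cycle(write=(3, 42), read=3))   # new value, same cycle
print(la.cycle(write=(3, 42), read=3))   # old value this cycle
print(la.cycle(read=3))                  # new value one cycle later
```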

anyway: unlike the 6600, we have a bit of a problem in that we cannot have
massive register file porting.  the 6600's "A" regfile had *FIVE* read ports
and *TWO* write ports!

we are going to be under-ported because it's not really practical to design a
multi-ported (multi-write) SRAM in the time that we have (and its gate count
and power consumption are pretty high).  so, therefore, there will be quite a
few times where data is generated faster than it can be written to the regfile.
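
a back-of-envelope sketch of that backlog, under made-up numbers: `max_backlog`
is a hypothetical helper, and the per-cycle result counts are purely
illustrative, not measurements of any real workload.

```python
def max_backlog(results_per_cycle, write_ports=1):
    """Largest number of results left waiting for a regfile write port.

    results_per_cycle: how many results the FUs produce in each cycle.
    write_ports: how many results the regfile can accept per cycle.
    """
    pending = 0
    worst = 0
    for produced in results_per_cycle:
        pending += produced
        pending -= min(write_ports, pending)   # retire what the ports allow
        worst = max(worst, pending)
    return worst


burst = [2, 2, 1, 0, 0]          # illustrative burst of results
print(max_backlog(burst, write_ports=1))
print(max_backlog(burst, write_ports=2))
```

with 1W, results pile up behind the write port during the burst; anything
waiting on them is stalled unless they can be forwarded some other way.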

therefore, operand-forwarding - even if it's a couple of cycles delayed - is
actually going to be really important.

in LOAD-STORE Computation Units, the LOAD in UPDATE mode can generate *two*
register writes.  fortunately, the ADDR-GEN side may be available one clock
cycle *before* the actual LOAD result.  however we need to plan for worst-case:
that the two are available simultaneously.

what we do *not* want to happen is that, just because there is only 1W on the
regfile, the two writes from LOAD-UPDATE mode cause a bottleneck through the
regfile when trying to get either of those two writes to the FunctionUnits
waiting for them.

therefore, if the cyclic buffer can forward at least one of those results
into the *read* cyclic buffer, then:

* one of the writes will go through the regfile and (if we design it
  right, a la 6600 regfile) *back* out through one of the read ports
  *on the same clock cycle*

* the other write will go through the end of the write cyclic buffer into
  the beginning of the read cyclic buffer, and if the Function Unit that
  needs that register happens to be RD1 (REQ_RD1 is asserted) it *will*
  get that result on the same clock cycle.

  if not, it will get it on the next clock (when shift-passed to RD2).
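
a toy model of that second path, to show the same-cycle vs next-cycle
behaviour.  everything here is an assumption for illustration:
`ReadCyclicBuffer`, `inject`, `shift` and `cycles_until_delivery` are
hypothetical names, slot 0 stands for RD1 and slot 1 for RD2, and the real
design may well differ.

```python
class ReadCyclicBuffer:
    """Ring of slots, one per read port; entries rotate one slot per clock."""

    def __init__(self, nports):
        self.slots = [None] * nports     # slots[0] is RD1, slots[1] is RD2, ...

    def inject(self, regnum, value):
        """Forwarded result enters at the beginning of the buffer (RD1)."""
        self.slots[0] = (regnum, value)

    def shift(self):
        """One clock: every entry moves along by one (RD1 -> RD2 -> ... -> RD1)."""
        self.slots = [self.slots[-1]] + self.slots[:-1]


def cycles_until_delivery(buf, regnum, port, max_cycles=8):
    """Clocks before an FU listening on `port` (0 = RD1) sees `regnum`."""
    for n in range(max_cycles):
        entry = buf.slots[port]
        if entry is not None and entry[0] == regnum:
            return n
        buf.shift()
    return None                          # never arrived within max_cycles


for port in (0, 1):
    buf = ReadCyclicBuffer(nports=3)
    buf.inject(regnum=5, value=0xBEEF)   # result coming off the write buffer
    print("RD%d" % (port + 1), cycles_until_delivery(buf, 5, port))
```

an FU with its request on RD1 picks the value up with zero delay, and one on
RD2 gets it one clock later, once the entry has been shift-passed along.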
