[libre-riscv-dev] load/store execution queue idea

Fri May 1 12:24:24 BST 2020

On Fri, May 1, 2020 at 3:54 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> I filled out some notes on my load/store execution queue idea here:
> https://libre-soc.org/3d_gpu/architecture/alternative-design-idea/

excellent.

> The design should be suitable for the final 28nm SoC and should be
> able to execute 4 loads or 4 stores or 4 AMOs or 4 fences per cycle,
> completely adjustable to some other number if we desire. This totally
> replaces the memory dependency matrix.

hmm.   the MDM is specifically designed to identify separate and
distinct LD-batches from ST-batches from Atomic-batches, in an
extremely gate-efficient fashion.  we don't need TSO, we just need to
preserve the order of batches (and also identify overlaps).

> One downside is it doesn't
> support forwarding from stores to later loads without going through
> the L1 cache.

that's ok.

ok, a critical aspect to understand about Computational Units: they do
*not* proceed until the "purpose" that they manage is entirely finished.
they then *notify* the Dependency Matrices that the "result" is done
(available), and *then* they drop their "busy" flag, ready for re-use for
another operation.

this is the total opposite of a pipeline or queue, where you chuck data
at it, abdicate responsibility for its management, and at some arbitrary
point in the future, the result comes out, but by that time everything
has moved on.

an out-of-order scheduling system *cannot* - in any way - permit that
to happen.  the results would be catastrophic.

therefore in the ALU Computational Units which manage a pipeline
(or pipelines), we *must* have equal or greated CompUnits than
there is pipeline depth.

* greater than that is ok: instructions will be able to run ahead.
  the DESIRE to run instructions (and the order as a DAG) is
  recorded, however their ACTUAL execution must wait until
  there is a free pipeline slot.
* less than that is NOT really ok (a waste of silicon) because there
  is not enough to track the progress of each and every operation
  going through each and every pipeline stage.

when it comes to managing Memory, the implications are as follows:

* to have more LD/ST Computation Units than there are "Cells"
  is likewise ok.  instructions will be able to run ahead.
* to have LESS LD/ST CUs than there are "Cells" is *NOT* ok,
  because the ComputationUnits will NEVERRRRR issue a
  LD or ST until it knows - absolutely and without any shadow of
  doubt - that the operation it monitors has 100% absolute guaranteed
  completed.

so that's really crucial information that has implications for the
design of the LD/ST Buffer, important as context, below.

notes:

1) the cell arrangement still needs an 8-input (8 LD/ST Computation Units)
to 4-output (head of each "column" of each queue) multiplexer.
unless i've misunderstood, that alone is a killer.

dropping each of the 8-input queue entries into the first 8 cells *then*
performing shifting, *that* is okay.  but allowing each and every single
one of those 8 LD/ST CompUnits to write to cells 0-4? the wiring alone
is insane.

assume 128 bits per operation (48 address, 64 data, some control lines
let's round it up to 128 bits): with 128-wide data paths times 8 LD/ST
Comp Units times *4* to be able to arbitrarily write to any one of
4 "top row" cells?  that's a MASSIVE 4096 wires!

we can't possibly have that many wires.

if you meant that the queue width should match the number of LD/ST
Units - 8-wide because we need 8 LD/ST CompUnits (we might be
able to get away with only 6) - then the remaining entries will actually
remain permanently empty, because there *has* to be a direct
one-to-one match between LD/ST CompUnits and the LD/ST that
it "manages".

so there would only be 8 LD/STs, and a row depth of 1.

if however we do 2-ports per CompUnit (to cover misaligned), then there
would be an 8-wide by 2-row arrangement of Cells.

with only a Cell Depth of 1 or 2, the purpose *of* the Cells is defeated
entirely.  there's no need for shifting: the entries might as well stay
where they are.

this particularly because even for the "misalignment" case, the LD/ST
CompUnit is *NEVERRR* going to place anything other than the
2 requests it is responsible for into the queue:

* first one covering the lower part of the misaligned request
* second one covering the upper part.

no matter how many columns (assuming that there is one
per LDSTCU) all other rows *will* remain empty.

and if the columns are halved (4 wide to cover 8 LD/ST CUs)
now we have a much more serious problem: the total data bandwidth
is shrunk to only 4x 64-bit operations *and* there is a multiplexing
problem actually getting data into the first 2 rows (actually, now,
4 rows if there are 2 ports per LD/ST CU, one for lower-misaligned
one for upper-misaligned).

do we stall the requests if they do not have access?  that means
that opportunities to keep the L1 cache-line requests 100% full
are missed, where the current design can *even catch misaligned
element-strided data requests* and can 100% merge them into
a full cache-line request.

"and the L1 cache line address it would use is broadcast to all cells."

this is a full XOR-comparison of the entire address, against every single cell.

if the address is 48 bits and there are 12 cells, using bits 3 and above of
the address will require 528 XOR gates.

if we want *two* L1 cache line addresses simultaneously (rather than
going to 256-bit-wide L1 cache lines), that's *two* broadcasts, each 528
XOR gates.

2) how are overlapping addresses detected?  or, more to the point,
how are non-overlapping addresses detected such that they can be
passed through as a group, and given the opportunity to be done in
parallel?

3) how would misaligned addresses handled?

4) i'm not seeing how requests could be merged into "100% full cache
line" requests, which the current design is capable of doing, by using
simple bit-wise ANDing and ORing.  the Abstract Prefix Sum class
doesn't look obvious as to how that happens, and is likely to be
much more costly than a simple AND and OR network.

5) how are clashes of addresses *inter-row* detected?  if Cell 0 has a ST
of 8-bits to address 0x0001 and Cell 1 has a ST of 8 bits to address
0x0002 (a 7 byte overlap), how is this detected *in the row*?

in the column it is fine, because only one row is ever done at once
(which has implications for mis-aligned data requests, there will never
be a 100% merge opportunity)

all of these things took several *months* to factor in to the design, in
several stages.  it's massively complex and inter-dependent.

l.