[libre-riscv-dev] load/store execution queue idea

Wed May 6 01:23:59 BST 2020

Will respond to all these concerns later, once there is more free time.

On Fri, May 1, 2020 at 4:25 AM Luke Kenneth Casson Leighton
<lkcl at lkcl.net> wrote:
>
> On Fri, May 1, 2020 at 3:54 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> > I filled out some notes on my load/store execution queue idea here:
> > https://libre-soc.org/3d_gpu/architecture/alternative-design-idea/
>
> excellent.
>
> > The design should be suitable for the final 28nm SoC and should be
> > able to execute 4 loads or 4 stores or 4 AMOs or 4 fences per cycle,
> > completely adjustable to some other number if we desire. This totally
> > replaces the memory dependency matrix.
>
> hmm.   the MDM is specifically designed to identify separate and
> distinct LD-batches from ST-batches from Atomic-batches, in an
> extremely gate-efficient fashion.  we don't need TSO, we just need to
> preserve the order of batches (and also identify overlaps).
>
>
> > One downside is it doesn't
> > support forwarding from stores to later loads without going through
> > the L1 cache.
>
> that's ok.
>
> ok, a critical aspect to understand about Computational Units: they do
> *not* proceed until the "purpose" that they manage is entirely finished.
> they then *notify* the Dependency Matrices that the "result" is done
> (available), and *then* they drop their "busy" flag, ready for re-use for
> another operation.
>
> this is the total opposite of a pipeline or queue, where you chuck data
> at it, abdicate responsibility for its management, and at some arbitrary
> point in the future, the result comes out, but by that time everything
> has moved on.
>
> an out-of-order scheduling system *cannot* - in any way - permit that
> to happen.  the results would be catastrophic.
>
> therefore in the ALU Computational Units which manage a pipeline
> (or pipelines), we *must* have equal or greated CompUnits than
> there is pipeline depth.
>
> * greater than that is ok: instructions will be able to run ahead.
>   the DESIRE to run instructions (and the order as a DAG) is
>   recorded, however their ACTUAL execution must wait until
>   there is a free pipeline slot.
> * less than that is NOT really ok (a waste of silicon) because there
>   is not enough to track the progress of each and every operation
>   going through each and every pipeline stage.
>
> when it comes to managing Memory, the implications are as follows:
>
> * to have more LD/ST Computation Units than there are "Cells"
>   is likewise ok.  instructions will be able to run ahead.
> * to have LESS LD/ST CUs than there are "Cells" is *NOT* ok,
>   because the ComputationUnits will NEVERRRRR issue a
>   LD or ST until it knows - absolutely and without any shadow of
>   doubt - that the operation it monitors has 100% absolute guaranteed
>   completed.
>
> so that's really crucial information that has implications for the
> design of the LD/ST Buffer, important as context, below.
>
>
> notes:
>
> 1) the cell arrangement still needs an 8-input (8 LD/ST Computation Units)
> to 4-output (head of each "column" of each queue) multiplexer.
> unless i've misunderstood, that alone is a killer.
>
> dropping each of the 8-input queue entries into the first 8 cells *then*
> performing shifting, *that* is okay.  but allowing each and every single
> one of those 8 LD/ST CompUnits to write to cells 0-4? the wiring alone
> is insane.
>
> assume 128 bits per operation (48 address, 64 data, some control lines
> let's round it up to 128 bits): with 128-wide data paths times 8 LD/ST
> Comp Units times *4* to be able to arbitrarily write to any one of
> 4 "top row" cells?  that's a MASSIVE 4096 wires!
>
> we can't possibly have that many wires.
>
> if you meant that the queue width should match the number of LD/ST
> Units - 8-wide because we need 8 LD/ST CompUnits (we might be
> able to get away with only 6) - then the remaining entries will actually
> remain permanently empty, because there *has* to be a direct
> one-to-one match between LD/ST CompUnits and the LD/ST that
> it "manages".
>
> so there would only be 8 LD/STs, and a row depth of 1.
>
> if however we do 2-ports per CompUnit (to cover misaligned), then there
> would be an 8-wide by 2-row arrangement of Cells.
>
> with only a Cell Depth of 1 or 2, the purpose *of* the Cells is defeated
> entirely.  there's no need for shifting: the entries might as well stay
> where they are.
>
> this particularly because even for the "misalignment" case, the LD/ST
> CompUnit is *NEVERRR* going to place anything other than the
> 2 requests it is responsible for into the queue:
>
> * first one covering the lower part of the misaligned request
> * second one covering the upper part.
>
> no matter how many columns (assuming that there is one
> per LDSTCU) all other rows *will* remain empty.
>
> and if the columns are halved (4 wide to cover 8 LD/ST CUs)
> now we have a much more serious problem: the total data bandwidth
> is shrunk to only 4x 64-bit operations *and* there is a multiplexing
> problem actually getting data into the first 2 rows (actually, now,
> 4 rows if there are 2 ports per LD/ST CU, one for lower-misaligned
> one for upper-misaligned).
>
> do we stall the requests if they do not have access?  that means
> that opportunities to keep the L1 cache-line requests 100% full
> are missed, where the current design can *even catch misaligned
> element-strided data requests* and can 100% merge them into
> a full cache-line request.
>
>
>
> "and the L1 cache line address it would use is broadcast to all cells."
>
> this is a full XOR-comparison of the entire address, against every single cell.
>
> if the address is 48 bits and there are 12 cells, using bits 3 and above of
> the address will require 528 XOR gates.
>
> if we want *two* L1 cache line addresses simultaneously (rather than
> going to 256-bit-wide L1 cache lines), that's *two* broadcasts, each 528
> XOR gates.
>
>
> 2) how are overlapping addresses detected?  or, more to the point,
> how are non-overlapping addresses detected such that they can be
> passed through as a group, and given the opportunity to be done in
> parallel?
>
> 3) how would misaligned addresses handled?
>
> 4) i'm not seeing how requests could be merged into "100% full cache
> line" requests, which the current design is capable of doing, by using
> simple bit-wise ANDing and ORing.  the Abstract Prefix Sum class
> doesn't look obvious as to how that happens, and is likely to be
> much more costly than a simple AND and OR network.
>
> 5) how are clashes of addresses *inter-row* detected?  if Cell 0 has a ST
> of 8-bits to address 0x0001 and Cell 1 has a ST of 8 bits to address
> 0x0002 (a 7 byte overlap), how is this detected *in the row*?
>
> in the column it is fine, because only one row is ever done at once
> (which has implications for mis-aligned data requests, there will never
> be a 100% merge opportunity)
>
>
> all of these things took several *months* to factor in to the design, in
> several stages.  it's massively complex and inter-dependent.
>
> l.
>
> _______________________________________________
> libre-riscv-dev mailing list
> libre-riscv-dev at lists.libre-riscv.org
> http://lists.libre-riscv.org/mailman/listinfo/libre-riscv-dev