[libre-riscv-dev] memory interface diagram woes

Thu Apr 23 11:29:22 BST 2020

btw a word to the wise, jacob: the current code, even though not complete,
is ridiculously comprehensive and tied directly into the Dependency Matrix
system.

* LDSTs have regs therefore they need standard FU-Regs and FU-FU presence.

* FU-Regs takes care of the Reg RaW and WaR dependencies

* FU-FU takes care (in the case of LD) of the *order* of which FU is
waiting for which result, it is like a linked list using unary bits and
consequently is a square matrix

on *top* of that we have an additional problem to solve:

* preserving the order of memory accesses (if they matter)

* ensuring that overlaps result in order preservation but that, when there
are no overlaps, we at least respect Weak Memory Order (LDs separate from STs)

on top of *that* we have an *additional* requirement to merge multi-issued
LD/STs that can fit into a single cache line hit, because of the enormous
bandwidth requirements of 3D Vectors.

it is ridiculous but necessary that even in a 180nm single core we need a
staggering 256 bit wide data path to the L1 cache and that this width will,
when we go to faster speeds, also be needed to the L2 cache as well.

the overlapping address conflict detection and Weak Order Preservation are
achieved with an addressmatch Matrix, done last year thanks to Mitch Alsup.

however Mitch's scheme works by comparing bits 4 thru 11 or 5 thru 11
(avoiding the VM bits and also avoiding bits 0 to 3) and ending up with
0.5% additional overzealous hits which cause *no significant performance
degradation* in real world usage.
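
as a rough illustration of that partial comparison (a python sketch only,
not the actual hardware or nmigen code; the function name and default bit
range are assumptions made up for this example):

```python
def may_conflict(addr_a, addr_b, lo=4, hi=11):
    """Conservative address match: compare only bits lo..hi (inclusive).

    Returns True if the two addresses *might* overlap. Bits below `lo`
    and above `hi` are ignored, so some matches are false positives
    (the "overzealous hits"), but a real overlap within an aligned
    16-byte block is never missed.
    """
    mask = ((1 << (hi - lo + 1)) - 1) << lo
    return (addr_a & mask) == (addr_b & mask)


# same 16-byte block: flagged as a possible conflict
print(may_conflict(0x1230, 0x1238))
# different bits 4..11: no conflict
print(may_conflict(0x1230, 0x1240))
# differ only above bit 11: an overzealous (false positive) hit
print(may_conflict(0x1230, 0x5230))
```

the third case is the kind of false positive that only costs a little
unnecessary ordering, never correctness.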

but it was *not* possible to use his scheme, as-is, for our needs, because
comparing only bits 4 thru 11 of the address misses opportunities to merge
byte level accesses.

i therefore expanded the LDST LENGTH and bits 0 thru 3 into a 24 bit
bytemask / bitmap, then split that into 2 requests.

why 2 requests? because of misaligned LDST: the 1st 16 bits of the bytemask
cover 1 cache line and the remaining 8 bits the 2nd cache line.
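
the expansion and split can be sketched like this (python, purely
illustrative; `expand_bytemask` is a hypothetical name, and the restriction
to lengths of 1, 2, 4 or 8 bytes is an assumption of the example):

```python
def expand_bytemask(addr, length):
    """Expand (addr bits 0..3, length) into a 24-bit byte-enable mask,
    then split it into the part for the first 16-byte cache line and
    the part that spills into the next one.

    Returns (mask_lo, mask_hi): a 16-bit mask and an 8-bit mask.
    """
    assert length in (1, 2, 4, 8)
    offset = addr & 0xF                       # address bits 0..3
    mask24 = ((1 << length) - 1) << offset    # `length` ones, shifted up
    mask_lo = mask24 & 0xFFFF                 # bytes in the 1st cache line
    mask_hi = (mask24 >> 16) & 0xFF           # bytes in the 2nd cache line
    return mask_lo, mask_hi


# aligned-enough 4-byte access at offset 3: no spill
print([hex(m) for m in expand_bytemask(0x1003, 4)])
# misaligned 4-byte access at offset 14: 2 bytes in each line
print([hex(m) for m in expand_bytemask(0x100E, 4)])
```

a zero `mask_hi` means the second request is simply not needed.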

with "Weak" memory rules in place, and with the other job of the
addressmatch matrix taking care of separating LDs and STs into mutually
exclusive batches, splitting into multiple operations is perfectly "fine".

now with these bytemasks available, and the remaining bits of the binary
form of the address (bits 4 thru 48) being aligned to a cache line
boundary, all we need do is identify which requests have the exact same
upper bits and we can merge the bytemasks for all of those and thus make
only the one single L1 cache line request for all of them.
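
a software model of that merge (a sketch only: real hardware would do the
upper-bit comparison in parallel rather than in a loop, and the
`merge_requests` name and the (line address, mask) tuple layout are
assumptions of this example):

```python
from collections import defaultdict


def merge_requests(requests):
    """Merge byte-enable masks of requests that hit the same cache line.

    `requests` is an iterable of (line_addr, bytemask16) pairs, where
    line_addr is the address with bits 0..3 stripped (addr >> 4).
    Returns a dict mapping each line_addr to the OR of its masks, i.e.
    one L1 cache line request per distinct line.
    """
    merged = defaultdict(int)
    for line_addr, mask in requests:
        merged[line_addr] |= mask
    return dict(merged)


reqs = [(0x100, 0x000F), (0x100, 0x00F0), (0x101, 0x0003)]
for line, mask in merge_requests(reqs).items():
    print(hex(line), hex(mask))
```

here three requests collapse into two cache line accesses, the first of
which carries the combined mask of both requests to line 0x100.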

this is the task of the L0 Cache/Buffer and as you can see it very closely
and very specifically depends on the criteria being satisfied by earlier
components.

the datapath requirements are also absolutely MENTAL and we really,
*really* cannot have 8 or 16 way to 4 or 8 way multiplexers, it is just too
much as it will result in literally thousands of wires because we have 48
bit addresses, 64 bit data, and additional control signals.

assume 128 bits per LDST Function Unit, if we have 8 LDST FUs, each with 2
ports (one for aligned, one for misaligned), that is 128 x 2 x 8 bits wide
i.e. 2048 bits datapath!

now imagine we design an algorithm that needs a 4 or 8 way multiplexer on
that, to get 4 or 8 simultaneous requests into even just an 8 entry
"Queue", implemented as a standard 4R4W (or 8R8W) Memory or registers, and
that is a staggering 8192 or 16384 wires through multiplexers!
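
the arithmetic behind those numbers, as a tiny python check (function and
parameter names are invented for this example; the default values are the
ones assumed in the text above):

```python
def datapath_bits(bits_per_port=128, ports_per_fu=2, num_fus=8):
    """Total width of all LDST FU ports side by side."""
    return bits_per_port * ports_per_fu * num_fus


def mux_wires(ways, **kwargs):
    """Wires passing through an N-way multiplexer on that datapath."""
    return ways * datapath_bits(**kwargs)


print(datapath_bits())   # total datapath width in bits
print(mux_wires(4))      # 4-way multiplexing
print(mux_wires(8))      # 8-way multiplexing
```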

that is completely mental and there is no way we can do that, the layout
would be far too complex even if it was practical.

this is just one of the things that we have to be extremely careful about
and it is why i am taking time to draw out these diagrams.

so the current proposed L0 Cache/Buffer has *independent* ports on each
row, for use by each Function Unit.

and onto the L1 side the connection is 1R *or* 1W at 128 bit (cache line)
width.

yesterday i made some corrections that merged the pair of L0 Cache/Buffers
into a single dual ported one.  i explain why on the 6600 scoreboard page.

thus each FU produces up to 2 requests, one goes left, the other goes
right, into one of two "lanes", mutually exclusively separated by address
bit 4.

there are therefore *two* 128 bit wide ports to *two* separate L1 caches,
one for addr[4] = 0 and the other for = 1 and they interleave alternate
cache lines nicely.
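
to illustrate why the two halves of a misaligned access always land in
*different* lanes, here is a small python model (a sketch under assumed
names: `route` and its tuple layout are made up for this example and are
not taken from the actual design):

```python
def route(addr, length):
    """Split one LD/ST into per-lane cache line requests.

    Returns a list of (lane, line_addr, bytemask) tuples, where lane is
    bit 4 of the byte address (bit 0 of the cache line address). An
    access that spills over a 16-byte boundary produces a second request
    for the next cache line, which by construction is in the other lane.
    """
    offset = addr & 0xF
    mask24 = ((1 << length) - 1) << offset
    line = addr >> 4
    out = [(line & 1, line, mask24 & 0xFFFF)]
    spill = mask24 >> 16
    if spill:
        out.append(((line + 1) & 1, line + 1, spill))
    return out


# misaligned 4-byte access straddling two cache lines
for lane, line, mask in route(0x100E, 4):
    print(lane, hex(line), hex(mask))
```

consecutive cache lines differ in addr[4], so the spill request never
competes with the first request for the same L1 port.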

the number of ways is a standard L1 cache decision as is everything else
about the L1 cache.

so please do take into account, Jacob, that the data path sizes are
completely mental, and that multi way multiplexing is not an option.

the multiplexing for the design that i did is only a 2x2 crossbar for each
FU, with 192 bit datapaths on each (might be able to get that down to 128
or slightly less).  that is *all* because on the other side, after the
bytemask merge, it is straight 1R or 1W to the L1 cache.

l.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68