[libre-riscv-dev] Fwd: multi-way LOAD/STORE buffers and misalignment

Tue Mar 24 16:05:16 GMT 2020

staf, hi, this came in from mitch alsup, from comp.arch.

the L1 cache memory SRAM that you are doing: is it, as Mitch suggests,
1R *or* 1W, and thus single-clock-cycle?  i gather from what he's
saying that if you have a 1R1W SRAM, that effectively needs 2 clock
cycles to access.

l.

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

---------- Forwarded message ---------
From: 'MitchAlsup' via comp.arch <comp.arch at googlegroups.com>
Date: Tue, Mar 24, 2020 at 3:27 PM
Subject: Re: multi-way LOAD/STORE buffers and misalignment
To:

On Tuesday, March 24, 2020 at 6:53:29 AM UTC-5, lkcl wrote:
> we're doing a LD/ST buffer at the moment, and some design review and
> constructive feedback and advice would be greatly appreciated.
> unlike many "first time processors", we cannot possibly get away
> with a simple single-issue LD/ST design as the first iteration because
> of the vector engine being effectively a multi-issue multiplier that,
> from a single instruction, hammers the internal engine with a *batch*
> of actual operations.
>
> this includes LD/ST and so consequently we need a multi-issue-capable
> LD/ST buffer right from the start.  it is being documented, here:
>
> https://libre-riscv.org/3d_gpu/architecture/6600scoreboard/
>
> with Mitch's help i already have a RAW / WAR dependency detection matrix
> in place, which detects (in-order) *batches* of LDs, *batches* of STs,
> and *batches* of atomic operations (defined by having LD/ST both set).
>
> additionally, we have (again, thanks to Mitch), an "address clash detection"
> Matrix, where the original design concept from Mitch uses bits 4 thru 11
> (stopping at a VM page) to say "this LD/ST definitely doesn't overlap with
> any others, but it might have picked up a few that *potentially* don't
> overlap, but hey better safe than sorry".
>
> this is quite efficient because the number of bits used to detect a clash
> is drastically reduced, which is important given that it's an NxN/2 set of
> binary-address comparators involved.
>
> however we *modified* this idea, by turning the addr+len (len=1/2/4/8 byte)
> into a *byte-mask*, where the byte-mask is effectively nothing more than
> (or, "can be used as") byte-enable lines on the L1 cache memory read/write.
>
> this means turning len=1 into 0b1, len=2 into 0b11, len=4 into 0b1111
> or len=8 into 0b1111 1111 and then shifting it by the LSBs of the address
> (bits 0-3).
>
> however of course, that could "wrap over" to a second cache line, therefore
> we split it into two:
>
> * 1st cache line, take bits 0..15 of the mask, use addr[4:] as the address
> * 1nd cache line, take bits 15..27 of the mask, use addr[4:] PLUS ONE as address
>
> now all LDs are potentially *two* LDs, and all STs are potentially *two* STs,
> requiring synchronisation.
>
> this does interestingly mean that the "address clash detection" Matrix is
> now doubled in size, because each LD/ST Computation Unit can now issue *two*
> LDs/STs, not one.
>
> there is a big advantage to partial-conversion of the LD/STs to unary-masked
> format:
>
> 1) those "masks" we can now detect in the address-clash unit, whether there
>    are overlaps, by simple "unary" (ANDing).
>
> 2) once a batch of guaranteed-non-overlapping LDs (or STs, but not both at
>    once) have been identified, the ORing of those masks *pre-identifies*,
>    where the remaining MSBs are identical, which LDs and which STs are
>    *part of the same cache line*.
>
> 3) the ORed masks can be dropped - as-is - directly onto byte-level
>    read/write-enable lines on the L1 Cache underlying memory - with zero
>    shifting or other messing about.
>
> so that's the bits that we "know".
>
> the bits that we *don't* know are what is the best L1 cache arrangement
> to use.  we discussed the possibility of an 8-way set-associative cache
> with 512-bit wide data (!!!) - yes, really, because we will have a minimum
> of 4 LD/ST operations @ 64-bit occurring every cycle (in the first version
> of the processor: this *will* go up in later iterations).
>
> one thing that would be particularly nice to avoid would be multi-ported
> SRAM on the L1 Cache Memories.  i thought there it may be possible to
> "stripe" the L1 Cache, such that bits 4-5 of every address go through to,
> effectively, a completely separate (isolated, independent) L1 Cache.
>
> * Addr bits [0:3] already turned into byte-map mask (enable lines)
> * Addr bits [4:5] used to select one of four INDEPENDENT caches
> * Addr bits [6:end] used "as normal"
>
> the only downside of this being: LD/STs that happen to be on the same
> 64-byte boundary would be "penalised" by being only single-issue-capable.
>
> do we care?  i'm not sure that we do!

Yes, you do want to care:: you want to care to "get it correct" even
at some minor cost in perf.
>
> _should_ we care?  that's the question i'd like to know the answer to :)
> an alternative is that we have to multi-port the L1 Cache memory, which
> would be quite expensive.
>
> what alternatives are there?  is there a way to, for example, have an
> 8-way set-associative cache with 8 independent input/output data paths,
> yet they do not require the trick of a full split using addr bits [4:5],
> but at the same time do not require 4R4W SRAMs?

What you want to know are the dimensions of the SRAM macros. Given these
dimensions you are in a position to start making good choices, without
you are not. A typical SRAM dimension might be 256-bits wide and 64-words
deap (2KBytes).

Since you want to read/write 512-bits per port, you will need 2 SRAM
macros per port (per cycle). So if you are making 8-way SA cache, you
will need at least 16 SRAM macros (32KBytes)--and (A N D) you still
have but a single port!

On the other hand, dropping back to 4-way set, and that same SRAM array
can now support 2-ports of 4-way SA each. In this organization you will
be using more lower-address bits to choose port[0] versus port[1]. And
you will see 40% conflicts if you are really running 2 request per cycle.

However, if you know the access pattern is "dense", then you know that
LD[i][k..k+3] are all in the same line, and you can THEN make one access
for every 4 actual LDs an then sort it out with multiplexers. And NOW we
have enough ports and width to use 1024 bits per cycle throughput in the
cache. {You can determine an access pattern will be dense when the loop
increment is +1 (or -1) and the loop index is used in scaled form in
memref instructions:

     LD   Rx,[Rb+Ri<<scale+DISP]

} Other forms of dense access are harder for HW to detect and utilize.
>
> i *think* this is doable, with the potential for some wait cycles if
> one of the 8 ways happens to have request contention.  honestly i don't know
> what's best here.
>
> advice and discussion appreciated as always,
>
> l.