[libre-riscv-dev] [Bug 216] LOAD STORE buffer needed

bugzilla-daemon at libre-riscv.org bugzilla-daemon at libre-riscv.org
Wed Mar 11 22:54:37 GMT 2020


http://bugs.libre-riscv.org/show_bug.cgi?id=216

--- Comment #1 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #5)
> (In reply to Luke Kenneth Casson Leighton)
> > whew.
> > 
> > so that's 128-bit-wide for _textures_... that's on the *load* side.  are
> > there any simultaneous (overlapping) "store" requirements? are the
> > code-loops tight enough to require simultaneous 128-bit LD *and* 128-bit ST?
> 
> yes and no -- there is code that will benefit from simultaneous loads and
> stores (memcpy and probably most other code that has both loads and stores
> in a loop), however it isn't strictly necessary.

ok.

> It will be highly beneficial to support multiple 8, 16, 32, or
> 64-bit loads to a single cache line all being able to complete
> simultaneously, independently of alignment in that cache line.

yes i have been thinking about this for a day since getting the LDSTCompUnit
operational.  it currently uses a "fake" memory block (for testing) and the
next step is to look at this.
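to make the idea concrete, something like this, as a purely illustrative
python sketch (the 64-byte line and the (offset, width) pairs are my own
assumptions, nothing to do with the actual LDSTCompUnit code): each
narrow load becomes a byte-enable mask into the line, and since reads
never conflict, all of them can complete in the same line access:

CACHE_LINE_BYTES = 64

def byte_mask(offset, width_bytes):
    """bitmask of the bytes within the line that this load touches."""
    return ((1 << width_bytes) - 1) << offset

def extract(line_data, offset, width_bytes):
    """pull the requested bytes out of the full cache-line data."""
    return (line_data >> (offset * 8)) & ((1 << (width_bytes * 8)) - 1)

# three loads of different widths and alignments, all to the same line
loads = [(0, 8), (13, 2), (33, 4)]   # (byte offset, width in bytes)
line = int.from_bytes(bytes(range(CACHE_LINE_BYTES)), "little")

# reads never conflict, so all three complete in this one cycle
for off, w in loads:
    print(hex(byte_mask(off, w)), hex(extract(line, off, w)))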

> Also
> misaligned loads that cross cache lines (and possibly page boundaries),
> though those don't need to complete in a single cache access.

ok

> All the above also applies to stores, though they can be a little slower
> since they are less common.
> 
> I realize that that will require a really big realignment network, however
> the performance advantages I think are worth it.

yes.  high performance computation is no good if we cannot get the data
in and out just as fast.

> For a scheduling algorithm for loads that are ready to run (6600-style
> scheduler sent to load/store unit for execution, no conflicting stores
> in front, no memory fences in front), we can have a queue of memory ops and
> each cycle we pick the load at the head of the queue and then search from
> the head to tail for additional loads that target the same cache line
> stopping at the first memory fence, conflicting store, etc. Once those loads
> are selected, they are removed from the queue (probably by marking them as
> removed) and sent thru the execution pipeline.
> 
> We can use a similar algorithm for stores.

right. ok. thanks to Mitch Alsup, i have a Memory Dependency Matrix that
takes care of discerning the loads and stores and preserving them as
batches.  it could, if we wanted, do not only TSO: it can handle cross-SMP
in a way that makes atomic memory operations either entirely moot or dead
trivial.  this i need to research a little more.

anyway the point is: LOADs as a batch are already identified and hold up any
STOREs, and vice-versa.
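for reference, the selection you describe, modelled very roughly in
python (dict-based queue entries, 64-byte lines and max_ports are my
assumptions, purely to show the head-to-tail scan):

LINE = 64

def pick_loads(queue, max_ports=4):
    """queue: list of dicts with 'kind' ('load'/'store'/'fence') and
    'addr'.  returns the indices of loads issued this cycle."""
    if not queue or queue[0]["kind"] != "load":
        return []
    target_line = queue[0]["addr"] // LINE
    picked = [0]                       # the load at the head always goes
    for i, op in enumerate(queue[1:], start=1):
        if op["kind"] == "fence":
            break                      # fences order everything behind them
        if op["kind"] == "store" and op["addr"] // LINE == target_line:
            break                      # conflicting store: stop the scan
        if op["kind"] == "load" and op["addr"] // LINE == target_line:
            picked.append(i)           # same line: complete it this cycle too
            if len(picked) == max_ports:
                break
    return picked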

> To find the required loads, we can use a network based on recursively
> summarizing chunks of the queue entries' per-cycle ready state, then
> reversing direction from the summary back to the queue entries to tell the
> entries which, if any, execution port they will be running on this cycle.
> There is then a mux for each execution port in the load pipeline to move the
> required info from the queue to the pipeline. The network design is based on
> the carry lookahead network for a carry lookahead adder, which takes
> O(N*log(N)) space and has O(log(N)) gate latency.

with the Memory DMs already in place, taking care of separating LOAD from STORE
batches, would it be necessary to do that? (i don't know the answer).
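for my own understanding, what that network computes, modelled
sequentially (the hardware version would be a log-depth parallel-prefix
tree; this O(N) loop is just a specification of the same result, and
NUM_PORTS is a made-up number):

NUM_PORTS = 4

def assign_ports(ready):
    """ready: one bool per queue entry.  returns, per entry, the
    execution port it wins this cycle, or None."""
    ports = []
    count = 0                      # prefix count of ready entries so far
    for r in ready:
        if r and count < NUM_PORTS:
            ports.append(count)    # entry gets port = #ready before it
            count += 1
        else:
            ports.append(None)
    return ports

print(assign_ports([False, True, True, False, True, True, True]))
# -> [None, 0, 1, None, 2, 3, None]  (only NUM_PORTS winners per cycle)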

also, the ordering is not important (Weak Memory Model), and so a buffer
will *only* have LOADs in it, or STOREs, but not both, and it becomes
possible to analyse the buffer and see if a batch is present.

actually if we had a (really) small CAM that mirrored the way the L1 cache
worked, that might work.  2 maybe 4 lines or something.

then you can detect when the lines are fully occupied...

hmmm needs thought
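the kind of thing i mean, as a toy model only (all names made up,
nothing designed yet): each CAM entry tracks one line's tag plus a
byte-occupancy bitmap, so "line fully occupied" is just an all-ones
check:

LINE = 64
FULL = (1 << LINE) - 1

class TinyLineCAM:
    def __init__(self, nlines=4):
        self.entries = {}          # line tag -> occupied-byte bitmap
        self.nlines = nlines

    def add(self, addr, width_bytes):
        tag, off = addr // LINE, addr % LINE
        if tag not in self.entries and len(self.entries) == self.nlines:
            return False           # CAM full: stall this op
        mask = ((1 << width_bytes) - 1) << off
        self.entries[tag] = self.entries.get(tag, 0) | mask
        return True

    def full_lines(self):
        """tags whose 64 bytes are all covered: batch ready to go."""
        return [t for t, m in self.entries.items() if m == FULL]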

> Loads/Stores that cross a cache-line boundary can be split into 2
> load/store ops when sent to the queue, and the loads reunited when both
> halves complete. They should be relatively rare, so we can probably
> support reuniting only 1 op per cycle.

yes.
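a quick illustration of the split-and-reunite (python again, 64-byte
lines assumed, purely a sketch):

LINE = 64

def split(addr, width_bytes):
    """return one (addr, width) op, or two if the access crosses a
    cache-line boundary."""
    end = addr + width_bytes
    if addr // LINE == (end - 1) // LINE:
        return [(addr, width_bytes)]
    first = LINE - (addr % LINE)           # bytes up to the boundary
    return [(addr, first), (addr + first, width_bytes - first)]

def reunite(lo_data, lo_width, hi_data):
    """recombine: low half in the low bytes, high half above it."""
    return lo_data | (hi_data << (lo_width * 8))

ops = split(60, 8)        # crosses the 64-byte boundary: 4 + 4 bytes
print(ops)                # [(60, 4), (64, 4)]
print(hex(reunite(0x04030201, 4, 0x08070605)))  # 0x807060504030201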

> RMW Atomic ops and fences can be put in both load and store queues where
> they are executed once they reach the head of both queues.

luckily, for these, Mitch told me you simply set *both* the MemRead *and*
MemWrite hazards, and the result, fascinatingly, is that they have to be
atomic: they lock out all other LDs and STs and, because of the way the
DMs work, the order is preserved as well.

so there will *be* no overlap between atomics, LDs, or STs.

now, whether these all go into the same queue, i don't know.
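mitch's trick in miniature (illustrative python only, not the actual
Memory DM code): an atomic RMW raises *both* hazard bits, so the
ordinary load-vs-store hazard rules automatically serialise it against
every other memory op on both sides:

def hazards(op):
    """per-op (MemRead, MemWrite) hazard bits."""
    return {"load":   (True, False),
            "store":  (False, True),
            "atomic": (True, True)}[op]   # atomic = read AND write

def must_order(older, younger):
    """two ops stay ordered if either one writes what the other touches."""
    r1, w1 = hazards(older)
    r2, w2 = hazards(younger)
    return (w1 and (r2 or w2)) or (w2 and (r1 or w1))

# loads may reorder among themselves; nothing reorders past an atomic:
print(must_order("load", "load"))     # False
print(must_order("load", "atomic"))   # True
print(must_order("atomic", "store"))  # True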

the main thing is: if we have a queue, it basically has to be not just a
queue, it has to be a cache as well.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


More information about the libre-riscv-dev mailing list