[libre-riscv-dev] [Bug 215] evaluate minerva for base in libre-soc

Wed Mar 11 19:25:51 GMT 2020

http://bugs.libre-riscv.org/show_bug.cgi?id=215

--- Comment #5 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #4)
> (In reply to Jacob Lifshay from comment #3)
> > do note that compressed texture decoding needs to be able to load 128-bit
> > wide values (a single compressed texture block), 
> 
> okaaay.
> 
> > so our scheduling circuitry
> > should be designed to support that. They should always be aligned, so we
> > won't need to worry about that in the realignment network.
> 
> whew.
> 
> so that's 128-bit-wide for _textures_... that's on the *load* side.  are
> there any simultaneous (overlapping) "store" requirements? are the
> code-loops tight enough to require simultaneous 128-bit LD *and* 128-bit ST?

yes and no -- code such as memcpy, and probably most other code that has both
loads and stores in a loop, will benefit from simultaneous loads and stores,
but it isn't strictly necessary.

It will be highly beneficial to support multiple 8, 16, 32, or 64-bit loads to
a single cache line all completing in the same cycle, independently of their
alignment within that cache line. Misaligned loads that cross cache lines (and
possibly page boundaries) should also be supported, though those don't need to
complete in a single cache access.

All the above also applies to stores, though they can be a little slower since
they are less common.

I realize that will require a really big realignment network; however, I think
the performance advantages are worth it.

For a scheduling algorithm for loads that are ready to run (sent by the
6600-style scheduler to the load/store unit for execution, with no conflicting
stores and no memory fences in front), we can have a queue of memory ops. Each
cycle we pick the load at the head of the queue, then search from head to tail
for additional loads that target the same cache line, stopping at the first
memory fence, conflicting store, etc. Once those loads are selected, they are
removed from the queue (probably by marking them as removed) and sent through
the execution pipeline.
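
As a rough software model of the selection step just described (not a
hardware design), something like the following Python could be used. The
64-byte line size, the `MemOp` record, the `max_ports` limit, and the rule that
only a store to the same cache line counts as "conflicting" are all
assumptions made for illustration; none of them come from this thread.

```python
from dataclasses import dataclass

CACHE_LINE_BYTES = 64  # assumed line size


@dataclass
class MemOp:
    kind: str              # "load", "store", or "fence"
    addr: int = 0
    size: int = 0
    removed: bool = False  # entries are marked removed, not deleted


def line_of(addr):
    return addr // CACHE_LINE_BYTES


def select_loads(queue, max_ports=4):
    """Pick the head load plus later loads that target the same cache line."""
    live = [op for op in queue if not op.removed]
    if not live or live[0].kind != "load":
        return []
    head = live[0]
    picked = [head]
    for op in live[1:]:
        if len(picked) == max_ports or op.kind == "fence":
            break  # out of ports, or a fence blocks everything behind it
        same_line = line_of(op.addr) == line_of(head.addr)
        if op.kind == "store":
            if same_line:
                break  # assumed conflict rule: store to the same line
            continue   # unrelated store, keep scanning
        if same_line:
            picked.append(op)
    for op in picked:
        op.removed = True
    return picked
```

Calling `select_loads` once per simulated cycle returns the group of loads that
would share one cache access. Retiring fences and stores is left out of this
sketch.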

We can use a similar algorithm for stores.

To find the required loads, we can use a network based on recursively
summarizing chunks of the queue entries' per-cycle ready state, then reversing
direction from the summary back to the queue entries to tell the entries which,
if any, execution port they will be running on this cycle. There is then a mux
for each execution port in the load pipeline to move the required info from the
queue to the pipeline. The network design is based on the carry lookahead
network of a carry lookahead adder, so for N queue entries it takes
O(N*log(N)) space and O(log(N)) gate latency.
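
To illustrate the summarize-then-distribute data flow only, here is a small
Python sketch using a standard up-sweep/down-sweep prefix count. It is a
software model under my own assumptions (the function name `assign_ports`, the
power-of-two padding, and numbering ports by queue order are hypothetical), not
the gate-level network, and it makes no claim about the actual gate count.

```python
def assign_ports(ready, num_ports):
    """Give the first num_ports ready entries a port index, others None."""
    n = 1
    while n < len(ready):
        n *= 2
    # Pad with not-ready entries so the tree is balanced.
    tree = [int(r) for r in ready] + [0] * (n - len(ready))

    # Up-sweep: each level summarizes pairs of chunks into a ready count.
    step = 1
    while step < n:
        for i in range(2 * step - 1, n, 2 * step):
            tree[i] += tree[i - step]
        step *= 2

    # Down-sweep: reverse direction, telling each chunk how many ready
    # entries sit ahead of it in the queue.
    tree[n - 1] = 0
    step = n // 2
    while step >= 1:
        for i in range(2 * step - 1, n, 2 * step):
            left = tree[i - step]
            tree[i - step] = tree[i]
            tree[i] += left
        step //= 2

    # A ready entry with fewer than num_ports ready entries ahead gets a port.
    return [tree[i] if ready[i] and tree[i] < num_ports else None
            for i in range(len(ready))]
```

Each `while` loop runs log2(N) levels, and every level's inner loop touches
independent entries, which is what lets hardware evaluate a level in parallel.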

Loads/Stores that cross a cache line boundary can be split into 2 load/store
ops when sent to the queue, with the two halves of a load reunited once both
complete. They should be relatively rare, so we can probably support reuniting
only 1 op per cycle.
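
A minimal Python sketch of the split and reunite steps, assuming a 64-byte
cache line, accesses no larger than one line, and load data held as bytes in
address order. The names `split_at_line` and `reunite` are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumed line size


def split_at_line(addr, size):
    """Split an access into at most two (addr, size) pieces, one per line."""
    line_end = (addr // CACHE_LINE_BYTES + 1) * CACHE_LINE_BYTES
    if addr + size <= line_end:
        return [(addr, size)]  # fits in one cache line, no split needed
    first = line_end - addr    # bytes that fall in the lower line
    return [(addr, first), (line_end, size - first)]


def reunite(low_bytes, high_bytes):
    """Join the two halves of a split load, lower addresses first."""
    return low_bytes + high_bytes
```

For example, an 8-byte load at address 60 becomes a 4-byte load at 60 and a
4-byte load at 64.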

RMW Atomic ops and fences can be put in both load and store queues where they
are executed once they reach the head of both queues.
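
The "head of both queues" rule can be modelled in a few lines of Python. This
is only a sketch under assumptions of mine: ops are plain identifiers, `shared`
names the ops (atomics, fences) placed in both queues, and one op issues per
step with loads tried before stores. The real unit would issue many ops per
cycle.

```python
from collections import deque


def drain(load_ops, store_ops, shared):
    """Return an issue order in which shared ops wait to head both queues."""
    load_q, store_q = deque(load_ops), deque(store_ops)
    issued = []
    while load_q or store_q:
        if (load_q and store_q and load_q[0] == store_q[0]
                and load_q[0] in shared):
            # The same shared op heads both queues: execute it once.
            issued.append(load_q.popleft())
            store_q.popleft()
        elif load_q and load_q[0] not in shared:
            issued.append(load_q.popleft())
        elif store_q and store_q[0] not in shared:
            issued.append(store_q.popleft())
        else:
            # A shared op is missing from one queue or is ordered differently.
            raise RuntimeError("shared ops do not line up in both queues")
    return issued
```

Ordinary loads and stores ahead of the shared op drain first, so everything in
front of it in either queue has issued before it runs.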
