[libre-riscv-dev] [Bug 257] Implement demo Load/Store queueing algorithm

Thu Mar 26 21:23:08 GMT 2020

http://bugs.libre-riscv.org/show_bug.cgi?id=257

--- Comment #20 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #19)
> I was thinking that the load/store FUs would just split and submit both
> halves to a queue-like structure that and the FU would just track both
> halves independently until they complete, then merge them. If a load/store
> doesn't need to be split, then it would only take one queue entry.

yes - only submit one queue entry, however there's a second split, in order
to double the maximum throughput (without needing dual-ported L1 cache SRAM):
use bit 4 to select odd-even banks.

however that selection needs to be back at the start:
https://libre-riscv.org/3d_gpu/l0_to_l1_n_way_cache_buffer.png

so we need *two* multi-in 4-out PriorityPickers, where the inputs are
actually done like this:

for i in range(16):
   if addr[i][4] == 0:
      multi_picker_even[i] = ldst_requested[i]
   else:
      multi_picker_odd[i] = ldst_requested[i]

*then* followed by:

mpick_even = MultiPriorityPicker(multi_picker_even, 4)
mpick_odd  = MultiPriorityPicker(multi_picker_odd, 4)

*then* followed by:

for i in range(4):
     L0_cache_buffer_even.port[i] == mpick_even.out_en[i] # each 16-bit wide
     L0_cache_buffer_odd .port[i] == mpick_odd .out_en[i] # ditto

that L0_cache_buffer_odd/even, yes it can be a Queue, however a FIFO Queue
has a bit of a problem: you can't normally get access to the entries in it.
where we need to *merge* entries (the bytemasks) that happen to have the same 
address bits 5 and upwards.

> That queue can have all the dependency matrix and scheduling stuff be part
> of it and it only needs enough entries to keep the pipeline full -- it
> doesn't need double entries to handle split load/stores since the FUs will
> just stall if the queue is too full.

part of the discussion on comp.arch was that we *may* want to prioritise
aligned LDs/STs over misaligned ones.

this can easily be done by putting the misaligned LDs/ST requests at the
*end* of the PriorityPicker.

however, again, if there happens to be a sequential batch of LDs/STs
which happen to be misaligned because they are a massive sequential batch of
8-byte LDs that were offset accidentally by 4 bytes, we do *not* want
misaligned LDs/STs to be deprioritised, because they will actually merge
quite nicely with the other LDs/STs.

therefore (and i'm recommending we do this later) we actually need to
prioritise which entry should be sent from the L0 cache/buffer/queue
by the number of bits that "merge" at the bytemask level.  this with
a "popcount" (or pseudo-popcount) on every entry in the L0 cache/buffer/queue
that detects at the very minimum which are full cache-lines after merging
the byte-masks which all have the same binary addresses [bits 5 and upwards].

*not* by "whether it is first in the queue".

so ultimately, a queue (FIFO queue) is not the right data structure to use.

if you'd like to write a round-robin Picker, great! :)  however i don't
think it makes a huge amount of difference, initially, whether it's
Priority-picked or RR-picked.  the PP and Multi-PP code exists :)

-- 
You are receiving this mail because:
You are on the CC list for the bug.