[libre-riscv-dev] LD/ST address matcher

Luke Kenneth Casson Leighton lkcl at lkcl.net
Tue Jun 4 00:39:00 BST 2019

On Tuesday, June 4, 2019, Jacob Lifshay <programmerjake at gmail.com> wrote:

> On Mon, Jun 3, 2019, 15:51 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
> >  can you please confirm: will the Vulkan API produce batches of LDs in
> > quick succession that are *exactly* on a 1MiB boundary?  if so, i'm
> > curious to know why, and if it can be avoided.
> >
> yes, a series of texture reads to apply multiple different textures to a
> surface. the accesses are all to the same location in each of the textures,
> and the memory for textures is aligned to a large power of 2 (at least a
> page, but quite likely a megapage or larger).

Drat. Ok.

So, hm, how about a direct CMP on bits 4-11, bitmap on bits 0-3 and a very
VERY simple non-expensive hash on SOME of the upper bits? Enough to cover
up to 4MiB or so.  We would actually be safe to do the bitmap @ bits 1-3 or
even just 2-3

The compare is NxN way so is EXPENSIVE.

If there are 4 LDs and 4 STs that is an 8x8 address matching matrix, giving
a whopping TWO HUNDRED AND FORTY EIGHT comparisons needed to be done, EVERY
clock cycle, in parallel.

250 XOR gates is something like 4 gates or sonething per XOR, so... 256 x
256 x 4 = a massive 256k gates just to do AGEN hit/miss checking.

Bits 4 to 11 checking would just be AND gates, so that is about.... 8x2=16
gates per address, times 256 = 4k gates for the entire table.

Bits 0 to 3 if done as a bitmap would be 8 variable width shifters (not so
costly) then 16 bits that need checking, so around 8k gates.  Total around

Reducing the bitmap to bits 1-3 would bring the bitmap size down to 1k
gates, and still be able to match against 2 byte LD/ST operations,
including misaligned 32 and 64 bit LD/STs.  Byte level LD/STs would be
treated as single issue only if they had all but the LSB the same.

Plus a bit more, however it is still over an order of magnitude less than a
full 64 bit hash approach.

Remember we do not need a full match, it is quite odd, this is a way to
seek optimisation (parallelism) opportunities, NOT a way to fully AVOID
LD/ST clashes, as that is the LDST Dep Matrixes job, which it has already
done by detecting the order.

Otherwise yes we would definitely need to do a full 64 bit compare which
would be insane.


crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

More information about the libre-riscv-dev mailing list