[libre-riscv-dev] LD/ST address matcher

Tue Jun 4 01:15:08 BST 2019

On Tuesday, June 4, 2019, Jacob Lifshay <programmerjake at gmail.com> wrote:

> I think the hash technique would be less expensive than your estimate:
> an and gate is 6 transistors (4 for nand then 2 for inverter)
> an xor gate can be 6 transistors: http://tinyurl.com/yyygzebc
>
>
250 XOR gates compared to 12 AND gates is still 20 times more expensive.

Then multiply by 248 or so because it is NxN compares where N is likely to
be 8.

It's not xN, because *every* address has to be checked against every other
address, simultaneously.

Ok we can do 7 6 5 4 3 2 1 which is pascal triangle size, it is still far
too high, the numbers still come out at insane numbers for the full hash
method.

We have to do better.

> On Mon, Jun 3, 2019, 16:39 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > On Tuesday, June 4, 2019, Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> >
> > > On Mon, Jun 3, 2019, 15:51 Luke Kenneth Casson Leighton <lkcl at lkcl.net
> >
> > > wrote:
> > >
> > > >  can you please confirm: will the Vulkan API produce batches of LDs
> in
> > > > quick succession that are *exactly* on a 1MiB boundary?  if so, i'm
> > > > curious to know why, and if it can be avoided.
> > > >
> > > yes, a series of texture reads to apply multiple different textures to
> a
> > > surface. the accesses are all to the same location in each of the
> > textures,
> > > and the memory for textures is aligned to a large power of 2 (at least
> a
> > > page, but quite likely a megapage or larger).
> >
> >
> > Drat. Ok.
> >
> > So, hm, how about a direct CMP on bits 4-11, bitmap on bits 0-3 and a
> very
> > VERY simple non-expensive hash on SOME of the upper bits? Enough to cover
> > up to 4MiB or so.  We would actually be safe to do the bitmap @ bits 1-3
> or
> > even just 2-3
> >
> > The compare is NxN way so is EXPENSIVE.
> >
> > If there are 4 LDs and 4 STs that is an 8x8 address matching matrix,
> giving
> > a whopping TWO HUNDRED AND FORTY EIGHT comparisons needed to be done,
> EVERY
> > clock cycle, in parallel.
> >
> > 250 XOR gates is something like 4 gates or sonething per XOR, so... 256 x
> > 256 x 4 = a massive 256k gates just to do AGEN hit/miss checking.
> >
> > Bits 4 to 11 checking would just be AND gates, so that is about....
> 8x2=16
> > gates per address, times 256 = 4k gates for the entire table.
> >
> > Bits 0 to 3 if done as a bitmap would be 8 variable width shifters (not
> so
> > costly) then 16 bits that need checking, so around 8k gates.  Total
> around
> > 12k.
> >
> > Reducing the bitmap to bits 1-3 would bring the bitmap size down to 1k
> > gates, and still be able to match against 2 byte LD/ST operations,
> > including misaligned 32 and 64 bit LD/STs.  Byte level LD/STs would be
> > treated as single issue only if they had all but the LSB the same.
> >
> > Plus a bit more, however it is still over an order of magnitude less
> than a
> > full 64 bit hash approach.
> >
> > Remember we do not need a full match, it is quite odd, this is a way to
> > seek optimisation (parallelism) opportunities, NOT a way to fully AVOID
> > LD/ST clashes, as that is the LDST Dep Matrixes job, which it has already
> > done by detecting the order.
> >
> > Otherwise yes we would definitely need to do a full 64 bit compare which
> > would be insane.
> >
> > L.
> >
> >
> >
> > --
> > ---
> > crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
> > _______________________________________________
> > libre-riscv-dev mailing list
> > libre-riscv-dev at lists.libre-riscv.org
> > http://lists.libre-riscv.org/mailman/listinfo/libre-riscv-dev
> >
> _______________________________________________
> libre-riscv-dev mailing list
> libre-riscv-dev at lists.libre-riscv.org
> http://lists.libre-riscv.org/mailman/listinfo/libre-riscv-dev
>

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68