[libre-riscv-dev] LD/ST address matcher

Wed Jun 5 06:37:01 BST 2019

Note that texture access is just one of the places where memory is accessed
in sequence at addresses separated by exact multiples of 1 MiB. Other places
are fetching vertex parameters from multiple separate buffers (quite common,
maybe even more common than multiple textures), and writing to multiple
output images. Having texture fetch instructions is still necessary, but we
should also support loads separated by exact multiples of large powers of 2.
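
To make the concern concrete, here is a minimal sketch in C of why such
accesses trip a partial address matcher. It assumes a matcher that compares
only the low nbits of each address; partial_match and nbits are names made up
for illustration, not anything from the actual design. The base addresses are
the ones from the quoted example below.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a partial address matcher: only the low `nbits`
 * bits of each address are compared, so any two addresses that differ by
 * an exact multiple of 2^nbits look identical to it. */
static int partial_match(uint64_t a, uint64_t b, unsigned nbits)
{
    uint64_t mask;

    assert(nbits > 0 && nbits < 64);
    mask = (UINT64_C(1) << nbits) - 1;
    return (a & mask) == (b & mask);
}

/* Returns nonzero when element i of both buffers would be reported as an
 * address hit, assuming elem_size bytes per element. */
static int elements_collide(uint64_t base_a, uint64_t base_b,
                            uint64_t i, uint64_t elem_size, unsigned nbits)
{
    return partial_match(base_a + i * elem_size,
                         base_b + i * elem_size, nbits);
}
```

With base addresses 0x123400000 and 0x567800000, which differ by an exact
multiple of 1 MiB, elements_collide() reports a hit for every index i when
nbits is 20 or less, even though the two loads never touch the same memory.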

Note that the inner rendering loop can easily be too big to fit in the
issue queue, so relying on multiple loop iterations executing at once is a
non-starter.

On Tue, Jun 4, 2019, 14:24 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Wednesday, June 5, 2019, Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > On Tue, Jun 4, 2019, 05:35 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > wrote:
> >
> > > On Tue, Jun 4, 2019 at 5:59 AM Jacob Lifshay <programmerjake at gmail.com>
> > > wrote:
> > >
> > > > On Mon, Jun 3, 2019, 21:50 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > > > wrote:
> > > >
> > > > > Which one is it?
> > > > >
> > > > not quite either, it's more like:
> > > > Pixel *textureA = (Pixel *)0x123400000;
> > > > Pixel *textureB = (Pixel *)0x567800000;
> > > > for(size_t i = 0; i < 0x100000; i++)
> > > > {
> > > >     textureA[i] = ...;
> > > >     textureB[i] = ...;
> > > > }
> > >
> > >  ok so there are massive regular-sized data structures, at fixed
> > > memory locations, guaranteed to be on multi-page regular boundaries,
> > > where an inner loop will be accessing two such data structures.
> > >
> > can easily be much more than 2 such data structures.
>
>
> OK, so with several... I was going to suggest just not worrying about it.
> What would happen (as long as we have the bitmap on the first 16 bytes) is
> this:
>
> The first access to the first word of the data structure would be detected
> as single issue (an address hit would occur).
>
> This LD would therefore be paused (not allowed to proceed)
>
> On the next cycle however, being now one cycle BEHIND those LDs that were
> racing through the other data structures, it would now SUCCEED in being
> multi issued in parallel with other LDs.
>
> Whilst struct1 would be LDing its 1st word, struct2 would be LDing its 2nd.
>
> ie it results in automatic striping.
>
> So, really, I do not see this to be a problem. So some LDs are delayed by 1
> cycle, so what, it is just one cycle, and it is highly likely that the
> "unused" slot at the beginning of the LD will be filled (used) by the
> previous loop end, because *that* is highly likely to be striped,
> automatically, too.
>
> I think this is why Mitch said that "in practice", partial address matching
> works just fine.
>
> Does that make sense?
>
> I would feel more comfortable with the bitmap covering the 16-byte LSBs;
> however, I have to fully understand what Mitch is saying about the cache
> line miss.
>
> I intuitively get it.
>
>
> > >
> > >  ... y'know... one way to avoid the problem is to offset the first
> > > data structure when loaded into memory by 16 bytes...
> > >
> > The Vulkan API mostly requires it.
>
>
> That's interesting in itself
>
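
For reference, the automatic striping argument quoted above can be sketched
as a toy simulation in C. Everything in it is an assumption made for
illustration, not the actual design: the matcher key is address bits 4 to 11,
each stream tries to issue one 16-byte load per cycle, streams are considered
in a fixed order, and a load whose key matches one already issued in the same
cycle is paused until the next cycle. The names match_key, simulate_striping
and MAX_STREAMS are made up.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_STREAMS 8u

/* Hypothetical matcher key: address bits [4, 12), i.e. which 16-byte
 * chunk of a 4 KiB page the load touches. */
static unsigned match_key(uint64_t addr)
{
    return (unsigned)((addr >> 4) & 0xFFu);
}

/* Each stream s loads base[s] + 16 * i for i = 0 .. loads_per_stream - 1.
 * Returns the total number of paused (stalled) issue attempts and stores
 * the number of cycles taken in *cycles_out. */
static unsigned simulate_striping(const uint64_t *base, unsigned nstreams,
                                  unsigned loads_per_stream,
                                  unsigned *cycles_out)
{
    unsigned next[MAX_STREAMS] = {0};  /* next element index per stream */
    unsigned stalls = 0, cycles = 0;
    unsigned remaining = nstreams * loads_per_stream;

    assert(nstreams <= MAX_STREAMS);
    while (remaining > 0) {
        unsigned issued_keys[MAX_STREAMS];
        unsigned nissued = 0;

        for (unsigned s = 0; s < nstreams; s++) {
            unsigned key;
            int hit = 0;

            if (next[s] == loads_per_stream)
                continue;              /* this stream has finished */
            key = match_key(base[s] + (uint64_t)next[s] * 16u);
            for (unsigned j = 0; j < nissued; j++)
                if (issued_keys[j] == key)
                    hit = 1;           /* partial address hit this cycle */
            if (hit) {
                stalls++;              /* paused, retried next cycle */
                continue;
            }
            issued_keys[nissued++] = key;
            next[s]++;
            remaining--;
        }
        cycles++;
    }
    *cycles_out = cycles;
    return stalls;
}
```

In this model, stream k ends up k cycles behind stream 0 and after that the
streams no longer collide. Whether real hardware behaves this way depends on
details the model leaves out, such as issue width and cache line misses.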