[libre-riscv-dev] LD/ST address matcher

Tue Jun 4 22:24:01 BST 2019

On Wednesday, June 5, 2019, Jacob Lifshay <programmerjake at gmail.com> wrote:

> On Tue, Jun 4, 2019, 05:35 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > On Tue, Jun 4, 2019 at 5:59 AM Jacob Lifshay <programmerjake at gmail.com>
> > wrote:
> >
> > > On Mon, Jun 3, 2019, 21:50 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > > wrote:
> > >
> > > > Which one is it?
> > > >
> > > not quite either, it's more like:
> > > Pixel *textureA = (Pixel *)0x123400000;
> > > Pixel *textureB = (Pixel *)0x567800000;
> > > for(size_t i = 0; i < 0x100000; i++)
> > > {
> > >     textureA[i] = ...;
> > >     textureB[i] = ...;
> > > }
> >
> >  ok so there are massive regular-sized data structures, at fixed
> > memory locations, guaranteed to be on multi-page regular boundaries,
> > where an inner loop will be accessing two such data structures.
> >
> can easily be much more than 2 such data structures.

 OK, so with several... I was going to suggest just not worrying about it.
What would happen (as long as we have the bitmap on the 1st 16 bytes) is this:

The first access to the first word of the data structure would be detected
as single issue (an address hit would occur).

This LD would therefore be paused (not allowed to proceed).

On the next cycle however, being now one cycle BEHIND those LDs that were
racing through the other data structures, it would now SUCCEED in being
multi issued in parallel with other LDs.

Whilst struct1 would be LDing its 1st word, struct2 would be LDing its 2nd.

i.e. it results in automatic striping.
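
To make the striping concrete, here is a toy cycle model in C. It is only a sketch under assumptions that are not from this thread: three streams (the third base address is made up), each LD consuming 16 bytes, a partial match key taken from address bits 4..11, conflicts checked only between LDs issued in the same cycle, and fixed priority in stream order. The names partial_key and simulate are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_STREAMS 3      /* assumption: number of data structures in the loop */
#define STRIDE      16u    /* assumption: bytes consumed by each LD */

/* Partial address key: bits 4..11 only (16-byte line within a 4K page). */
static unsigned partial_key(uint64_t addr)
{
    return (unsigned)((addr >> 4) & 0xFFu);
}

/*
 * Toy model: each cycle, streams are visited in order. A stream issues its
 * next LD unless its partial key matches a LD already issued in that same
 * cycle, in which case it is paused for the cycle.
 * Returns the total number of pauses; issued[] receives per-stream progress.
 */
static unsigned simulate(const uint64_t base[NUM_STREAMS], unsigned cycles,
                         unsigned issued[NUM_STREAMS])
{
    unsigned stalls = 0;

    for (size_t s = 0; s < NUM_STREAMS; s++)
        issued[s] = 0;

    for (unsigned c = 0; c < cycles; c++) {
        unsigned keys[NUM_STREAMS];   /* keys of LDs issued this cycle */
        size_t nkeys = 0;

        for (size_t s = 0; s < NUM_STREAMS; s++) {
            uint64_t addr = base[s] + (uint64_t)issued[s] * STRIDE;
            unsigned key = partial_key(addr);
            int hit = 0;

            for (size_t k = 0; k < nkeys; k++)
                if (keys[k] == key)
                    hit = 1;          /* partial address hit: pause this LD */

            if (hit) {
                stalls++;
            } else {
                keys[nkeys++] = key;
                issued[s]++;          /* LD proceeds; stream moves on */
            }
        }
    }
    return stalls;
}
```

With page-aligned bases such as 0x123400000, 0x567800000 and 0x9ABC00000, this model pauses the second stream once and the third stream twice, all in the first two cycles. After that the streams are staggered by one 16-byte slot each and every LD issues every cycle. Giving each base a different 16-byte offset removes even those start-up pauses in this model.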

So, really, I do not see this being a problem. Some LDs are delayed by 1
cycle: so what, it is just one cycle. It is also highly likely that the
"unused" slot at the beginning of the LD will be filled (used) by the end
of the previous loop, because *that* is highly likely to be striped
automatically, too.

I think this is why Mitch said that "in practice", partial address matching
works just fine.

Does that make sense?

I would feel more comfortable with the bitmap covering the 16-byte LSBs;
however, I still have to fully understand what Mitch is saying about the
cache line miss.

I intuitively get it.

> >
> >  ... y'know... one way to avoid the problem is to offset the first
> > data structure when loaded into memory by 16 bytes...
> >
> The Vulkan API mostly requires it.

That's interesting in itself.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68