[libre-riscv-dev] store computation unit

Tue Jun 4 23:27:22 BST 2019

On Wednesday, June 5, 2019, Mitchalsup <mitchalsup at aol.com> wrote:

>
>
> So STs have to hold up other STs in sequence.
> Only until it is known that the 2 stores have independent addresses.
>

Got it.

>  If yes, that's covered by the overloading of the shadow system so is ok.
> With branch shadowing, you will not want to "perform" the store until the
> shadow has cleared.
>

Yes. I realised my understanding is out of date here. You are, I believe,
referring to a mirror of the FU Regs DM, adapted to cover MemRefs.

> In multi-issue, it is important to flop both the read pending and write
> pending vectors at the start of issue.
>
>
> By flop, I believe you are referring to the 2 phase cycle used in the
> 6600, major cycle minor cycle, which they did through posedge and negedge
> clk if I recall correctly?
> By flop I mean; capture the data in a place where it can be held for at
> least 1 full cycle--
> that is a flip-flop or an SR-flop.
>

Oh ok.

>
> Ok so it goes something like this:
>
> * current cycle has read and write vectors.
> * each instruction creates vectors that are transitively merged with the
> previous one allowing the successive instructions to see the current
> dependence state(s).
>

Got it. Love the simplicity.

> * each instruction merges in the global with its local transitive vectors
>
> Let us take for example the very last multi issue instruction (youngest in
> order). Even though its ISSUE signal is being raised at the same time as
> the others, it still does NOT have any dependencies other than those it is
> supposed to.
>
> i.e. even the transitive relationships create a triangle, not a full
> square.  The youngest instruction will have fewer deps than youngest-1,
> which will have fewer deps than youngest-2, and so on.
>
> While the dependence matrix is always a triangle, you will find that
> operating it circularly you will want it implemented as a square. You
> operate it circularly by allowing completed instructions to leave and
> entering new instructions into the <now> oldest slot.
>

Yes. Software would have a triangle. HW requires allocation of all full
potential resources that might get used.

Bit weird for a software engineer to grasp, that one.
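
A minimal Python sketch of the transitive merge described above, as a
software model only. The names issue_batch, reg_mask and hazards are
hypothetical, and registers are modelled as bit positions in plain
integers; the real design does this in hardware, in parallel, not in a
loop. Instruction k in the batch only ever sees the global vectors plus
what instructions 0..k-1 added, which is where the triangle comes from.

```python
def reg_mask(regs):
    """Turn a list of register numbers into a bit-vector."""
    mask = 0
    for r in regs:
        mask |= 1 << r
    return mask


def issue_batch(global_rd, global_wr, batch):
    """batch: list of (src_regs, dest_regs), oldest instruction first.

    Returns the (rd_pending, wr_pending) view each instruction sees,
    plus the merged global vectors after the whole batch has issued.
    """
    views = []
    rd, wr = global_rd, global_wr      # snapshot "flopped" at start of issue
    for srcs, dests in batch:
        views.append((rd, wr))         # global + all older batch members
        rd |= reg_mask(srcs)           # merge in this instruction's reads
        wr |= reg_mask(dests)          # ... and its writes
    return views, rd, wr


def hazards(view, srcs, dests):
    """Which hazards an instruction has against the view it was given."""
    rd, wr = view
    return {
        "RAW": bool(reg_mask(srcs) & wr),
        "WAR": bool(reg_mask(dests) & rd),
        "WAW": bool(reg_mask(dests) & wr),
    }


# Three instructions issued in the same cycle (illustrative only):
#   i0: r3 = r1 + r2,  i1: r4 = r3 * r1,  i2: r1 = r4 - r2
batch = [([1, 2], [3]), ([3, 1], [4]), ([4, 2], [1])]
views, rd, wr = issue_batch(0, 0, batch)
```

Here i0 sees empty vectors, i1 sees only i0's reads and writes, and i2
sees both i0's and i1's, even though all three were issued together.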

>
> Hang on though... This works only if there are 2 Matrices working in
> tandem.
>
> It works with the Registers because the FU Regs Dep Matrix creates the reg
> vectors transitively.
>
> MemRefs are dependent on register data-flow order and on memory data-flow
> order. {And on branch shadow order, too.}
>
> Then these are used to fire the FU FU Read/Write Pending signals.
>
> In the case of MemRefs, the MDM is directly equivalent to the FUFU Matrix,
> it does *not* receive LD Pending and ST Pending signals that went through a
> similar transitive cumulation.
>
> No, but it does see the Go_LD[k] and Go_St[k] signals, which relax the
> dependences.
>

Yes. Direct mirror equivalent to FURegs-FUFU using GORD and GOWR.

> Argh.
>
> This accumulation (which if we can take the global vectors concept from
> Reg Read/Write will have to have the multi issued LD/STs be dependent on
> LDs *and* STs) is starting to sound very similar to the ( a thru d ) thing
> you described above. BINGO
>

:)

>
> Except.. because LDs are RaR and because of the AGEN match... the multi
> issue LD/STs do *not* have to be transitively dependent on LD/STs, just
> STs, am I right?
>
> LD<->LD are independent in order (reading does not harm)
>

Except for IO Domain. I am seriously considering just doing LD - LD deps
for this reason, and to actually just reuse the FURegs Matrix source code,
the "registers" being the LDs and STs.

RD_Pending would become LD_Pending, WR would become ST_Pending.

Obviously operand2 "reg" would not exist.

This is because it was such a pig getting FURegs and FUFU right that I am
disinclined to alter or copy the code!

> ST<->LD are dependent so the ST happens before the LD
>                when ST[address] mismatches LD[address] the load becomes
> independent.
>

This is where the parallel between FURegs and LDST is different.

In FURegs, we *know* that the regs are different, because, doh, the reg
nums are different.

The MemRef version of FURegs does *not* know if they are different because
the address hasn't been computed yet.

So we assume that they clash - this is why it is an NxN Matrix where FURegs
is NxM, N=FUs, M=Regs - and proceed to drop deps once the addresses are
known and turn out not to clash.
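
A rough Python model of that "assume a clash, then relax" idea, under
stated assumptions: the class name MemDepMatrix and its methods are
hypothetical, addresses are compared in full (the real comparison would
only use a subset of the bits), and the ld_ld_deps flag stands in for
the "just do LD-LD deps as well" option discussed for the IO domain.

```python
class MemDepMatrix:
    """NxN matrix: dep[i][j] is True while op i must wait for older op j."""

    def __init__(self, kinds, ld_ld_deps=False):
        # kinds: list of "LD" / "ST" in program order (index 0 is oldest).
        self.kinds = list(kinds)
        n = len(self.kinds)
        self.addr = [None] * n              # unknown until AGEN completes
        self.dep = [[False] * n for _ in range(n)]
        for i in range(n):
            for j in range(i):              # only older ops block younger ones
                both_loads = self.kinds[i] == "LD" and self.kinds[j] == "LD"
                # Address not known yet: assume a clash, except for LD-LD
                # (read after read) unless LD-LD deps were asked for.
                self.dep[i][j] = ld_ld_deps or not both_loads

    def set_address(self, k, addr):
        """AGEN result for op k: drop deps against ops whose address differs."""
        self.addr[k] = addr
        for other in range(len(self.kinds)):
            if other == k or self.addr[other] is None:
                continue
            if self.addr[other] != addr:
                younger, older = max(k, other), min(k, other)
                self.dep[younger][older] = False

    def perform(self, k):
        """Op k has been performed: everything waiting on it is released."""
        for i in range(len(self.kinds)):
            self.dep[i][k] = False

    def ready(self, k):
        """True once op k has no outstanding dependencies on older ops."""
        return not any(self.dep[k])


m = MemDepMatrix(["ST", "LD", "LD", "ST"])
m.set_address(0, 0x100)
m.set_address(1, 0x200)   # differs from the ST: LD 1 becomes independent
m.set_address(2, 0x100)   # matches the ST: LD 2 must keep waiting
```

Calling m.perform(0) afterwards releases LD 2, which is the "OR when the
ST is performed" half of the rule.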

>                OR when the ST is performed the LD is enabled
> LD<->ST are dependent so the LD happens before the ST
>                when LD[address] mismatches ST[address] the store becomes
> independent.
>                OR when the LD is performed the ST is enabled.
>

These ORs are the direct mirror of the big Read_Pending OR gates for FURegs.

> ST<->ST are dependent so the ST[1] happens before the ST[2]
>                when ST[1][address] mismatches ST[2][address] the ST[2]
> becomes independent.
>                OR when ST[1] is performed ST[2] is enabled.
>
>
>
>

> we're doing a GPU, and i'm concerned about the lower bits (fine
> granularity on data structures being processed in parallel),
> particularly as the vectorisation is done by transparent multi-issue.
>
> You will want to segregate vector MemRefs into 3 categories
> 1) sequential addresses (stride = 1)
> 2) strided addresses (where stride != 1)
> 3) gather/scatter (where no rules do any good)
> and
> 0) ATOMIC where this effectively looks like a stride = 0 case.
>
>
> What I was hoping for was, due to the weird way in which Vectorisation in
> this design is kinda a cheat, (by overloading and abusing the multi issue
> concept), was to not even have to do any of that. Except perhaps ATOMIC.
>
> You will find it profitable to recognize the 3(4) above patterns in the
> sequencer.
>

Ok. Will bear it in mind. Am guessing this will hit home when it comes to
cache line missing and filling etc, which is an area I have not yet put
enough thought into.
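
For reference, a small Python sketch of how the 3(4) patterns above
could be told apart from a list of generated addresses. This is an
illustration only: classify_memrefs is a hypothetical name, "stride = 1"
is assumed to mean one element width in bytes, and a real sequencer
would know the pattern from the instruction rather than by inspecting
addresses after the fact.

```python
def classify_memrefs(addrs, elwidth):
    """Classify the byte addresses produced by one vector LD/ST.

    addrs:   addresses in element order
    elwidth: element size in bytes (assumed unit for "stride = 1")
    """
    if len(addrs) < 2:
        return "sequential"            # a single element is trivially regular
    strides = {b - a for a, b in zip(addrs, addrs[1:])}
    if len(strides) != 1:
        return "gather_scatter"        # no single stride: no rule helps
    stride = strides.pop()
    if stride == 0:
        return "stride0"               # the shape an ATOMIC effectively has
    if stride == elwidth:
        return "sequential"
    return "strided"


seq = classify_memrefs([0x1000, 0x1008, 0x1010, 0x1018], 8)
strided = classify_memrefs([0x1000, 0x1020, 0x1040], 8)
scattered = classify_memrefs([0x1000, 0x1040, 0x1008], 8)
same = classify_memrefs([0x2000, 0x2000, 0x2000], 8)
```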

>
> i.e. a vector LD or ST is done by issuing sequential operations that
> sequentially increase both the address to be AGEN'd (a sequential
> offset added) *and* sequentially increase the register number.
>
> So, you get a MissBuffer:LineHit with a CacheMiss so you know you don't
> need to fetch because somebody is already Fetching it. So you end up
> with multiple outstanding LDs waiting for a single Line to be returned from
> DRAM.....
>
>
> Ohh so that's why you don't bother checking AGEN address bits below the
> line cache size, huh.
> That and address Aliasing::
>
> VA[0x10xxx] -> PA[0x567xxx]
> VA[0x22xxx] -> PA[0x567xxx]
>
> So a ST[0x10234] actually does not mismatch to LD[0x22234] because both map
> to the same place in memory!
>

Ngghh brainmelt.

Virtual addresses have the same lower bits as physical addresses, up to the
4k page boundary [the low 12 bits, bits 0-11].

So we have to take up to those 4k, because if we start going beyond that it
would be necessary to incorporate the PA bits into the miss, and that means
going through the cache, and consequently DELAYING things and generally
making it massively complicated.

I am hoping that Jacob can confirm that if Vulkan has data structures on
fixed 1MiB boundaries, we do not need to take that into account.

Vulkan, Jacob informs me, requires texture data structures (which are
absolutely massive and very regular) to be accessed in sequential order.

So there are *genuine* regular "miss" opportunities that only up to 12
address bits would not catch, turning them into single issue LD/ST
operations and hurting performance.
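
A short Python sketch of that partial address comparison, to make the
trade-off concrete. The names compare_mask and may_conflict are
hypothetical, and the 64-byte cache line (LINE_BITS = 6) is purely an
assumption for illustration; only the 4k page size comes from the
discussion above.

```python
PAGE_BITS = 12   # 4 KiB pages: VA and PA agree on bits 0..11
LINE_BITS = 6    # ASSUMPTION: 64-byte cache line, for illustration only


def compare_mask(line_bits=LINE_BITS, page_bits=PAGE_BITS):
    """Bits that are compared: above the line offset, below the page size."""
    return ((1 << page_bits) - 1) & ~((1 << line_bits) - 1)


def may_conflict(addr_a, addr_b, line_bits=LINE_BITS, page_bits=PAGE_BITS):
    """Conservative test: True if the two addresses *might* overlap.

    Bits above the page boundary are ignored because two different virtual
    pages can map to the same physical page; bits below the line size are
    ignored because the comparison is per cache line.
    """
    mask = compare_mask(line_bits, page_bits)
    return (addr_a & mask) == (addr_b & mask)


aliased = may_conflict(0x10234, 0x22234)          # the VA aliasing example
different_line = may_conflict(0x10234, 0x10274)   # same page, other line
one_mib_apart = may_conflict(0x40000, 0x40000 + (1 << 20))
```

The last case is the problem described above: two structures exactly
1MiB apart have identical low 12 bits, so this test reports a possible
conflict even though the accesses are genuinely independent.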

What I am hoping is that a large batch of LDs on one of the data structures
will be delayed (held up) by 1 or 2 clocks, such that the large batch of
LDs on the data structure separated by *exactly* 1MiB will "stripe"
(interleave, 1 or 2 clocks delayed) with the first batch, even if we do not
use the "bitmap" idea on the last 4 bits of the address.

>
>
> Wrong kind of Alcohol.
>
>
> Not the purple stuff, either? ;)
>
> Hint: in pure form it is clear with slight hints of sweetness.
>
>
Ohh, C2H5OH, yes, had to stop drinking it after it had an effect on me
similar to this:
https://www.youtube.com/watch?v=2TQuacxEjAU

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68