[libre-riscv-dev] store computation unit

Luke Kenneth Casson Leighton lkcl at lkcl.net
Tue Jun 4 02:47:35 BST 2019


On Tuesday, June 4, 2019, Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> Thx Mitch, going to have to absorb and think this one through.
>
>
> ---------- Forwarded message ----------
> From: *Mitchalsup* <mitchalsup at aol.com>
> Date: Tuesday, June 4, 2019
> Subject: Re: store computation unit
> To: lkcl at lkcl.net
>
>
>
>
> Mitch Alsup
> MitchAlsup at aol.com
>
>
> -----Original Message-----
> From: Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> To: Mitchalsup <mitchalsup at aol.com>; Libre-RISCV General Development <
> libre-riscv-dev at lists.libre-riscv.org>
> Sent: Mon, Jun 3, 2019 5:41 pm
> Subject: Re: store computation unit
>
> On Mon, Jun 3, 2019 at 7:51 PM Mitchalsup <mitchalsup at aol.com> wrote:
>
> > > Got it.  This is remarkably similar to the problem I ran into
> > > on the FU-FU Matrix, where ADD r1 r1 r1 would cause
> > > the FUFU Matrix to create both a spurious RaW *and* WaR hazard...
> > > on itself.
> >
> > This is where my writeup has a flaw:: Write dependencies need to be
> flopped and not driven continuously.
>
> err, err... if you mean that the incoming ST signal is only ASSERTed
> for one clock cycle (when Issue is also raised), that has me
> concerned.
>
> No, what I mean is closer to the contrapositive of what you wrote::
> a) you accumulate the pending write dependencies and store them in a
> vector of flops
> one bit per dependence; the whole vector associated with the FU.
> b) as dependencies relax you clear one flop
> c) as dependent ops are performed, you clear that flop
> d) when the flop contains no set bits, this instruction is no longer being held
> up by dependencies.
>

Right, I got it: this looks remarkably similar to a description of the use
of shadowing for WaW detection, which is what the above is for, am I right?

So STs have to hold up other STs in sequence.  If yes, that is covered by
the overloading of the shadow system, so it is ok.
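
To check my own understanding of (a) through (d), a minimal behavioural
sketch (plain python, not the actual nmigen; the names are mine):

# one flopped pending-dependency vector per Function Unit,
# one bit per FU that this instruction is waiting on.
class FUDepVector:
    def __init__(self, num_fus):
        self.num_fus = num_fus
        self.pending = 0                  # the vector of flops

    def issue(self, dep_mask):
        # (a) accumulate the write dependencies captured at issue time
        self.pending |= dep_mask

    def relax(self, fu_index):
        # (b)/(c) a dependency has relaxed, or the dependent op has been
        # performed: clear that one flop
        self.pending &= ~(1 << fu_index)

    def released(self):
        # (d) no set bits: no longer held up by dependencies
        return self.pending == 0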

>
> hmm, key question (thinking ahead), what happens on multi-issue when
> a LD and a ST is issued simultaneously?  that would create a
> dependency loop: as they're in the same cycle, the LD would create a
> hazard on the ST, and the ST would create a hazard on the LD.
>
> In multi-issue, it is important to flop both the read pending and write
> pending vectors at the start of issue.
>

By "flop", I believe you are referring to the 2-phase cycle used in the 6600
(major cycle / minor cycle), which they did through posedge and negedge clk,
if I recall correctly?

If so, then err, I um removed that bit and managed to get a working
implementation that is near fully combinatorial, the only syncing being in
the CU.

I am seriously hoping this decision (or, more accurately, random
experimentation that finally worked) does not interfere with plans to do
multi-issue.


>  Each instruction adds dependencies
> (OR) to Read and Write pendings used by younger instructions subject
> to being issued simultaneously.
>

Love that. So simple. Transitive relationships. New big and clevva word of
the day.


>
>
>  Such dependencies are then flopped at
> the FU so the instruction is performed in proper dependence order.
>
> By flopping before adding (OR) you eliminate this potential cycle.
>

argh.

Ok so it goes something like this:

* the current cycle has global read and write vectors.
* each instruction creates vectors that are transitively merged with the
previous instruction's
* each instruction merges the global vectors in with its local transitive
vectors

Let us take for example the very last multi-issue instruction (youngest in
order). Even though its ISSUE signal is being raised at the same time as
the others, it still does NOT have any dependencies other than those it is
supposed to.

i.e. even the transitive relationships create a triangle, not a full square.
The youngest instruction will have fewer deps than youngest-1, which will
have fewer deps than youngest-2, and so on.
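
Something like this is how I picture it being built (illustrative python,
names are mine, oldest-first issue group assumed):

# read_pending / write_pending are the global vectors, flopped at the
# start of the issue cycle; each entry in issue_group carries the
# register masks that instruction itself reads and writes.
def accumulate_issue_group(read_pending, write_pending, issue_group):
    out = []
    for insn in issue_group:            # oldest first
        # this instruction sees the flopped globals plus whatever the
        # *older* same-cycle instructions have already added
        out.append((insn, read_pending, write_pending))
        # OR in its own reads/writes so that *younger* same-cycle
        # instructions pick up a dependence on it
        read_pending |= insn.reads_mask
        write_pending |= insn.writes_mask
    # each instruction only accumulates the pendings of the ones ahead
    # of it in the group: a triangle, not a full square
    return out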

Hang on though... This works only if there are 2 Matrices working in tandem.

It works with the Registers because the FU Regs Dep Matrix creates the reg
vectors transitively.

Then these are used to fire the FU FU Read/Write Pending signals.

In the case of MemRefs, the WDM is directly equivalent to the FUFU Matrix:
it does *not* receive LD Pending and ST Pending signals that went through a
similar transitive accumulation.

Argh.

This accumulation (which, if we take the global vectors concept from Reg
Read/Write, will have to make the multi-issued LD/STs dependent on LDs
*and* STs) is starting to sound very similar to the (a through d) scheme you
described above.

Except... because LDs are RaR, and because of the AGEN match, the
multi-issue LD/STs do *not* have to be transitively dependent on LD/STs,
just on STs, am I right?
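
i.e. something along these lines, if I have understood it correctly
(sketch, names are mine):

# same accumulation idea, but for the memory side.  LD-vs-LD is RaR
# (no hazard), so a LD only picks up the accumulated ST-pending bits;
# a ST picks up both LD-pending and ST-pending.
def accumulate_memref_group(ld_pending, st_pending, issue_group):
    out = []
    for memref in issue_group:                # oldest first
        if memref.is_store:
            deps = ld_pending | st_pending    # WaR and WaW
        else:
            deps = st_pending                 # RaW only: LDs ignore LDs
        out.append((memref, deps))
        if memref.is_store:
            st_pending |= memref.fu_mask
        else:
            ld_pending |= memref.fu_mask
    return out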



>
> seems to me that for multi-issue memory rules, you'd need to stop the
> multi-issue if there were multiple LDs and STs, only allowing LDs to
> be issued, or STs to be issued, in the same cycle, but not both.
> which is quite reasonable, i feel.
>
> Multi-issue of LD+ST just flops the register dependencies, the memory
> dependence matrix dependencies, and the branch shadow dependencies.
> Register dependencies relax on Go_Read and Go_Write,
> memory dependencies relax on AGEN mismatch
> branch dependencies relax upon branch confirmation
>

Got it.
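
Restating the three relaxation rules to myself (sketch, names are mine):

# register deps  : cleared on Go_Read / Go_Write
# memory deps    : cleared on AGEN mismatch (addresses proven disjoint)
# branch shadows : cleared on branch confirmation
RELAX_EVENTS = {
    "reg":    {"GO_READ", "GO_WRITE"},
    "mem":    {"AGEN_MISMATCH"},
    "branch": {"BRANCH_CONFIRMED"},
}

def relaxes(dep_kind, event):
    return event in RELAX_EVENTS[dep_kind]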


>
>
> > > I realised, firstly, that it *is* a matrix.  In other diagrams I
> misread them
> > > and believed that it was a sparse matrix, with cells only present down
> the diagonal.
> >
> > Woops, the whole matrix is present, but wires change direction along the
> diagonal.
>
> ok :)
>
> > Ah wait, ST depends on ST through WaW hazards, and that is solved not
> > through the LDST Matrix but through the instruction order "shadowing"
> that
> > I added, which creates full WaW dependence.  Does that sound about right?
> >
> > My MDM had STs dependent on other stores and this prescribed a minimal
> order on memory that
> > prevents memory from getting too far out of order wrt program order.
> Shadowing maintains then
> > relaxes the dependencies based on branch boundaries.
>
> i implemented WaW by issuing a "shadow" - not a *branch* shadow, a
> *write* shadow - from every (any) instruction across every (past)
> instruction.
>
> Probably safe, but may constrain parallelism.
>

I anticipate that with enough operand forwarding bus bandwidth, the
constraints will not be so apparent.

Furthermore, by holding lots of FUs up with results that are operand
forwarded, FUs that complete and can drop their read and write dependencies
entirely can become "nameless" and actually do not need to commit to the
register file *at all*.

Thus, strangely, as long as there are no exceptions or other shadows, it
may actually prove beneficial.

It is a sort of variant of the "nameless" scheme we discussed 6 months ago
on comp.arch.

How many more FUs are needed (an NxN increase in Matrix resources) to make
it effective remains to be seen.



> this fully preserves instruction completion order.  as in, *all*
> instructions may only complete in-order.  the shadow mechanism is
> basically creating that "linked list as a bit-matrix" i mentioned back
> in november on comp.arch.
>
> What about instructions that do not deliver results (STs and Branches
> in particular).
>

"Outstanding instructions in front of us that haven't yet committed" would
unconditionally be reduced by 1 without actually having anything to commit.

Thus, if the Register Port Bus Bandwidth could only cope with 4 commits in
any given cycle, and 1 of them was a branch or a ST, the "outstanding" count
on all FUs would be reduced by *5* rather than 4.

If there were 2 branches it would be 6; with 2 branches and a ST the count
reduction would be 7: seven instructions all retiring at once when the
nominal instruction issue is only supposed to be 4.
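
i.e. the arithmetic is just (sketch; the reading that STs and branches
retire on top of the port-limited register commits is my assumption):

# how much every FU's "outstanding" counter drops in one cycle.
# reg_commits is capped by the Register Port Bus Bandwidth; STs and
# branches write no register so they retire without using a port.
def retire_count(reg_commits, num_sts, num_branches, port_limit=4):
    return min(reg_commits, port_limit) + num_sts + num_branches

retire_count(4, 0, 1)   # -> 5
retire_count(4, 0, 2)   # -> 6
retire_count(4, 1, 2)   # -> 7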



> That is the basic idea, but instead of "on every clock" you should
> think "after the oldest instruction writes, there will be at least one
> instruction that can start."
>

Ok. Will try to get my head round that.


>
> hang on... if there are AGEN clashes, they could still be held up, so
> it's important *only* to do the "clash" test on those pending
> (potential) memory operations to be allowed to progress in this clock.
>
> You only relax the MDM on addresses that have been AGENed.
> Addresses that have not been AGENed remain pending.
>
>
> as in: you can't just throw the AGEN'd addresses into a table and
> blindly (unconditionally) do a match: the ones that are *not* being
> considered for forward progress *must* be filtered out.
>

Stated in different words. Believe I have it now.
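
In code-ish terms, roughly this (sketch, names are mine):

from dataclasses import dataclass

@dataclass
class MemOp:
    addr: int        # AGENed byte address (only valid if agen_done)
    size: int        # access width in bytes
    agen_done: bool  # has AGEN completed for this op?

def overlaps(a, b):
    # byte-range overlap test between two AGENed accesses
    return a.addr < b.addr + b.size and b.addr < a.addr + a.size

def held_up(candidates, agened_ops):
    # only ops actually being considered for progress this clock get
    # the clash test; ops with no AGENed address simply stay pending
    # and are filtered out of the comparison.
    held = []
    for op in candidates:
        if not op.agen_done:
            held.append(op)
        elif any(o is not op and o.agen_done and overlaps(o, op)
                 for o in agened_ops):
            held.append(op)
    return held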


> we're doing a GPU, and i'm concerned about the lower bits (fine
> granularity on data structures being processed in parallel),
> particularly as the vectorisation is done by transparent multi-issue.
>
> You will want to segregate vector MemRefs into 3 categories
> 1) sequential addresses (stride = 1)
> 2) strided addresses (where stride != 1)
> 3) gather/scatter (where no rules do any good)
> and
> 0) ATOMIC where this effectively looks like a stride = 0 case.
>

What I was hoping for, due to the weird way in which Vectorisation in this
design is kinda a cheat (by overloading and abusing the multi-issue concept),
was to not even have to do any of that. Except perhaps ATOMIC.

Which I believe is as simple as raising both LD and ST on the AMO ops, a
clue you gave earlier.

In other words, there is actually in effect no such thing as Vectors in
this architecture: the instructions are actually just a compact way of
saying

LD R1 #4
LD R2 #8
LD R3 #12

and so on.

However the regularity still has to be detected, hence wanting to do the
bitmap on bits 0..3 of the AGEN, so as to cover the "stride=1" case, which
no longer exists by the time the elements are in the FUs.

All quite odd, I realise :)
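
Roughly what I have in mind (sketch; the exact bit positions and the names
are mine):

# take bits 0..3 of each in-flight AGENed element address (16 byte
# positions) and build a one-hot bitmap.  if two accesses land on the
# same position, treat it as a potential clash; otherwise the elements
# are provably disjoint, even though the "vector-ness" is long gone by
# the time they are sitting in separate FUs.  (simplistic: ignores
# access width.)
def element_bitmap(agen_addrs):
    bitmap = 0
    clash = False
    for addr in agen_addrs:
        bit = 1 << (addr & 0xF)     # bits 0..3 of the AGENed address
        if bitmap & bit:
            clash = True
        bitmap |= bit
    return bitmap, clash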


> i.e. a vector LD or ST is done by issuing sequential operations that
> sequentially increase both the address to be AGEN'd (a sequential
> offset added) *and* sequentially increase the register number.
>
> So, you get a MissBuffer:LineHit with a CacheMiss so you know you don't
> need to fetch because somebody is already Fetching it. So you end up
> with multiple outstanding LDs waiting for a single Line to be returned from
> DRAM.....
>

Ohh so that's why you don't bother checking AGEN address bits below the
cache line size, huh.

Hmmmm....  how does that work... have to think about it some more.
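
Trying to restate it to myself (sketch; the line size and names are mine):

# compare only the line address (AGEN divided by the cache line size).
# a LD whose line address hits an existing miss-buffer entry does not
# start a second fetch: it just queues on that entry and waits for the
# one line to come back from DRAM.
LINE_BYTES = 64

def lookup_or_merge(miss_buffer, agen_addr):
    line = agen_addr // LINE_BYTES
    for entry in miss_buffer:
        if entry["line"] == line:
            entry["waiters"] += 1   # LineHit: piggyback on the fetch
            return entry
    entry = {"line": line, "waiters": 1}
    miss_buffer.append(entry)       # genuine miss: start a new fetch
    return entry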


> we know that these do not overlap (with each other), and yet still
> have to detect overlaps (from other instructions, vectorised or
> otherwise).
>
> so we kiiinda have to do something about that.
>
> You are going to have to compare (at least LOBs) of all AGENs against
> the already AGENed addresses and against the MissBuffer addresses
> in order to properly juggle and separate address dependencies from
> independent memrefs.
>

Ah. Yes. I think I saw something about this in the RISC-V manual, atomic
section, or perhaps FENCE.

It talked about using both the Virtual *and* the Physical AGEN as miss
detectors. I think.


> hmm, i know we got some isopropanol somewhere in the house? *cough
> splutter choke hiccup* :)
>
> Wrong kind of Alcohol.
>

Not the purple stuff, either? ;)

L.



-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

