[libre-riscv-dev] store computation unit

Mon Jun 3 23:41:06 BST 2019

On Mon, Jun 3, 2019 at 7:51 PM Mitchalsup <mitchalsup at aol.com> wrote:

> > Got it.  This is remarkably similar to the problem I ran into
> > on the FU-FU Matrix, where ADD r1 r1 r1 would cause
> > the FUFU Matrix to create both a spurious RaW *and* WaR hazard...
> > on itself.
>
> This is where my writeup has a flaw:: Write dependencies need to be flopped and not driven continuously.

 err, err... if you mean that the incoming ST signal is only ASSERTed
for one clock cycle (when Issue is also raised), that has me
concerned.

 do you instead mean the *registers* associated with the LD or ST?

> A) you can only be dependent on an instruction older than you are

which is sorted by the horizontal "Issue" signal into the AND gate (in
each cell in the row), which latches only the older instructions (by
column)

> B) with a matching Dest->Source by register name

 taken care of by the FU-Regs Matrix

> C) or ST[addr]->LD[addr] by address.

 taken care of by the address-matcher.
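
 to check i've understood, here are the three rules (A, B, C) written out
as a plain python predicate.  this is only the logic, nothing like the
gate-level matrices, and the names (Op, depends_on) are made up for
illustration: they are not from any actual codebase.

```python
from collections import namedtuple

# hypothetical record: seq is program order (lower = older)
Op = namedtuple("Op", "seq kind dest srcs addr")

def depends_on(me, other):
    """True if 'me' must wait on 'other' under rules A, B and C."""
    # A) only ever dependent on an instruction older than yourself
    if other.seq >= me.seq:
        return False
    # B) matching Dest->Source by register name
    if other.dest is not None and other.dest in me.srcs:
        return True
    # C) ST[addr]->LD[addr] by address
    if other.kind == "ST" and me.kind == "LD" and other.addr == me.addr:
        return True
    return False

add = Op(seq=0, kind="ADD", dest=3, srcs=(1, 2), addr=None)
st  = Op(seq=1, kind="ST",  dest=None, srcs=(3,), addr=0x100)
ld  = Op(seq=2, kind="LD",  dest=4, srcs=(), addr=0x100)
```

 here st waits on add (rule B, through r3), ld waits on st (rule C), and
nothing ever waits on something younger (rule A).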

 hmm, key question (thinking ahead), what happens on multi-issue when
a LD and a ST are issued simultaneously?  that would create a
dependency loop: as they're in the same cycle, the LD would create a
hazard on the ST, and the ST would create a hazard on the LD.

seems to me that for multi-issue memory rules, you'd need to stop the
multi-issue whenever a batch mixes LDs and STs: allow LDs to be issued,
or STs to be issued, in the same cycle, but not both.
which is quite reasonable, i feel.
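
 a rough sketch of that rule in python, assuming the issue stage sees the
candidate ops in program order and simply cuts the batch at the first
memory op of the "other" kind.  filter_issue_group is a hypothetical name,
and cutting the batch (rather than reordering) is my assumption.

```python
def filter_issue_group(candidates):
    """candidates: op kinds in program order, e.g. ["LD", "ADD", "ST"].

    returns the leading part of the batch that may issue this cycle:
    any number of LDs, or any number of STs, but never a mix.
    """
    issued = []
    mem_kind = None                 # "LD" or "ST" once one has been seen
    for op in candidates:
        if op in ("LD", "ST"):
            if mem_kind is None:
                mem_kind = op       # first memory op decides the kind
            elif op != mem_kind:
                break               # mixing LD and ST: stop multi-issue here
        issued.append(op)
    return issued
```

 so ["LD", "LD", "ST", "LD"] issues only the first two LDs this cycle,
and the ST waits for the next one.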

> > I realised, firstly, that it *is* a matrix.  In other diagrams I misread them
> > and believed that it was a sparse matrix, with cells only present down the diagonal.
>
> Woops, the whole matrix is present, but wires change direction along the diagonal.

 ok :)

> > Ah wait, ST depends on ST through WaW hazards, and that is solved not
> > through the LDST Matrix but through the instruction order "shadowing" that
> > I added, which creates full WaW dependence.  Does that sound about right?
>
> My MDM had STs dependent on other stores and this prescribed a minimal order on memory that
> prevents memory from getting too far out of order wrt program order. Shadowing maintains then
> relaxes the dependencies based on branch boundaries.

 i implemented WaW by issuing a "shadow" - not a *branch* shadow, a
*write* shadow - from every (any) instruction across every (past)
instruction.

 this fully preserves instruction completion order.  as in, *all*
instructions may only complete in-order.  the shadow mechanism is
basically creating that "linked list as a bit-matrix" i mentioned back
in november on comp.arch.

it works great: i do appreciate it's a little excessive, as literally
every instruction has to be able to cast a shadow across every other
instruction, making the shadow matrix an N-FUs x N-FUs "thing".
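
 as a behavioural sketch (python, not HDL) of what i mean by an N-FUs x
N-FUs shadow matrix: on issue, an FU records every FU that is already
busy, and it may only complete once all of those have completed.  the
class and method names are hypothetical, and which way round the rows
and columns go is my assumption here.

```python
class ShadowMatrix:
    def __init__(self, n_fus):
        self.n = n_fus
        self.busy = [False] * n_fus
        # waits_on[fu][older] is True while 'fu' is shadowed by 'older'
        self.waits_on = [[False] * n_fus for _ in range(n_fus)]

    def issue(self, fu):
        assert not self.busy[fu]
        # every instruction already in flight casts a shadow over this one
        for older in range(self.n):
            self.waits_on[fu][older] = self.busy[older]
        self.busy[fu] = True

    def can_complete(self, fu):
        return self.busy[fu] and not any(self.waits_on[fu])

    def complete(self, fu):
        assert self.can_complete(fu)
        self.busy[fu] = False
        # drop this FU's shadow from everything issued after it
        for younger in range(self.n):
            self.waits_on[younger][fu] = False

m = ShadowMatrix(3)
for fu in (2, 0, 1):        # issue order: FU2, then FU0, then FU1
    m.issue(fu)
```

 with that issue order only FU2 can complete at first, then FU0, then
FU1, i.e. completion is forced into issue order.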

 so (indirectly) the actual writing-out of the STs will only be
permitted to occur in-order (strict in-order).  when it comes to
multi-issue i'll have to add a store buffer which understands the
instruction order (from that "counter of number of instructions that
are outstanding to be committed" i mentioned before).

> that last LD is where it goes haywire, because if it is even allowed
> to issue, with LD-Pend being HI *and* ST-Pend being HI, deadlock is
> guaranteed.
>
> A MemRef is only dependent on older MemRefs, and only dependent
> until they AGEN and mismatch or AGEN match and make forward progress.

 so, basically, on every clock, there's going to be at least one
instruction which has zero dependencies, and that one will always be
able to progress.

 hang on... if there are AGEN clashes, they could still be held up, so
it's important to do the "clash" test *only* on those pending
(potential) memory operations that are candidates to progress in this
clock.

 as in: you can't just throw the AGEN'd addresses into a table and
blindly (unconditionally) do a match: the ones that are *not* being
considered for forward progress *must* be filtered out.
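
 in python terms, a sketch of the filtering: the match is only run over
the ops that are candidates this clock, and everything else in the table
is ignored.  find_clashes is a hypothetical name, and the default mask
(bits 4 to 11) is just a placeholder, not a decision.

```python
def find_clashes(agen, candidates, mask=0xFF0):
    """agen: dict of op index (program order) -> AGEN'd address.
    candidates: indices being considered for forward progress this clock.

    returns (older, younger) pairs whose masked addresses match,
    looking only at the candidates.
    """
    active = sorted(i for i in candidates if i in agen)
    clashes = []
    for pos, older in enumerate(active):
        for younger in active[pos + 1:]:
            if (agen[older] & mask) == (agen[younger] & mask):
                clashes.append((older, younger))
    return clashes

agen = {0: 0x1040, 1: 0x1048, 2: 0x2080, 3: 0x1044}
```

 with op 1 filtered out (candidates {0, 2, 3}) the only clash reported is
(0, 3); throwing all four in blindly would also report (0, 1) and (1, 3).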

> > Now this makes sense. AGEN match I had an idea to use a
> > 16 bit bitmap to match the lower address bits and use the truncated
> > idea (bits 4 to 15) for the rest, ignore above bits 15.
> > Does that sound reasonable, or excessive? Straight compare on bits 4 thru 10 instead perhaps?
>
> I have used bits <11:6> as they are not translated (4KB pages) and larger than a cache line (64 bytes).
> I have used bits <11:4> when the L1 cache was QuadW sized and the L2 cache was Line sized.

 okaaaay.  that makes a lot of sense.  1<<12==4k
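
 to spell the bit ranges out (python sketch, helper names are made up):
comparing only bits <11:6> stays below the 4k page boundary and above
the 64-byte line offset, so it can give false matches (same bits,
different page) but never misses a real one within a line.

```python
PAGE_SIZE = 1 << 12     # 4096: bits <11:0> are not translated
LINE_SIZE = 1 << 6      # 64-byte cache line: bits <5:0> are the line offset

def bits(value, hi, lo):
    """extract value<hi:lo>, inclusive at both ends."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

def may_alias(a, b, hi=11, lo=6):
    """conservative partial compare: True means 'might overlap'."""
    return bits(a, hi, lo) == bits(b, hi, lo)
```

 may_alias(0x1040, 0x1078) is True (same line), may_alias(0x1040, 0x1080)
is False (next line), and may_alias(0x1040, 0x5040) is True even though
the pages differ, which is the price of the cheap compare.  passing lo=4
gives the <11:4> variant.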

> The important thing is that the vast majority of stuff mismatches quickly enabling more parallelism.

 we're doing a GPU, and i'm concerned about the lower bits (fine
granularity on data structures being processed in parallel),
particularly as the vectorisation is done by transparent multi-issue.

i.e. a vector LD or ST is done by issuing sequential operations that
sequentially increase both the address to be AGEN'd (a sequential
offset added) *and* sequentially increase the register number.

we know that these do not overlap (with each other), and yet still
have to detect overlaps (from other instructions, vectorised or
otherwise).
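
 roughly, in python (a sketch only: the function names and the fixed
4-byte element width are assumptions for illustration):

```python
def expand_vector_ldst(kind, base_reg, base_addr, vl, elwidth=4):
    """expand one vector LD/ST into vl scalar ops.

    each element gets the next register number and the next address
    (a sequential offset of elwidth bytes).
    """
    return [(kind, base_reg + i, base_addr + i * elwidth)
            for i in range(vl)]

def overlaps(a_addr, a_len, b_addr, b_len):
    """True if byte ranges [a_addr, a_addr+a_len) and
    [b_addr, b_addr+b_len) intersect."""
    return a_addr < b_addr + b_len and b_addr < a_addr + a_len

ops = expand_vector_ldst("LD", base_reg=8, base_addr=0x100, vl=4)

# elements of the same vector op never overlap each other...
self_clash = any(overlaps(a[2], 4, b[2], 4)
                 for i, a in enumerate(ops) for b in ops[i + 1:])

# ...but an unrelated (misaligned) 4-byte ST at 0x106 hits two of them
hit = [i for i, op in enumerate(ops) if overlaps(op[2], 4, 0x106, 4)]
```

 the first check is the one we already know the answer to, so ideally
it would not cost any hardware; the second is the one that still has to
be done at fine granularity.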

so we kiiinda have to do something about that.

> > In my case an afternoon alternating between staring at graph paper and
> > staring at the ceiling was a close substitute for a clear head...
>
> Suggestion: More alcohol.....

hmm, i know we got some isopropanol somewhere in the house? *cough
splutter choke hiccup* :)

l.