[libre-riscv-dev] Fwd: store computation unit
Luke Kenneth Casson Leighton
lkcl at lkcl.net
Tue Jun 4 00:58:14 BST 2019
Thx Mitch, going to have to absorb and think this one through.
---------- Forwarded message ----------
From: *Mitchalsup* <mitchalsup at aol.com>
Date: Tuesday, June 4, 2019
Subject: Re: store computation unit
To: lkcl at lkcl.net
MitchAlsup at aol.com
From: Luke Kenneth Casson Leighton <lkcl at lkcl.net>
To: Mitchalsup <mitchalsup at aol.com>; Libre-RISCV General Development <libre-riscv-dev at lists.libre-riscv.org>
Sent: Mon, Jun 3, 2019 5:41 pm
Subject: Re: store computation unit
On Mon, Jun 3, 2019 at 7:51 PM Mitchalsup <mitchalsup at aol.com> wrote:
> > Got it. This is remarkably similar to the problem I ran into
> > on the FU-FU Matrix, where ADD r1 r1 r1 would cause
> > the FUFU Matrix to create both a spurious RaW *and* WaR hazard...
> > on itself.
> This is where my writeup has a flaw:: Write dependencies need to be
> flopped and not driven continuously.
err, err... if you mean that the incoming ST signal is only ASSERTed
for one clock cycle (when Issue is also raised), that has me
No, what I mean is closer to the contrapositive of what you wrote::
a) you accumulate the pending write dependencies and store them in a vector,
one bit per dependence; the whole vector is associated with the FU.
b) as dependencies relax, you clear one flop
c) as dependent ops are performed, you clear that flop
d) when the vector contains no set bits, this instruction is no longer being held
up by dependencies.
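
A minimal behavioural sketch of steps a) to d), in Python rather than HDL. The class name, method names and the integer-bitmask encoding are all assumptions made for illustration; they do not come from the original writeup.

```python
class PendingDeps:
    """Per-FU vector of pending dependencies, one bit per dependence.

    Hypothetical model: bit i set means "still waiting on FU i".
    """

    def __init__(self, n_fus):
        self.n_fus = n_fus
        self.bits = 0  # the flops; all clear means nothing is pending

    def capture(self, pending_mask):
        # (a) latch the accumulated dependencies once, at issue time,
        # instead of following a continuously driven signal
        self.bits = pending_mask & ((1 << self.n_fus) - 1)

    def relax(self, fu_index):
        # (b)/(c) clear the single flop whose dependence has been satisfied
        self.bits &= ~(1 << fu_index)

    def ready(self):
        # (d) no set bits: the instruction is no longer held up
        return self.bits == 0
```

For example, after `capture(0b0101)` the instruction becomes ready only once both `relax(0)` and `relax(2)` have been called.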
do you instead mean the *registers* associated with the LD or ST?
> A) you can only be dependent on an instruction older than you are
which is sorted by the horizontal "Issue" signal into the AND gate (in
each cell in the row), which latches only the older instructions (by
> B) with a matching Dest->Source by register name
taken care of by the FU-Regs Matrix
> C) or ST[addr]->LD[addr] by address.
taken care of by the address-matcher.
hmm, key question (thinking ahead), what happens on multi-issue when
a LD and a ST are issued simultaneously? that would create a
dependency loop: as they're in the same cycle, the LD would create a
hazard on the ST, and the ST would create a hazard on the LD.
In multi-issue, it is important to flop both the read pending and write
pending vectors at the start of issue. Each instruction adds (ORs) its
dependencies into the Read and Write pending vectors used by the younger
instructions that are being issued simultaneously. Such dependencies are then
flopped at the FU so the instruction is performed in proper dependence order.
By flopping before adding (OR) you eliminate this potential cycle.
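
One possible reading of the flop-then-OR scheme, sketched in Python. The function name, the slot numbering and the rule that a ST waits on both older LDs and older STs are assumptions for illustration only.

```python
def issue_group(read_pending, write_pending, group):
    """Issue several memory ops in one cycle without a dependency loop.

    read_pending / write_pending: bitmasks of already-issued LDs / STs.
    group: list of (slot, kind) in program order, kind is "LD" or "ST".
    Returns (deps_per_slot, new_read_pending, new_write_pending).
    """
    # Start from the flopped snapshot taken at the start of issue.
    rd, wr = read_pending, write_pending
    deps = {}
    for slot, kind in group:  # oldest first
        bit = 1 << slot
        if kind == "LD":
            deps[slot] = wr       # a load waits on older pending stores
            rd |= bit             # OR in: seen only by younger instructions
        else:
            deps[slot] = rd | wr  # a store waits on older loads and stores
            wr |= bit
    return deps, rd, wr
```

With a LD in slot 0 and a ST in slot 1 issued together, the ST picks up a dependence on the LD, while the LD, which captured its vector before the ST was ORed in, has none, so no loop forms.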
seems to me that for multi-issue memory rules, you'd need to stop the
multi-issue if there were multiple LDs and STs, only allowing LDs to
be issued, or STs to be issued, in the same cycle, but not both.
which is quite reasonable, i feel.
Multi-issue of LD+ST just flops the register dependencies, the memory
dependence matrix dependencies, and the branch shadow dependencies.
Register dependencies relax on Go_Read and Go_Write,
memory dependencies relax on AGEN mismatch, and
branch dependencies relax upon branch confirmation.
> > I realised, firstly, that it *is* a matrix. In other diagrams I
> > and believed that it was a sparse matrix, with cells only present down
> Woops, the whole matrix is present, but wires change direction along the
> Ah wait, ST depends on ST through WaW hazards, and that is solved not
> through the LDST Matrix but through the instruction order "shadowing" that
> I added, which creates full WaW dependence. Does that sound about right?
> My MDM had STs dependent on other stores and this prescribed a minimal
> order on memory that prevents memory from getting too far out of order
> wrt program order. Shadowing maintains then relaxes the dependencies
> based on branch boundaries.
i implemented WaW by issuing a "shadow" - not a *branch* shadow, a
*write* shadow - from every (any) instruction across every (past)
Probably safe, but may constrain parallelism.
this fully preserves instruction completion order. as in, *all*
instructions may only complete in-order. the shadow mechanism is
basically creating that "linked list as a bit-matrix" i mentioned back
in november on comp.arch.
What about instructions that do not deliver results (STs and Branches
it works great: i do appreciate it's a little excessive, as literally
every instruction has to be able to cast a shadow across every other
instruction, making the shadow matrix an N-FUs x N-FUs "thing".
so (indirectly) the actual writing-out of the STs will only be
permitted to occur in-order (strict in-order). when it comes to
multi-issue i'll have to add a store buffer which understands the
instruction order (from that "counter of number of instructions that
are outstanding to be committed" i mentioned before).
> that last LD is where it goes haywire, because if it is even allowed
> to issue, with LD-Pend being HI *and* ST-Pend being HI, deadlock is
> A MemRef is only dependent on older MemRefs, and only dependent
> until they AGEN and mismatch or AGEN match and make forward progress.
so, basically, on every clock, there's going to be at least one
instruction which has zero dependencies, and that one will always be
able to progress.
That is the basic idea, but instead of "on every clock" you should
think "after the oldest instruction writes, there will be at least one
instruction that can start".
hang on... if there are AGEN clashes, they could still be held up, so
it's important *only* to do the "clash" test on those pending
(potential) memory operations to be allowed to progress in this clock.
You only relax the MDM on addresses that have been AGENed.
Addresses that have not been AGENed remain pending.
as in: you can't just throw the AGEN'd addresses into a table and
blindly (unconditionally) do a match: the ones that are *not* being
considered for forward progress *must* be filtered out.
As stated above.
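
The filtering rule discussed here (relax only against addresses that have actually been AGENed) might be sketched as follows. The function name and the dict/set representation are assumptions; a real MDM would be a bit-matrix, not Python containers.

```python
def relax_memory_deps(deps, my_addr, older):
    """Return the older memrefs this memref must still wait on.

    deps: set of ids of older memrefs we currently depend on.
    my_addr: this memref's AGENed address.
    older: {id: address, or None if that memref has not been AGENed yet}.
    """
    remaining = set()
    for ref in deps:
        addr = older.get(ref)
        if addr is None:
            remaining.add(ref)  # not AGENed yet: stays pending
        elif addr == my_addr:
            remaining.add(ref)  # AGEN match: wait for the older op to progress
        # AGEN mismatch: the dependence relaxes, so it is dropped
    return remaining
```

A full-address equality test is used here only to keep the sketch short; the partial-address compare discussed later in the thread would take its place.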
> > Now this makes sense. AGEN match I had an idea to use a
> > 16 bit bitmap to match the lower address bits and use the truncated
> > idea (bits 4 to 15) for the rest, ignore above bits 15.
> Does that sound reasonable, or excessive? Straight compare on bits 4 thru
> 10 instead perhaps?
> I have used bits <11:6> as they are not translated (4KB pages) and larger
> than a cache line (64 bytes).
> I have used bits <11:4> when the L1 cache was QuadW sized and the L2
> cache was Line sized.
okaaaay. that makes a lot of sense. 1<<12==4k
> The important thing is that the vast majority of stuff mismatches quickly,
> enabling more parallelism.
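
A small sketch of the partial-address compare on bits <11:6> (or <11:4>). The function name and its default parameters are illustrative assumptions. A mismatch proves that the two addresses differ in the compared bits; a match only means that they might conflict.

```python
def may_alias(addr_a, addr_b, hi=11, lo=6):
    """Conservative clash test on address bits <hi:lo> only.

    Defaults compare bits <11:6>: untranslated with 4KB pages, and above
    the byte offset within a 64-byte cache line.
    """
    mask = ((1 << (hi - lo + 1)) - 1) << lo
    return (addr_a & mask) == (addr_b & mask)
```

Two addresses in the same 64-byte line always match; two addresses that differ only above bit 11 also match, which is a false positive that costs parallelism but not correctness.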
we're doing a GPU, and i'm concerned about the lower bits (fine
granularity on data structures being processed in parallel),
particularly as the vectorisation is done by transparent multi-issue.
You will want to segregate vector MemRefs into 3 categories:
1) sequential addresses (stride = 1)
2) strided addresses (where stride != 1)
3) gather/scatter (where no rules do any good)
0) ATOMIC, which effectively looks like a stride = 0 case.
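
The categories above could be told apart from a list of element addresses roughly like this. The function name, the byte-address representation and the `atomic` flag are assumptions; "stride = 1" is taken to mean one element, i.e. `elem_size` bytes.

```python
def classify_vector_memref(addrs, elem_size, atomic=False):
    """Sort a vector memref into one of the categories listed above."""
    if atomic:
        return "atomic"            # treated like the stride = 0 case
    if len(addrs) < 2:
        return "sequential"        # a single element has no stride to check
    strides = {b - a for a, b in zip(addrs, addrs[1:])}
    if len(strides) > 1:
        return "gather/scatter"    # no single stride describes the addresses
    stride = strides.pop()
    return "sequential" if stride == elem_size else "strided"
```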
i.e. a vector LD or ST is done by issuing sequential operations that
sequentially increase both the address to be AGEN'd (a sequential
offset added) *and* sequentially increase the register number.
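
As a concrete picture of that expansion, here is a hedged sketch; the function name and the tuple layout are invented for illustration and say nothing about how the actual issue logic is built.

```python
def expand_vector_memop(op, base_addr, base_reg, vl, elem_size):
    """Expand one vector LD/ST into vl scalar ops.

    Each element op gets a sequentially increasing address offset
    and a sequentially increasing register number.
    """
    return [(op, base_addr + i * elem_size, base_reg + i) for i in range(vl)]
```

For instance, a 3-element LD of 4-byte elements from address 0x100 into r8 becomes loads of 0x100, 0x104 and 0x108 into r8, r9 and r10.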
So, you get a MissBuffer:LineHit with a CacheMiss so you know you don't
need to fetch because somebody is already Fetching it. So you end up
with multiple outstanding LDs waiting for a single Line to be returned from
we know that these do not overlap (with each other), and yet still
have to detect overlaps (from other instructions, vectorised or
so we kiiinda have to do something about that.
You are going to have to compare (at least the LOBs of) all AGENs against
the already AGENed addresses and against the MissBuffer addresses
in order to properly juggle and separate address dependencies from
> > In my case an afternoon alternating between staring at graph paper and
> > staring at the ceiling was a close substitute for a clear head...
> Suggestion: More alcohol.....
hmm, i know we got some isopropanol somewhere in the house? *cough
splutter choke hiccup* :)
Wrong kind of Alcohol.
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68