[Libre-soc-isa] [Bug 213] SimpleV Standard writeup needed

Fri Oct 9 17:12:17 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=213

--- Comment #46 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
so to explain: anything exclusively "in-flight" which has inter-dependencies is
seriously problematic.

by "in-flight" this refers to all intermediary results that have come from
registers (GPR, XER, CRs) and have not yet been stored back in the same.

so let's go over the micro-architectural design implications of the idea of:

* computing vector results
* creating a comparison vector
* creating a "scalar summary" of that comp vector
* using a micro-op to do so
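
as a purely illustrative sketch of the steps listed above (a python model, not HDL): the element-wise add, the compare-against-zero and the use of any() as the "scalar summary" are all assumptions made for this example, and the function name is hypothetical. none of them are specified in this comment.

```python
def vector_op_with_summary(a, b, vl):
    """Toy model of: vector results -> comparison vector -> scalar summary."""
    # step 1: compute vector results (an add is ASSUMED for illustration)
    results = [a[i] + b[i] for i in range(vl)]
    # step 2: create a comparison vector (compare-with-zero is ASSUMED)
    cmp_bits = [r == 0 for r in results]
    # step 3: create a scalar summary of the comparison vector
    # (any() is ASSUMED; all() would be an equally plausible choice)
    summary = any(cmp_bits)
    return results, cmp_bits, summary


results, cmp_bits, summary = vector_op_with_summary([1, 2, 3], [-1, 0, 0], 3)
print(results)   # [0, 2, 3]
print(cmp_bits)  # [True, False, False]
print(summary)   # True
```

the point to note is that "summary" depends on every element of cmp_bits, which is what creates the inter-dependency between in-flight results discussed below.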

the requirements in an OoO design are: the entire operation MUST be atomic as
far as dependencies are concerned

OR

it must be "re-entrant" i.e. possible to interrupt, store state, and continue.

(this latter you rejected, jacob, advocating instead "complete the vector op,
disregard latency" which has detrimental implications i won't go into right
now).

so we are going with "full atomic high latency transaction with in-flight
micro-ops" but still permitting cancellation because whilst interrupts are
"reasonable" to ignore, cancellation most certainly is not.

with that in mind, we can begin the architectural design analysis.

* the micro-op must have access to the bitvector of compares in an in-flight
register.
* the micro-op must have access to ALL elements of the in-flight vector (all
64)
* the RESULT vector must ALSO be in-flight i.e. not permitted to write to GPRs
during this time
* why? because the scalar compare is not ready yet and the entire operation is
both atomic and cancellable
* therefore we must have 64 Reservation Stations to hold that in-flight data
* this is PER PIPELINE (!)
* therefore there must be 64 LOGICAL RSes, 64 ALU RSes, 64 FP RSes, 64 SHIFTROT
RSes, 64 DIV RSes.

this comes to a whopping 200+ Reservation Stations (5 x 64 = 320 for the
pipelines listed above) which would require somewhere of the order of 2
million gates.
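
the Reservation Station count can be checked with a few lines of python. this is only arithmetic on the figures in the list above: the pipeline names and the 64-element vector come from this comment, while the function and constant names are hypothetical.

```python
# figures taken from the bullet list above; names are hypothetical
VL_MAX = 64
PIPELINES = ["LOGICAL", "ALU", "FP", "SHIFTROT", "DIV"]


def atomic_rs_count(vl_max, pipelines):
    """RSes needed if every pipeline must hold a full in-flight vector."""
    return vl_max * len(pipelines)


print(atomic_rs_count(VL_MAX, PIPELINES))  # 320
```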

i'm going through this in detail to show you that even the "simplest-sounding"
idea has far-reaching microarchitectural implications that can end up as
completely impractical to implement.

similar logic also applies to the "simple-sounding" idea of creating bitlevel
Dependency Matrices.

a "re-entrant" design (one that does not have the micro-op creating an
unnecessary dependency) on the other hand can be limited to a "window" (Lanes)
at whatever silicon depth the implementor sees fit to choose.

shadowing (mask cancellation) is actually still a bottleneck in the re-entrant
case where the in-flight vector is too large for the RSes. however, it is
perfectly normal to "stall" further issue until the end of the vector
instruction is reached.  therefore high performance (OoO issue) is achieved by
keeping VL well within the bounds of the available RSes.
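
a toy model of that trade-off, assuming a single pipeline with num_rs Reservation Stations: the vector is split into batches that fit the RSes, and further issue is treated as stalled whenever VL exceeds num_rs. the function name and the "vl > num_rs" stall rule are simplifying assumptions for illustration, not a description of the actual issue logic.

```python
def issue_plan(vl, num_rs):
    """Split a vector op of length vl into (start, end) batches of at most
    num_rs elements, and report whether further issue would stall."""
    batches = [(start, min(start + num_rs, vl))
               for start in range(0, vl, num_rs)]
    # ASSUMPTION: issue stalls whenever the vector does not fit the RSes
    stalls_issue = vl > num_rs
    return batches, stalls_issue


print(issue_plan(4, 8))   # ([(0, 4)], False)
print(issue_plan(16, 8))  # ([(0, 8), (8, 16)], True)
```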

however for smaller VLs re-entrancy is perfectly possible and achievable (and
does not cause data corruption when no "shadow" crosses that operation) by
allowing partial vector results to be stored fully before the interrupt point,
then saving the current hardware value of "i" in "for i in range(VL)" as part
of context-switched state.

the condition here is that you cannot blithely save arbitrary elements
anywhere in the vector: "all elements from 0 to the current value of i" *must*
be saved before the interrupt is allowed to proceed.

any in-flight results *may* be mask-cancelled, but you have to roll back "i" to
the last fully saved element before allowing the interrupt/exception to
proceed.

these are the kinds of considerations that need to be taken into account even
for the "simplest" sounding idea! it's pretty mental.
