[libre-riscv-dev] KCP53000B micro-architecture thoughts

Wed May 29 23:13:57 BST 2019

So, I had what I believe to be an epiphany today at work, and I wanted
to get it down in a retrievable, reviewable form before I left the
office for the day.  Luke, if you have the time to review and see if
I'm on the right track here, it'd be very helpful.

I was thinking about how I would implement the IXU for the KCP53000B,
and I realized that the transition from 6502-style PLA-based decoding
to CDC-style FUs would require some fundamental changes in how traps
and exceptions are handled if synchronous trap behavior is to be
preserved.  The 6502 (and, similarly, the existing KCP53000 design)
treats interrupts and traps as /pseudo-instructions/.  That is, they
sample the state of IRQ, NMI, et al. at the same time as the next
instruction is fetched.  Ergo, regardless of what opcode arrives, IRQ,
NMI, et al. are treated as the "upper bits" of the opcode, and thus,
if set, get decoded as a special kind of trap "instruction."  It's a
clever hack that works great, and uses a minimum of logic.
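
To make the pseudo-instruction idea concrete, here is a minimal Python
sketch of the decode step.  The 8-bit opcode width, the bit positions,
the NMI-over-IRQ ordering, and every name in it are assumptions chosen
for illustration; this is not the actual KCP53000 decoder.

```python
# Hypothetical widths and encodings, chosen only for illustration.
OPCODE_BITS = 8
IRQ_BIT = 1 << OPCODE_BITS        # first "upper bit" above the opcode
NMI_BIT = 1 << (OPCODE_BITS + 1)  # second "upper bit"

def extended_opcode(opcode: int, irq: bool, nmi: bool) -> int:
    """Concatenate the sampled interrupt lines above the fetched opcode."""
    ext = opcode & 0xFF
    if irq:
        ext |= IRQ_BIT
    if nmi:
        ext |= NMI_BIT
    return ext

def decode(ext: int) -> str:
    """PLA-style decode: a set interrupt bit overrides whatever opcode arrived."""
    if ext & NMI_BIT:          # assumed to outrank IRQ in this sketch
        return "TRAP_NMI"
    if ext & IRQ_BIT:
        return "TRAP_IRQ"
    return f"EXEC_{ext & 0xFF:02X}"
```

Since the interrupt lines are sampled with the fetch, the decoder
never needs a separate "interrupt pending" path; it just sees a wider
opcode.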

However, with the presence of multiple FUs to execute instructions
with, it seems better to take the same approach as a
stack-architecture CPU: the subsequent CPU state is computed as a
function of the current CPU state.  This includes not only the target
register value, but also meta-state, like privilege level and the next
program counter to fetch from if a control-flow change is required.  *Only
external interrupts appear as higher-order opcode bits.*  Thus, if an
instruction flags an error (e.g., division by zero), then the
/resulting/ mstatus bits are set so that the CPU is now in an elevated
privilege, the previous privilege is set correctly, and so forth.
Since the FU has to arbitrate for access to the common data bus (to
use Tomasulo's term; I forget the proper term in Thornton's text at
the moment), these mstatus changes become effective if and only if
the FU is ever granted access to the CDB.
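
A rough Python model of "next state as a function of current state"
might look like the following.  The CpuState fields, the FuResult
record, and the divide_fu and commit helpers are all hypothetical
names, and division by zero is used only because it is the example in
the text above.  The one property it tries to show is that the trap's
effect on privilege and PC exists only if the FU wins the CDB.

```python
from dataclasses import dataclass, replace
from typing import Optional

MACHINE = 3  # RISC-V machine-mode privilege encoding

@dataclass(frozen=True)
class CpuState:
    pc: int
    priv: int       # current privilege level
    prev_priv: int  # previous privilege, as mstatus would record it
    mtvec: int      # trap vector base

@dataclass(frozen=True)
class FuResult:
    rd_value: Optional[int]  # target register value, if any
    trap: bool               # True if the instruction flagged an error

def divide_fu(a: int, b: int) -> FuResult:
    """Hypothetical divider FU that flags a trap instead of dividing by zero."""
    if b == 0:
        return FuResult(rd_value=None, trap=True)
    return FuResult(rd_value=a // b, trap=False)

def commit(state: CpuState, result: FuResult, granted: bool) -> CpuState:
    """Compute the next CPU state; nothing changes unless the CDB was granted."""
    if not granted:
        return state
    if result.trap:
        # Meta-state is part of the result: elevate privilege, remember the
        # old one, and redirect the fetch to the trap vector.
        return replace(state, pc=state.mtvec, prev_priv=state.priv, priv=MACHINE)
    return replace(state, pc=state.pc + 4)
```

Because commit is a pure function of (state, result, granted), an FU
that never gets the bus leaves no trace, which is the behavior the
ABORT mechanism relies on.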

Which it may not be; if FU #1 asserts an ABORT signal on the CDB
before FU #2 successfully arbitrates for the CDB, then FU #2 is
obligated to rescind its arbitration request; thus, FU #2 never drives
its results, /effectively equivalent/ to flushing the FU and returning
it to its quiescent state.  If it's backed by a state machine, ABORT
acts like a reset signal -- in any state, ABORT must cause the FU to
reset to its quiescent state.  Presumably, other units will be making
their own updates at the same time: e.g., the CSRU will be
transitioning into an elevated privilege while the IFU is fetching
from the PC stored in mtvec, and so forth.
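
The "ABORT acts like a reset" rule can be sketched as a tiny state
machine.  The three states, the signal names, and the fu_next and
drives_cdb functions are assumptions made up for this sketch; a real
FU would likely have more states, but the ABORT term would look the
same.

```python
from enum import Enum, auto

class FuState(Enum):
    IDLE = auto()        # quiescent
    EXECUTING = auto()   # computing a result
    REQUESTING = auto()  # result ready, arbitrating for the CDB

def fu_next(state: FuState, issue: bool, done: bool,
            grant: bool, abort: bool) -> FuState:
    """Next-state function for one FU; ABORT wins over everything else."""
    if abort:
        return FuState.IDLE  # from any state, back to quiescent
    if state is FuState.IDLE:
        return FuState.EXECUTING if issue else FuState.IDLE
    if state is FuState.EXECUTING:
        return FuState.REQUESTING if done else FuState.EXECUTING
    # REQUESTING: hold the request until the arbiter grants the bus.
    return FuState.IDLE if grant else FuState.REQUESTING

def drives_cdb(state: FuState, grant: bool, abort: bool) -> bool:
    """An FU drives its result only if granted and not aborted."""
    return state is FuState.REQUESTING and grant and not abort
```

Note that drives_cdb is gated by abort as well, so an FU that sees
ABORT in the same cycle it would have been granted still rescinds its
request.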

Some thoughts (in no particular order or consequence):

- This arrangement enforces synchronous behavior and preserves the
fact that external interrupts have higher priority than all
internally generated synchronous traps.  If instructions with the
"external interrupt" opcode bit set are issued to a dedicated
"Interrupt Unit" for execution, then it doesn't matter what the
instruction actually is -- the unit can assert ABORT once it has
successfully arbitrated for the CDB, thus blocking all subsequently
issued instructions from write-back.  Such "instructions" have a null
read reservation, and a write reservation of ALL CPU registers.
Though, honestly, I think this functionality can be embedded directly
into the issue dispatcher; I don't think we need a dedicated FU to
implement it.

- Instead of broadcasting the complete value of mstatus on the CDB,
broadcast the *changes* instead.  This has three benefits: it reduces
the signal routing overhead, it reduces the "flags register" problem
(where all instructions have an implicit read/write dependency on that
register), and it's easy to encode as 1-hot signals, which means you
can OR them together to synthesize the aforementioned ABORT.

- Not all types of FUs will have a need to alter the mstatus register
or related CSRs; deltas for these FUs are *always* implicitly 0, and
so this simplifies circuitry by simply not routing any wires to or
from them.

- As long as *all* FUs are sensitive to an ABORT signal of some kind,
synchronous trap behavior is preserved regardless of how many or what
kind of FUs exist in the processor's hot path.  ABORT also inhibits
instruction issue until all other FUs are known to have reset to a
quiescent state.

- Because the total system state is computed and provided on the CDB
(even if in delta form) as a /function/ of the current system state as
of the time of instruction issue, the individual FU state transition
logic is significantly simplified (think, gated ring counter); or even
mostly eliminated altogether in pipelined versions.

- This approach breaks down hard in superscalar designs, since there
will be multiple CDBs.  To work around this, you'll need more complex
logic to preserve the illusion of synchronous traps.  I don't know
what this logic would look like.  I'm not sure register renaming is
useful here.
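
For the point about broadcasting mstatus *changes* as 1-hot signals,
a small sketch of how ABORT could fall out of an OR-reduction.  The
three delta lines and their names are invented for illustration; the
real set of deltas would depend on which mstatus fields a trap has to
touch.

```python
from functools import reduce
from operator import or_

# Hypothetical one-hot delta lines; names and positions are assumptions.
SET_MACHINE_MODE = 1 << 0  # enter elevated privilege
SAVE_PREV_PRIV = 1 << 1    # record the previous privilege
CLEAR_INT_ENABLE = 1 << 2  # mask further interrupts
DELTA_WIDTH = 3
NO_CHANGE = 0              # what an FU with no mstatus wiring contributes

def delta_lines(delta: int) -> list:
    """Split the delta word into its individual one-hot wires."""
    return [(delta >> i) & 1 for i in range(DELTA_WIDTH)]

def abort(delta: int) -> bool:
    """ABORT is the OR of every delta line seen on the CDB."""
    return bool(reduce(or_, delta_lines(delta), 0))
```

An ordinary ALU result carries NO_CHANGE and so never raises ABORT,
while any trapping result raises it as a side effect of announcing
its mstatus changes.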

-- 
Samuel A. Falvo II