[libre-riscv-dev] KCP53000B micro-architecture thoughts

Thu May 30 12:10:06 BST 2019

On Thu, May 30, 2019 at 6:35 AM Samuel Falvo II <sam.falvo at gmail.com> wrote:
>
> On Wed, May 29, 2019 at 7:41 PM Luke Kenneth Casson Leighton
> <lkcl at lkcl.net> wrote:
> >  took me about 5-6 months of study, 2 of which were near-full-time
> > communication with mitch.
>
> Oh, good.  I don't feel so bad about myself now!  ;-)

 :)

> (In a subsequent response, you mention go_die as being the same as
> ABORT; I'll need to re-read Alsup's texts in more detail later with
> this in mind.  His naming conventions frequently confuse me, and his
> lack of timing diagrams to illustrate chronological relationships
> between signals frequently leaves my head spinning.

 it's described in words.

> You may recall I
> frequently have confusion around the timing of GO_READ and GO_WRITE,
> etc.)

 the ordering is:

 * issue
 * go_read
 * go_write

and these *must* occur on separate clock cycles.  go_die can cancel
(simultaneously) both go_read and go_write.
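
as a rough illustration of that ordering, here is a tiny behavioural model in plain python (not nmigen, and not the actual scoreboard code). the state names and the run() helper are hypothetical, and it assumes go_die simply takes priority over everything else in the same cycle:

```python
def run(trace):
    """Replay a per-cycle list of asserted signal names for one Function Unit.

    Hypothetical state names: idle -> issued -> read -> idle.
    Only one step of the issue / go_read / go_write sequence is
    honoured per clock cycle, so the three can never coincide.
    """
    state = "idle"
    history = []
    for signals in trace:
        if "go_die" in signals:
            # assumption: go_die overrides go_read and go_write
            state = "idle"
        elif state == "idle" and "issue" in signals:
            state = "issued"
        elif state == "issued" and "go_read" in signals:
            state = "read"
        elif state == "read" and "go_write" in signals:
            state = "idle"
        history.append(state)
    return history


# the normal three-cycle sequence
normal = run([{"issue"}, {"go_read"}, {"go_write"}])
# go_read raised together with issue is not honoured in this model
too_early = run([{"issue", "go_read"}, {"go_write"}])
# go_die cancels a go_read raised in the same cycle
cancelled = run([{"issue"}, {"go_read", "go_die"}])
```

in this model a go_read raised in the same cycle as issue has no effect, which is one simple way to express "must occur on separate clock cycles".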

the computation unit generates three critical signals which, because
of the potential for combinatorial loops, *must* be
clock-delay-synced:

* busy
* read_release
* write_release

according to Thornton and Mitch, you are supposed to do this (i am
ignoring that the clock is both POS- and NEG- edge driven):

* set the FU-Regs Dependency Matrix latches on one [half-] clock edge
* set the FU-FU Dependency Matrix latches on the next [half-] clock edge
* set the Computation Unit latches on the *NEXT* [half-] clock edge

in the design that i have written, i do *not* have clock-delays
between the matrices: instead i use a mish-mash of comb and sync on
the SRLatch pseudo-implementation, and it seems to work really well.

what i *do* have is that the Computation Unit sets its latch input
signals ONE clock delay behind the FU-Regs and FU-FU Matrices.

this is because the Computation Unit is what generates the absolutely
critical "loop" information (busy, read_release, write_release) that
completely changes the state for the next clock cycle.

however, the busy, read_release and write_release can be safely
generated combinatorially from the Computation Unit's SRLatch output
information, because the "firebreak" (those latches being set using
sync) has already occurred.

here's one *row* of the Dependency Matrix - don't be fooled by the
SRLatch module, i made SRLatch a multi-bit *vector* of SR-Latches:
https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/scoreboard/dependence_cell.py;hb=HEAD#l55

this is the Computation Unit.  note the fact that whilst the SRLatch
is asynchronous (combinatorial output based on its inputs), the
latch's *inputs* are set with SYNC.
https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/experiment/compalu.py;h=f517f5ccc7ac63b74fd8f802ee333272bba29251;hb=d82aea2ddc957f5135227bb2439e702a961f4d4c#l80

and here's SRLatch (and also a function called "latchregister"):
https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/nmutil/latch.py;h=460661ba30582a5b6a006602a27850ccb1fc0de3;hb=7ba19cf0b3212f0bbfa9bc1a88174027406e1e87#l24

latchregister is basically a DFF; however, its output is set
*combinatorially*, so it acts just like a pass-through Memory Cell.
i use the same trick in the SRLatch code.
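
a minimal plain-python sketch of that pass-through behaviour, for illustration only. the class and the names incoming / settrue are hypothetical stand-ins rather than the nmutil API, and the clock edge is modelled as an explicit method call:

```python
class LatchRegisterModel:
    """Behavioural model of a register with a combinatorial bypass.

    While 'settrue' is high the input is visible on the output in the
    same cycle, and is also captured on the clock edge.  While it is
    low the output is the previously captured value.
    """

    def __init__(self, reset=0):
        self.reg = reset

    def output(self, incoming, settrue):
        # combinatorial path: no clock edge needed
        return incoming if settrue else self.reg

    def clock(self, incoming, settrue):
        # synchronous path: takes effect at the clock edge
        if settrue:
            self.reg = incoming


lr = LatchRegisterModel()
same_cycle = lr.output(0x5A, settrue=True)   # input passes straight through
lr.clock(0x5A, settrue=True)                 # ...and is captured on the edge
held = lr.output(0x00, settrue=False)        # captured value is held
```

the point of the bypass is that a consumer sees the value in the cycle it is presented, rather than one cycle later as with a plain DFF.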

> >  that's just a change of some flag that changes the behaviour of
> > instructions.  as long as that flag is carried along with the
> > instructions, to the FUs that need to respect that flag, what is the
> > problem?
>
> That's not the problem I was considering.  The problem I was thinking
> about was where to put the logic which /alters/ those flags.

 according to the scoreboard-style design concept / ethos / strategy,
that bizarrely doesn't actually matter as much as the protection *of*
the flags through the Dep-Matrices expressing the dependencies of
access *to* the logic that reads - and writes - the flag.

 does that make any sense?

 this is why i am mulling over the idea of treating CSRs (and other
state-changing "stuff") as "something to be protected by a Dependency
Matrix".

> A
> dedicated "interrupt unit" is one possibility.  Keeping it in the
> instruction dispatcher is another.  I was just speculating on
> different design approaches.

 yehyeh.  the CSR-Dep-Matrix concept... i dunno, it seems to me to be
something that i've never seen described anywhere.

 hmmm i think it's time for some comp.arch and hw-dev posts, there.

> Things to consider for the future perhaps; I'm just trying to build up
> an understanding and an implementation to test my understanding.  I'm
> seriously considering abandoning the 6502-style "one giant PLA to rule
> them all" decoder in favor of a *small* out-of-order approach, just to
> play with the ideas and confirm my understanding (one CSR-FU, one
> load/store unit which also does addition/subtraction, one logical unit
> to do XOR, AND, etc., and one branch unit to deal with JAL, JALR, and
> the conditional branches).

 sounds very reasonable to me, as long as each of those are one-cycle
Computation Units, or preferably combinatorial, so that there's no
additional delay when the CU gets its operands.

 if there's an extra cycle of delay, then if you have only one CU per
type of operation, and there are N such operations in a row (LD followed
by ADD, ADD, ADD, ADD, then ST, then ADD SUB SUB for example are all
the same type), the results will arrive only on every THIRD clock cycle,
because issue MUST be raised on a SEPARATE cycle from go_read, which MUST
be raised on a separate cycle from go_write.

 if you can put up with that (knowing in advance that it's an
acceptable trade-off), then yes it will make the overall design
simpler and smaller.

 if it's not acceptable then maybe add one extra CU - an ADD/SUB CU -
now you have 2 which can handle ADD/SUB, and you'd only get a 2/3
stalling when a LD-ADD-ADD-ST-ADD-SUB sequence is encountered.
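
the trade-off above can be put into numbers with a deliberately simplified model. it assumes each CU is occupied for exactly three cycles (issue, go_read, go_write), at most one instruction issues per cycle, and every instruction in the run can use any of the CUs; register dependencies between the instructions are ignored. completion_cycles() is a hypothetical helper, not part of any existing code:

```python
def completion_cycles(n_ops, n_units, occupancy=3):
    """Return the cycle at which each of n_ops same-type operations completes."""
    free_at = [0] * n_units      # cycle at which each unit can accept a new op
    cycle = 0                    # earliest cycle the next op may issue
    done = []
    for _ in range(n_ops):
        # pick the unit that becomes free soonest
        unit = min(range(n_units), key=lambda u: free_at[u])
        issue = max(cycle, free_at[unit])
        free_at[unit] = issue + occupancy
        done.append(issue + occupancy)
        cycle = issue + 1        # at most one issue per cycle
    return done


one_cu = completion_cycles(4, 1)   # [3, 6, 9, 12]: one result every third cycle
two_cu = completion_cycles(4, 2)   # [3, 4, 6, 7]: two results per three cycles
```

under these assumptions a third CU of the same type would be needed before a long run of same-type operations could complete one result per cycle.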

> Then, after that, I was thinking of going
> back to re-read Thornton and Mitch Alsup's chapters with my design
> experience helping to inform my understanding and learn some of the
> finer details.

 yeah that's a really good strategy.  learn by doing.  honestly i
couldn't have got where i am without (a) actually implementing it (b)
having someone to help and discuss things.

 so, am more than happy to do the same for you.

> >  unfortunately, i've not thought through the CSRs enough to be able to
> > comment.  off the top of my head, though, if a CSR may be affected by
> > any given instruction, then to my mind it makes sense to simply treat
> > it as... just another destination "register" as part of the Dependency
> > Matrices (this time it would be the "CSR Register Dep Matrix", being
> > the 3rd Reg-Matrix, next to FP Reg Matrix and INT Reg Matrix)
>
> This makes complete sense to me; I'm implicitly trying to make my
> design small and compact enough for an iCE40HX8K FPGA.  Considering
> the number of CSRs, each row of the matrix I think will end up using
> more DFFs in an FPGA than I can support in such a small package.

 how many writable CSRs do you need, again?

> (It probably still won't fit; but, I'd like to try anyway.  :) )

 :)

 can you do me a favour (and yourself), and test SRLatch(sync=False,
llen=8), running it through the FPGA conversion/compile process and
let me know how many LUTs it uses?

 or, can you point me at a tutorial or a Makefile that will let me find out?

 attached is the yosys graph for the SRLatch after proc and opt are
run.  yes, that's a *multi-bit* q/q_int, *multi-bit* s and *multi-bit*
r input, so the DFF, MUX, OR, AND and NOT are all 4-bit wide.

 yes it's what's known as a "set-prioritising" SR-NOR Latch, it's
stable because if S is set, R is ignored.
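
a plain-python behavioural sketch of a set-prioritising, multi-bit SR latch, to show why it is stable. it assumes the next state is (q & ~r) | s per bit, which matches "if S is set, R is ignored"; the class is a hypothetical model for illustration, not the nmutil SRLatch itself, although it borrows the sync and llen parameter names from the call above:

```python
class SRLatchModel:
    """Vector of llen set-priority SR latches, one per bit."""

    def __init__(self, sync=True, llen=1):
        self.mask = (1 << llen) - 1
        self.sync = sync
        self.q_int = 0           # registered state, updated on the clock edge

    def _next(self, s, r):
        # reset clears a bit, but set wins when both are raised
        return ((self.q_int & ~r) | s) & self.mask

    def q(self, s, r):
        # sync=True: output is the registered state only
        # sync=False: output also reflects s and r combinatorially
        return self.q_int if self.sync else self._next(s, r)

    def clock(self, s, r):
        self.q_int = self._next(s, r)


latch = SRLatchModel(sync=False, llen=4)
both = latch.q(s=0b0011, r=0b0001)    # bit 0 has S and R raised: S wins
latch.clock(s=0b0011, r=0b0001)
after = latch.q(s=0, r=0b0010)        # bit 1 is reset, bit 0 is held
```

because S always dominates, there is no input combination that leaves a bit undefined, unlike the S=R=1 case of a bare two-NOR latch.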

 all that logic just to replace what 2 NOR gates would do... *sigh*...

l.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2019-05-30_12-04.jpg
Type: image/jpeg
Size: 23909 bytes
Desc: not available
URL: <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/attachments/20190530/01310af6/attachment-0001.jpg>