[libre-riscv-dev] [Bug 276] SR NAND Latch needed in nmigen

Fri Apr 3 13:05:29 BST 2020

http://bugs.libre-riscv.org/show_bug.cgi?id=276

--- Comment #2 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to whitequark from comment #1)
> (The following comment is copied from the email I sent earlier, so that it
> is accessible publicly.)

appreciated, whitequark.

summary (the rest is archive-suitable / context) is: we'd like to be able
to do an "exact" netlist (we already have a semi-equivalent one, and this
is working).

> Of the things you mentioned, this includes only the SR latch issue.
> 
> This issue is unfortunately quite involved. (n)Migen is designed to
> connect large islands of purely synchronous logic with a few async
> bridges. It does not, very deliberately, directly support arbitrary
> asynchronous primitives like the SR latch. So you can't just drop one
> into your design and have it work the same way normal logic works.

yes.  very much aware that standard proprietary commercial tools completely
dropped support for SR latches.

(reiterating this for the archives:
however we really cannot use DFFs, here.  that is 10 gates rather than 2,
and the number of Cells needed is massive: one of the Matrices may need
to be 128 x 30, with at least four maybe five latches per Cell.  that's
192,000 gates *just for that Matrix* if we use DFFs.
if we use SR Latches, it's 38400 gates, which is tolerable)
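
the arithmetic behind those two figures, as a small sketch.  it assumes the
worst case of five latches per Cell and takes the 10-gate and 2-gate costs
quoted above at face value; the names are made up for illustration.

```python
# Back-of-envelope gate count for the largest Matrix (128 x 30).
ROWS, COLS = 128, 30
LATCHES_PER_CELL = 5      # assumption: worst case of "four maybe five"
GATES_PER_DFF = 10        # figure quoted above
GATES_PER_SR_LATCH = 2    # two cross-coupled NAND gates


def matrix_gates(gates_per_latch, latches_per_cell=LATCHES_PER_CELL):
    """Gates spent on latches alone in one ROWS x COLS Matrix."""
    return ROWS * COLS * latches_per_cell * gates_per_latch


dff_gates = matrix_gates(GATES_PER_DFF)
sr_gates = matrix_gates(GATES_PER_SR_LATCH)
print(dff_gates, sr_gates)  # 192000 38400
```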

> However, you're in luck because better support for asynchronous
> signals was something I took into account when designing nMigen and
> both of the new simulators, pysim and cxxsim.

whew :)

> To solve this issue, we'll need to work together to determine your
> specific use case. Do you want to use the exact same netlist for
> nMigen simulation and synthesis?

i honestly don't know: we're open to suggestions.  (having read ahead)
if we can do both (and use one for rapid prototyping and the other as
a cross-check before moving to synthesis) that would be great.

right now we have something that "works" however
it uses DFFs, not SR latches (see latch.py link, below).

> If not, it means you can model
> whatever it is that contains the SR latches (I'm not sure what
> matrices you're referring to, here)

Out-of-Order Read/Write Hazard detection and avoidance matrices.
see "Modifications to Dependency Cell"
https://libre-riscv.org/3d_gpu/architecture/6600scoreboard/

* the DMs encode all registers in Unary (single bit activation).
  thus one "row" represents all registers used for a given FunctionUnit
  [and access to its ALU].
* each Cell thus records, in bit-level (unary) form, the fact that read/write
  for a given function (add) is needed, by raising the "Reg 5 needs READ"
  and "Reg 7 needs WRITE" SR Latches
* these now-active signals indicate to subsequent instructions
  "hello i have a read hazard, hello i have a write hazard" respectively.
  any attempt to use those registers thus STOPs that FunctionUnit from executing.
* hazards get cleared out once results are written
* when there are zero hazards (in any given row), that instruction
  becomes free-and-clear to proceed.
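
a behavioural sketch of the bullet points above, in plain python rather than
nmigen.  it is not the code in fu_reg_matrix.py: the class, the method names
and the exact hazard rule (a pending write blocks later reads and writes, a
pending read blocks later writes) are assumptions made for illustration.

```python
from functools import reduce


def unary(regs):
    """Encode register numbers as a unary mask: bit N set means 'Reg N'."""
    return reduce(lambda mask, r: mask | (1 << r), regs, 0)


class DependencyMatrix:
    """Toy model: one row per FunctionUnit, one latch bit per register."""

    def __init__(self, n_fus):
        self.rd_pend = [0] * n_fus  # "Reg N needs READ" latches, per row
        self.wr_pend = [0] * n_fus  # "Reg N needs WRITE" latches, per row

    def issue(self, fu, read_regs, write_regs):
        """Raise (set) the latches for the registers this FU will use."""
        self.rd_pend[fu] |= unary(read_regs)
        self.wr_pend[fu] |= unary(write_regs)

    def complete(self, fu):
        """Results written: clear (reset) every latch in this FU's row."""
        self.rd_pend[fu] = 0
        self.wr_pend[fu] = 0

    def has_hazard(self, fu, read_regs, write_regs):
        """True if another FU's pending access conflicts (assumed rule)."""
        rd, wr = unary(read_regs), unary(write_regs)
        for other, (o_rd, o_wr) in enumerate(zip(self.rd_pend, self.wr_pend)):
            if other == fu:
                continue
            if (rd | wr) & o_wr or wr & o_rd:
                return True
        return False


dm = DependencyMatrix(n_fus=2)
dm.issue(0, read_regs=[5], write_regs=[7])   # FU0: reads Reg 5, writes Reg 7
blocked = dm.has_hazard(1, read_regs=[7], write_regs=[8])
dm.complete(0)                               # FU0 has written its result
free = not dm.has_hazard(1, read_regs=[7], write_regs=[8])
```

here FU1 wants to read Reg 7 while FU0 still has its "Reg 7 needs WRITE"
latch raised, so FU1 is held back until FU0 completes.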

it's actually incredibly simple.

here's one for Register-to-FunctionUnit (FU being the "arbitrator" for access
to an ALU pipeline), you have to drill down a bit (through DependencyRow)
to get to the SR Latches themselves

https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/fu_reg_matrix.py;h=06380434c8d7d20828d80c3d0e020161bcb2c2e4;hb=b01f6bc0ad28cda131beae33e5fe338daaf5e9ea

> using synchronous code, 

funnily enough that's where we are right now:
https://git.libre-riscv.org/?p=nmutil.git;a=blob;f=src/nmutil/latch.py;h=7d6a1efe22c881585a626e397590337186f6ef1b;hb=HEAD#l41
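
for readers without the repository to hand, a plain-python sketch of what a
clocked (DFF-based) latch of this kind does.  it is a behavioural model only,
not the nmutil code: the class name is hypothetical and reset-priority is
assumed.

```python
class ClockedSRLatch:
    """Behavioural model of an SR latch built from a DFF.

    State only changes when tick() is called (one call per clock edge).
    Assumption: reset wins if s and r are both asserted.
    """

    def __init__(self):
        self.q = 0

    def tick(self, s, r):
        if r:
            self.q = 0      # reset has priority
        elif s:
            self.q = 1
        # neither asserted: hold the previous value
        return self.q


latch = ClockedSRLatch()
trace = [latch.tick(1, 0), latch.tick(0, 0), latch.tick(1, 1), latch.tick(0, 0)]
print(trace)  # [1, 1, 0, 0]
```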

> and
> replace them with true SR latches for synthesis. Of course, you will
> need to convince yourself these two designs are equivalent.

yes.  fortunately, coriolis2 has gate-level simulation (which needs
investigation) so we can triple-check.

> If you *do* want to use the exact same netlist, there are a few
> options. Both the new nMigen simulator and cxxrtl use an architecture
> that, in principle, supports logic feedback loops. So you can model
> the SR latch using two ordinary NAND gates. If you have an instance
> of a SRNAND cell, you can provide a simulation model for this instance
> that uses two NAND gates, and it'll work. 

with the caveat that the current SRLatch class is actually a Reset-priority
SR latch, i *believe* we already have the former (the synchronous model),
so the latter (the exact same netlist, with true SR latches) is really
what we'd like to have.

we can always flip in/out the current SR Latch for a properly asynchronous
one (exact same netlist), run some (unbelievably slow) tests, then generate
the actual netlist to be handed to yosys (and from there to coriolis2).
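
a sketch of the "two ordinary NAND gates" model, in plain python rather than
as an nmigen Instance or a cxxrtl black box.  the class name, the active-low
input names and the settle loop are assumptions for illustration.  note that
the initial state is passed in explicitly, and that evaluating the two gates
in a fixed order hides the race a real latch has when both inputs are
released together.

```python
def nand(a, b):
    return 0 if (a and b) else 1


class SRNandLatch:
    """Two cross-coupled NAND gates; inputs s_n and r_n are active-low."""

    def __init__(self, q=0):
        # assumption: a known starting state (real hardware has none)
        self.q = q
        self.qn = 1 - q

    def drive(self, s_n, r_n, max_iters=8):
        """Apply inputs and re-evaluate the feedback loop until it is stable."""
        for _ in range(max_iters):
            q = nand(s_n, self.qn)
            qn = nand(r_n, q)
            if (q, qn) == (self.q, self.qn):
                return self.q
            self.q, self.qn = q, qn
        raise RuntimeError("latch did not settle")


latch = SRNandLatch()
outputs = [
    latch.drive(0, 1),  # set
    latch.drive(1, 1),  # hold
    latch.drive(1, 0),  # reset
    latch.drive(1, 1),  # hold
]
print(outputs)  # [1, 1, 0, 0]
```

swapping a model like this in for the clocked latch keeps the surrounding
logic unchanged, which is the point of the "exact same netlist" option.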

> However, I must warn you
> that both for nMigen pysim and cxxrtl, this will come with a severe
> performance penalty of at least ~10x, possibly worse depending on your
> exact design. If you are planning to use fine SRNAND cells (one per
> bit rather than one per word), expect a further slowdown of the word
> size factor.

interesting, because a multi-word design turned out to be necessary
to get a speedup factor: some of the Matrices are so large (128 x 30 will
be the largest) that we get seconds per clock in pysim if done as
individual one-bit Cells.  with a 4 to 5 level hierarchy of Elaboratable
classes it was intolerable.

by removing one level of hierarchy, pysim ran in reasonable time.

i found this generally to be true: the more levels of hierarchy, the
slower pysim got.  and due to the MASSIVE size of our designs, we need
hierarchically-laid-out classes (well over 250 so far).

> Another issue is that neither pysim nor cxxsim provide
> any way to control the possible race conditions. That is, the SRNAND
> latch that is simulated in pysim or cxxsim will initialize to
> an indeterminate value, and chaining them together will lead to
> unpredictable results.

you'll like this: it's much "worse" than it looks :)  we actually have
a triple-connected ring of SRNAND latches, acting as a pseudo pipeline.

however, it turns out that the set-reset logic which *enables* the three
SRNANDs is very, very specifically organised such that only two of them
are ever possible to have active at any one time, and thus we create
a "revolving door" effect.

those "protections" are all synchronous.  so whilst the SRNAND cells
are asynchronous and could hypothetically end up in "unknown", in
practice this can *never* occur because the conditions which create
"unknown" are specifically avoided (and avoided using synchronous
logic).
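
a toy model of the "revolving door" idea, in plain python.  the real enable
logic is not given in this thread, so the sequencing below (set one latch,
reset the next one round the ring, every clock) is purely an assumption.
what it illustrates is the invariant: no latch ever sees set and reset
together, and no more than two of the three are ever active.

```python
class RevolvingDoor:
    """Three SR latches in a ring, sequenced by a synchronous controller."""

    def __init__(self):
        self.q = [0, 0, 0]
        self.head = 0  # index of the latch to set on the next clock

    def tick(self):
        set_idx = self.head
        reset_idx = (self.head + 1) % 3  # hypothetical rule: clear the next one
        for i in range(3):
            s, r = (i == set_idx), (i == reset_idx)
            # the synchronous "protection": S and R never both asserted
            assert not (s and r)
            if r:
                self.q[i] = 0
            elif s:
                self.q[i] = 1
        self.head = (self.head + 1) % 3
        return tuple(self.q)


door = RevolvingDoor()
states = [door.tick() for _ in range(6)]
print(states)
```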

in case you were wondering: this is a proven design.  it was used in the
original CDC 6600, and also in the AMD Opteron Series.  however, in the
CDC 6600 they used transistors (as in: *actual* three-pronged transistors,
the largest ever single order made for transistors in the world), and
in the AMD Opteron Series they of course had the financial budget and
resources of a multi-billion-dollar company and so could happily pay for
custom silicon.

so.  conclusion, after all that (apologies), is back at the top.
