[libre-riscv-dev] [OpenPOWER-HDL-Cores] microwatt tlb

Paul Mackerras paulus at ozlabs.org
Mon Mar 30 04:30:02 BST 2020

On Fri, Mar 27, 2020 at 02:15:03AM +0000, Luke Kenneth Casson Leighton wrote:
> hi this is for paul mackerras when you see it.
> great to know you're working on microwatt. been such a long time, and
> really good to encounter you over the virtual coffee chat earlier today.
> here's the link to the radix walk doc i was talking about:
> https://github.com/power-gem5/gem5/blob/gem5-experimental/src/arch/power/radix_walk_example.txt
> it is really good.  implementations also exist in the same codebase, the
> easiest way to track them down is via the commit logs in the
> gem5-experimental branch.
> for libresoc we really really need clear HDL to "track" if you know what i
> mean, because some of these algorithms are very time-consuming to implement
> from scratch.
> paul, continuing the (brief) discussion somewhat, in a related way, i
> really wanted to make you aware of the implications for microwatt of trying
> to make it multistage pipelined.
> if you look up jon dawson's IEEE754 FPU code you find that it is entirely
> FSM based.  as such it is superb for fitting in small FPGAs.
> over a 6 month period i morphed it from its humble beginnings into an
> absolute beast of a flexibly reprogrammable Object Orientated IEEE754 FP
> toolkit, capable of 16, 32 or 64 bit (and anything else you care to add),
> arbitrary pipeline depth *or* FSM style engines, SIMD capability, *dynamic*
> pipelines, the works.  Jacob helped add a fantastic hybrid fpdiv, sqrt
> *and* rsqrt engine all in the same pipeline.  it's now well north of 10,000
> lines of python code and takes a week to run all the unit tests.
> what i learned along the way was that pipelined systems, compared with FSM
> based ones, trade low gate count, simplicity and code readability for speed
> at a much higher gate count, and, unless *and even if* OO design is used,
> turn into the most awful unreadable unmaintainable total trash.
> rocketchip unfortunately, for all the compactness and other advantages, is
> particularly bad, *because* of the OO (and total lack of code comments).
> the OO hierarchy makes it flat-out impossible for anyone but its original
> creators to work with.  (something to alert the chiselwatt team on, there,
> although from what i have seen, the code comments in chiselwatt are damn
> good).
> and yet if you even remotely try anything like modern OO programming design
> strategies and techniques with these 1980s era languages (VHDL, Verilog) you
> immediately also run into serious difficulties of a different kind, which i
> won't go into, right now.
> the bottom line is: adding extra pipeline stages to microwatt not only
> complicates it and makes it unreadable and unmaintainable, it increases the
> number of LUTs required to the point where you will need to call it "watt"
> or "megawatt"
> :)
> regarding exceptions and traps: for in-order pipelined systems these are a
> damn nuisance.
> every solution to every problem in an inorder system is:
> stall.
> that's it.
> after learning about the CDC6600, which is a deeply impressive design that
> takes literally months to begin to comprehend, i dislike inorder systems
> intensely.
> the key thing that i learned from Mitch Alsup about how to handle
> exceptions is: you *must* prevent the ALU (or whatever) from "committing"
> its result until it is GUARANTEED known that the exception will not occur.
> that is the absolute core concept of exceptions (aka interrupts), which you
> MUST achieve, in some fundamental way.
> this is done in an OoO design by marking and tracking the operation - all
> the way down the pipeline - with "cancellation masks" (a global ID which
> travels *with* the partial result data, down a given pipeline)
> this is known as "shadowing", and it is not only the operation that *might*
> be cancelled that has to be "shadowmarked", it is *all following
> instructions as well* (hence the term "shadow").
> shadowed operations are real easy to cancel: pull the "cancel" wire up, and
> everything with the requisite mask ID gets a global signal "whoops we no
> longer need to pass this down the pipeline, drop it on the floor".
> with all "future" partial (and completed but noncommitted) results now
> killed, the PC can safely be redirected at the exception, interrupt, or the
> alternative branch, or whatever: these all use *exactly* the same shadowing
> technique.
> you start to see why the majority of inorder systems use "stall stall stall
> stall stall" as the be-all and end-all "onnne and only true solution",
> here? [ hallelujah, praise the stall... ] :)
> basically if i have discouraged you from pursuing the path of adding
> multiple pipeline stages to microwatt, then i have succeeded in what i set
> out to explain.  because microwatt's value is not just its small code size:
> its readability and compactness make it an ideal reference implementation
> *and*, furthermore, its complete lack of the optimisations normally seen in
> "simple" pipelined designs actually saves on gates / FPGA resources.
> the moment you add extra stages, all that goes out the window.
> an associate and contributor on our list, Samuel Falvo, taught me some
> awesome tricks.
> his CPUs are incredible, designed in the most amazing elegant and unusual
> ways.  he wrote his own PLA style "language" (written in lisp) which takes
> specifications (in lisp lists) for functionality and *generates* VHDL
> combinatorial blocks.
> he used this tool to create a 6502-like processor with virtually no
> manually written glue logic (even the decoder was written in this simple
> specification), that easily fits into an ultra low cost ICE40 yet has
> massive amounts of room to spare for peripherals.
> the key to the success of this approach was the heavy tradeoff of *not*
> using pipeline stages, but effectively doing a FSM style entire execution
> design.  actually it wasn't even a FSM, i think he used the term "PLA".  i
> am garbling it, it was a year ago, there are some keywords i have missed
> out, you get the idea though.
> what i am really saying is: please do seriously consider simply stalling on
> anything that could have an exception thrown.  arrange for the entire
> pipeline to simply grind to a halt, *globally* freezing right the way back
> to the decode phase, only proceeding when it is guaranteed known, 100%,
> that the exception will NOT occur.
> this is, believe it or not, an industry-wide "acceptable" technique in ALL
> in-order designs.
> look up "minerva rv32" on github (it is very readable code) and grep its
> source for the word "stall".
> you will find a *combinatorial* global signal propagates throughout *every*
> pipeline stage, implementing this "stop the world i want to get off!"
> industry-wide "solution".
> literally every effort that i have seen that attempts *not* to follow that
> "solution" for inorder designs involves the word "but" and it goes downhill
> from there.
> surprising, ehn? :)
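
For anyone following the "shadowing" description quoted above, here is a
rough behavioural sketch in plain Python.  It is only an illustration of
the idea as described in the quoted text, not code from microwatt, gem5 or
any real core; the names ShadowModel, Op, issue and resolve are made up for
this example.  Each operation carries the set of shadow IDs that could
still cancel it, nothing commits while it is shadowed, and a trap drops
everything carrying the offending ID.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class Op:
    name: str
    shadows: FrozenSet[int]        # shadow IDs that can still cancel this op
    own_id: Optional[int] = None   # shadow ID opened by this op, if any


class ShadowModel:
    """Hypothetical model: ops may only commit once no shadow covers them."""

    def __init__(self) -> None:
        self.in_flight: List[Op] = []   # issued, not yet committed (program order)
        self.open_ids: Set[int] = set() # shadows not yet resolved
        self.committed: List[str] = []
        self._next_id = 0

    def issue(self, name: str, may_trap: bool = False) -> Optional[int]:
        own_id = None
        if may_trap:
            own_id = self._next_id
            self._next_id += 1
            self.open_ids.add(own_id)
        # The op is marked with every open shadow, including its own, so
        # all following ops fall under the shadow as well.
        self.in_flight.append(Op(name, frozenset(self.open_ids), own_id))
        self._commit_ready()
        return own_id

    def resolve(self, shadow_id: int, trapped: bool) -> None:
        if trapped:
            # "Pull the cancel wire": drop everything carrying this ID.
            dropped = [op for op in self.in_flight if shadow_id in op.shadows]
            self.in_flight = [op for op in self.in_flight
                              if shadow_id not in op.shadows]
            for op in dropped:
                self.open_ids.discard(op.own_id)
        else:
            # Known safe: lift this shadow from every op that carried it.
            self.in_flight = [Op(op.name, op.shadows - {shadow_id}, op.own_id)
                              for op in self.in_flight]
        self.open_ids.discard(shadow_id)
        self._commit_ready()

    def _commit_ready(self) -> None:
        # Commit from the head of the queue while ops are unshadowed.
        while self.in_flight and not self.in_flight[0].shadows:
            self.committed.append(self.in_flight.pop(0).name)


if __name__ == "__main__":
    m = ShadowModel()
    m.issue("add")
    ld = m.issue("load", may_trap=True)
    m.issue("mul")                   # held back: sits in the load's shadow
    m.resolve(ld, trapped=False)     # load known safe: load and mul commit
    print(m.committed)
```

A real design would do this with mask bits travelling down the pipeline
registers rather than Python sets, but the commit rule is the same one
described in the quoted text.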

Thanks for the comments and the pointers.  We do already have stall
signals in microwatt, but the way it works is pretty simple so far.
The pipelined, in-order execution structure basically comes from
Anton's initial work, so I don't think we'll be throwing that out.
We will continue to make readability and understandability a focus,
and I for one will try to add more code comments as I work on it to
make it easier to follow.
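
To make the "global stall" idea concrete for readers of the archive, a
minimal Python sketch follows.  It is an assumption-laden toy, not
microwatt's or minerva's actual stall logic: the class InOrderPipeline and
its step method are hypothetical names, and the stall input stands in for
whatever condition says "an exception is still possible".  While stall is
asserted, every stage holds its contents and nothing new is accepted.

```python
from typing import List, Optional


class InOrderPipeline:
    """Hypothetical in-order pipeline with one global stall signal."""

    def __init__(self, depth: int = 3) -> None:
        self.stages: List[Optional[str]] = [None] * depth
        self.retired: List[str] = []

    def step(self, fetch: Optional[str], stall: bool) -> bool:
        """Advance one clock; return True if `fetch` was accepted."""
        if stall:
            # Global freeze: every stage keeps its contents this cycle.
            return False
        done = self.stages[-1]
        if done is not None:
            self.retired.append(done)
        # Shift everything one stage along and accept the new instruction.
        self.stages = [fetch] + self.stages[:-1]
        return True


if __name__ == "__main__":
    p = InOrderPipeline(depth=2)
    p.step("a", stall=False)
    p.step("b", stall=False)
    p.step("c", stall=True)    # frozen: "c" is not accepted, nothing moves
    p.step("c", stall=False)   # resumes: "a" retires, "c" enters
    print(p.stages, p.retired)
```

The caller has to re-present the same instruction until step returns True,
which is the software analogue of freezing right the way back to decode.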

