[libre-riscv-dev] store computation unit

Wed Jun 5 04:19:52 BST 2019

DISCLAIMER: Everything you are about to read is my understanding.  I
don't think so, but I could well be wrong about everything.  I am,
however, very likely to be wrong about /something/.  You have been
warned.  This is also one of the few times when I will assert my
opinions on the matter as well, as this topic has, somewhat
tangentially, been rubbing on a nerve of mine for some time, and is
one of the things which motivated me to basically fork RISC-V ever so
slightly for my Kestrel project.  I would like to take this time to
apologize in advance if I am overstepping my bounds; but, I feel
putting my thoughts on record is slightly more important than how the
core message is conveyed.  We can worry about the latter later.  :)

Also, my text markup is based on Emacs org-mode.

Remember, I'm looking at this problem through the point of view of
hopefully supporting a future community of neo-retro computer
enthusiasts interested in a new home-brew computer design that both
they and I want to support with upgrades later on (including CPU
upgrades) with minimum engineering fuss, minimum need for centralized
decision making or registries, *zero* need for special-interest and
lobby groups, and maximum plug-and-play capabilities.  We *almost* had
this with the Commodore-Amiga family of computers, so I know it is
possible to achieve with sufficient fore-thought.

On Tue, Jun 4, 2019 at 4:51 PM Luke Kenneth Casson Leighton
<lkcl at lkcl.net> wrote:
>  The recommended approach to IO Domains in the linux world is to have the
> PhysAddr in fixed ranges.

The ranges /can/ be configurable if you support the PMP CSRs.  It's
not clear to me that it is worth doing in any kind of portable way,
however.  While I'm a proponent of PMAs as a means of standardizing
the vernacular used to talk about memory attributes and their
semantics, I'm not really a fan of PMPs as a means of configuring how
the processor will interact with its surroundings.  Consider the case
where I want to upgrade from my own home-brew PSP530x0 family of
processors[1] where I hypothetically support PMP configuration
registers[2] to the finished ASIC form of libre-riscv so that I can
exploit a 20x to 100x boost in performance by clock speed alone.  I
think it's quite understandable why someone like me would be
interested in making that kind of a jump; except for one small
problem: libre-riscv in this hypothetical future does not support
PMPs, or if it does, not in the same way as my processor.  (Or, vice
versa and more realistically: my current CPU and firmware are totally
ignorant of PMPs, but libre-riscv might not be!)

There are several solutions to this problem.  Apologies in advance,
I'm going to use terms familiar to me in the context of a desktop
computer (processor cards and motherboards, etc.); however,
generalization to working with SoCs should be easy to make.

1) The cold boot/M-mode firmware is tightly bound to the processor
card, not to the motherboard (processor core, not to the SoC, resp.).
Motherboard-specific functionality appears on the motherboard at
well-defined locations, but runs in S-mode.

2) The entire firmware stack is tightly bound to the motherboard, but
is agile enough to respond to the features supported by the processor
card itself.

3) The system firmware is not tightly bound to anything (but resides
on the motherboard), but is dumber than a brick and takes whatever a
device-tree configuration node says is right.  In this case, we have
two sub-options:

    a.  The device-tree node in question is tightly bound to the
processor card, or,

    b.  The device-tree node is co-resident *on* the processor card,
but is made to appear in a standardized location (somehow) for
inspection by the system firmware post-boot.

I'm sure there are other options I haven't considered; but, this is
getting awfully complicated already, so I'll stop here.

As a result of the priv-spec's allowance for an incomplete
intersection of the set of PMP functionality between cores, the desire
to substitute between two RISC-V implementations with minimal changes
to the software is severely compromised.  I now need to build my
system firmware to support some manner of "PMP agility".  For
instance, with option 3 above, the firmware would read some kind of
standardized device tree that tells it whether or not PMPs are
supported, and to what level of compliance with the privilege
specification.  This must be done *before* I actually touch RAM,
because I don't yet know if I need to configure a PMP before RAM
becomes useful.[3]

Accesses to unsupported CSRs are intended to be treated like illegal
instructions, so I could just probe the presence of PMPs by way of
capturing that illegal instruction trap; but, it only tells me whether
(and how many) PMP registers are supported.  This would be an example
of option 2 above.  It doesn't tell me which operating modes are
actually supplied by the core I've chosen, however.  It also doesn't
tell me if the PMP "lock" feature is supported either, and *that* is
not discoverable (if you attempt to set it, it's set forever until
hard reset; so, make sure this is the /last/ thing you do!).
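
To make the probing idea concrete, here is a toy Python model (not
real firmware).  ToyCore, IllegalInstruction and count_pmp_addr_regs
are names invented for this sketch, and it assumes, as described
above, that reading an unimplemented CSR traps.  The CSR numbers are
the ones the privileged spec assigns to pmpcfg0 and pmpaddr0.

```python
# Toy model only: a real probe would run in M-mode with a trap handler.
PMPCFG0 = 0x3A0    # first PMP configuration CSR
PMPADDR0 = 0x3B0   # first PMP address CSR; up to 16 follow consecutively


class IllegalInstruction(Exception):
    """Stands in for the illegal-instruction trap."""


class ToyCore:
    """Hypothetical core that traps on any CSR it does not implement."""

    def __init__(self, implemented_csrs):
        self.csrs = {addr: 0 for addr in implemented_csrs}

    def csr_read(self, addr):
        if addr not in self.csrs:
            raise IllegalInstruction(hex(addr))
        return self.csrs[addr]


def count_pmp_addr_regs(core):
    """Count consecutive pmpaddr registers readable without a trap."""
    count = 0
    for i in range(16):
        try:
            core.csr_read(PMPADDR0 + i)
        except IllegalInstruction:
            break
        count += 1
    return count


# Two hypothetical cores: one without PMPs, one with four entries.
no_pmp = ToyCore([])
four_pmp = ToyCore([PMPCFG0] + [PMPADDR0 + i for i in range(4)])
```

Here count_pmp_addr_regs(no_pmp) gives 0 and
count_pmp_addr_regs(four_pmp) gives 4.  Note what the probe cannot
return: nothing in it reveals the supported matching modes or whether
the lock bit works.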

Option 1 seems quite appealing; here, you're bundling all machine-mode
firmware with the CPU core itself, which can take care of your PMP
settings for you, and you just write your motherboard firmware to run
in S-mode or what have you, at some standardized location.  Eeeexcept
that Ron Minnich (maintainer of coreboot, a porter of Plan 9, and a
frequent contributor to Linux, amongst so many other things) has
rock-solid reasons why a motherboard's firmware must have access to
M-mode as well.  WHOOPS!

So whether you are parsing a DT representation or you are probing your
register set to see what it can and cannot do, you /have/ to store
this information somewhere in ... RAM ... which you might not have
access to yet because the PMPs are not yet configured.

Note that we haven't even gotten to auto-configuration of the
motherboard subsystems yet.

I hope I'm not the only one who sees a problem with this design
approach.  At this point, I'm going to interject my own point of
view more directly.

My preferred approach, one which I have not proven in design yet but
which I'm working towards, is to augment the memory ports with a set
of signals which is intended to come from an /external/ (to the core
itself, but not necessarily off-chip) address decoder; something
resident on the system's equivalent of a motherboard.  I've identified
three mandatory behaviors for load/store instructions (a fourth is an
optimization):

1) This operation is addressing memory which is guaranteed to have no
side-effects, and is safe to parallelize in the usual OoO way, and
will NEVER raise any exceptions, guaranteed.  Access to ordinary
system and/or expansion RAM would be an example of this.  It MAY raise
an exception later on (e.g., in the event of a bus error reported
down-stream, like an uncorrectable ECC), but it's not likely to.  If
it does, it's a critical hardware failure, and an imprecise exception
(of the "machine check" variety) is the only correct response anyway,
so flag it as an NMI of some kind.

2) This operation is addressing memory which is guaranteed to have no
side-effects, but which MAY raise an exception down-stream as a normal
part of its behavior (e.g., a patch of *SoC-external* system RAM which
is intended to only be accessible in M-mode; for an anachronistic
example, the Amiga 1000's "Kickstart RAM" used to emulate ROMs which
didn't exist at the time of manufacture).  Such loads/stores must
shadow subsequent loads and stores until it is known that no exception
can happen.  Performance is, thus, "medium" -- not as slow as a fully
in-order access, but not as fast as one which the CPU can blithely
assume will fully lack side-effects.  This will require a supporting
protocol with external memory, of course, but I intend on
demonstrating in my own CPU design that this is not hard to do.
Exceptions raised through this mode use the usual illegal load/store
exceptions, not NMIs.

3) This operation is addressing side-effecting memory space, and thus
must be executed in-order up to and including through commit for both
reads and writes.  Exceptions raised in this mode use the usual
illegal load/store exceptions, not NMIs.

4) This operation is addressing a block of memory which doesn't
actually exist (and/or you attempted to access this block of memory
with the incorrect processor privilege; again see Amiga 1000 Kickstart
RAM example), and you must raise an illegal load/store exception
*right now*.  This is just a performance optimization for (3) above,
where you don't want to have to wait before taking an exception.
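
The four cases above can be sketched as the output of an external
address decoder.  This is only a Python model under assumptions: the
AccessClass names, the memory map and the decode function are all
hypothetical, and the model assumes the decoder is told the privilege
level of each access.

```python
from enum import Enum


class AccessClass(Enum):
    SAFE = 1        # (1) no side effects, no exceptions in normal use
    MAY_FAULT = 2   # (2) no side effects, but may fault down-stream
    IN_ORDER = 3    # (3) side-effecting, must execute in order
    FAULT_NOW = 4   # (4) nothing there, or wrong privilege


# Hypothetical memory map: (base, size, class, M-mode only?)
MEMORY_MAP = [
    (0x0000_0000, 0x0100_0000, AccessClass.SAFE, False),      # system RAM
    (0x0100_0000, 0x0004_0000, AccessClass.MAY_FAULT, True),  # Kickstart-style RAM
    (0x8000_0000, 0x0001_0000, AccessClass.IN_ORDER, False),  # I/O registers
]


def decode(addr, m_mode):
    """Classify one load/store address the way the decoder might."""
    for base, size, cls, m_only in MEMORY_MAP:
        if base <= addr < base + size:
            if m_only and not m_mode:
                # Wrong privilege: take the exception immediately.
                return AccessClass.FAULT_NOW
            return cls
    # No device claims this address.
    return AccessClass.FAULT_NOW
```

With this map, decode(0x1000, False) is SAFE, an S-mode access to
0x0100_0000 is FAULT_NOW, and the same address in M-mode is MAY_FAULT.
The core only ever sees the class, never the map itself, which is the
part that lets a different core drop into the same motherboard.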

The idea behind the PMP registers is that whatever logic is
responsible for altering the behavior of load/store instructions in
the face of different types of memory does so close to the processor
core so that it doesn't have lengthy combinatorial loops to an address
decoder to slow down the core's clock speed.  The motivation is
sound and good; the implementation, unfortunately, is both terribly
limiting and compromises cross-core software compatibility.  It also
means lots and lots of comparators in your design, especially if you
implement the *full* PMP specification.  Since I'm targeting
relatively small FPGAs, this is an expensive proposition for my needs;
that *guarantees* the need to be PMP-agile in my system firmware
design unless I want to hard-wire my system firmware to my specific
processor core design.
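
To give a feel for where the comparators come from, here is a small
Python model of PMP address matching.  It is a sketch based on my
reading of the TOR, NA4 and NAPOT encodings in the privileged spec;
napot_range and pmp_match are names made up for illustration, and
permission and lock bits are left out entirely.

```python
OFF, TOR, NA4, NAPOT = 0, 1, 2, 3   # values of the A field in pmpcfg


def napot_range(pmpaddr):
    """Decode a NAPOT pmpaddr value into (base, size) in bytes."""
    trailing_ones = 0
    while (pmpaddr >> trailing_ones) & 1:
        trailing_ones += 1
    size = 1 << (trailing_ones + 3)
    base = (pmpaddr & ~((1 << trailing_ones) - 1)) << 2
    return base, size


def pmp_match(addr, entries):
    """Return the index of the first matching entry, or None.

    entries is a list of (mode, pmpaddr) pairs, lowest-numbered first.
    """
    prev_top = 0
    for i, (mode, pmpaddr) in enumerate(entries):
        top = pmpaddr << 2          # pmpaddr holds address bits [..:2]
        if mode == TOR:
            hit = prev_top <= addr < top        # two comparisons
        elif mode == NA4:
            hit = top <= addr < top + 4
        elif mode == NAPOT:
            base, size = napot_range(pmpaddr)
            hit = base <= addr < base + size
        else:
            hit = False
        if hit:
            return i
        prev_top = top
    return None


# Hypothetical configuration: 4 KiB TOR, 4 KiB NAPOT, one NA4 word.
entries = [
    (TOR, 0x1000 >> 2),
    (NAPOT, (0x8000_0000 >> 2) | ((0x1000 >> 3) - 1)),
    (NA4, 0x9000_0000 >> 2),
]
```

The loop runs sequentially here, but hardware has to evaluate every
entry on every access in parallel, so each supported entry costs its
own range comparison.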

In conclusion, I hope I've not overstepped my bounds, but I felt
compelled to vent a little bit in the hopes that my concerns fall on
sympathetic ears.  I'd be happy to learn of alternatives to the
solutions to the problems I've examined above; or, explanations on why
they're not problems in the first place.  It is my hope that my
consternation can somehow be converted, if not now, then certainly
later on, into a design which is upwardly compatible with a minimum of
rework for the system software designer and inconvenience for the home
computer user.  In short, yes, I want to re-capture the user
experience of the Commodore 64 and the Commodore-Amiga/Atari ST/Atari
TT generation of computers.  I might not be successful; but that's my
goal.

Thank you for listening to my squabble; here's the soapbox back.  ;)

> I cannot recall immediately if this information propagates up to the ISA
> when it comes to AMO operations.

I get the impression that it is supposed to, where it is important.
The description of the FENCE instruction, for example, distinguishes
between I/O and non-I/O memory accesses, so somewhere in the processor
logic, it must have an understanding of what is an I/O reference and
what isn't.  It's just not specified *how* this happens.  It could be
hardwired into the instruction decode logic for all the ISA spec is
concerned about.
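
For reference, the I/O versus memory distinction is visible in the
FENCE encoding itself.  The Python sketch below packs the predecessor
and successor sets into an instruction word; fence_set and
encode_fence are illustrative names, and the fm, rs1 and rd fields are
simply left at zero.

```python
FENCE_OPCODE = 0x0F   # MISC-MEM major opcode, funct3 = 000

# One bit per access type in each 4-bit set:
# device input, device output, memory read, memory write.
BITS = {"i": 8, "o": 4, "r": 2, "w": 1}


def fence_set(spec):
    """Turn a string such as "rw" or "iorw" into a 4-bit mask."""
    value = 0
    for ch in spec:
        value |= BITS[ch]
    return value


def encode_fence(pred, succ):
    """Encode FENCE pred, succ (predecessor in 27:24, successor in 23:20)."""
    return (fence_set(pred) << 24) | (fence_set(succ) << 20) | FENCE_OPCODE
```

For example, hex(encode_fence("iorw", "iorw")) is 0xff0000f and
hex(encode_fence("rw", "rw")) is 0x330000f.  The encoding names the
two kinds of access, but it still says nothing about how the core
learns which addresses are I/O.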

> I believe that if the order really matters that LRSC and AMO are supposed
> to be used, and FENCE instructions used as hints that inform the execution
> about the memory order.

I think that's the case; perhaps with the caveat that on systems
without AMOs (which, if you support them, require LR/SC as well,
IIRC), FENCE becomes a compulsory method of enforcing memory access
invariants.

________

1.  Yes, it's inspired by, and somewhat of a cheeky take on,
Motorola's processor numbering system.

2.  I won't, but play along for illustration purposes, please.

3.  And now you know why I so thoroughly hate and oppose device tree
representation; the need to have to *parse* it that early in the cold
boot sequence is, in my mind, just stupid and burdensome and highly,
HIGHLY, error-prone.  Amiga and its principle of KISS absolutely did
the right thing here; but, who am I but a humble nobody who tries to
follow in the footsteps of each of the Amiga's core engineering team?
Pardon me as I need to go yell at some more clouds.  ;)

-- 
Samuel A. Falvo II