[libre-riscv-dev] Introduction and Questions

Luke Kenneth Casson Leighton lkcl at lkcl.net
Fri May 15 22:53:21 BST 2020


On Fri, May 15, 2020 at 9:50 PM Jeremy Singher <thejsingher at gmail.com> wrote:
>
> > a teeny tiny bit over what we initially planned, then :)
>
> Thanks Luke, these numbers are closer to what seems reasonable to me.
> A simple superscalar core with 2 FPU functional units should be able
> to achieve 4 GFLOPS, so a vector unit would need to go higher than
> that to be worth the extra silicon.

the hardware requirements are derived from the software ones.  we
decided though to do a hybrid design rather than a separate CPU
separate GPU approach... because... well... um... that's.... two...
processors :)

hybrid designs are extremely unusual.  there are only two that i have
heard of: the ICubeCorp IC3128, and the Broadcom VideoCore IV (based
around an ARC core.  ARC - now bought by Synopsys - is known for
providing a wealth of SIMD and other "additional" instructions,
similar to how RISC-V is now set up, except completely secretive and
under the full and total proprietary control of Synopsys).

i've looked at Vector ISAs, and SIMD ISAs: i concluded that there's a
better way to do a Vector ISA, keeping the *semantics* outlined in the
article "SIMD Considered Harmful".

> > by overloading "multi-issue".  a hardware for-loop at the instruction
> > execution phase will push multiple sequential instructions into the
> > OoO engine from a *single* instruction.
>
> I see, so a form of micro-op fission,

yes, in effect.  this reduces I-Cache pressure, which in a *hybrid*
processor is critically important.  in a separate-CPU, separate-GPU
design you have two I-Caches, running two separate executables, so you
don't _give_ a damn.

in a hybrid processor, GPU opcodes would often be 64-bit and more
usually 128-bit (FOUR banks of 32-bit VLIW instructions).  you might
think that's fine, because one 64-bit instruction may be doing 16x FP
vector operations: who _gives_ a damn.

unfortunately, when the I-Cache is shared, we do give a damn.
therefore we needed to reduce the instruction size, and so came up
with SimpleV (look it up).

> essentially you are decoding
> complex vector (or "GPU") instructions into a sequential stream of
> simple micro-ops? This seems like a pretty reasonable design point.

mmm not quite.  look up SimpleV.  it's basically a Cray-style Vector
Length (VL) loop, except that VL, when set, actually *PAUSES* the
Program Counter and issues MULTIPLE instructions - using the one at
the PC as a "template" - incrementing the operand register numbers
sequentially on each iteration of the loop, from 0 to VL-1.
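
to make that concrete, here is a minimal python sketch of the idea.
the names (Op, rd, rs1, rs2, issue) are purely illustrative - this is
*not* the actual implementation:

    # minimal sketch of the SimpleV issue-phase hardware for-loop.
    # all names here are illustrative, not the real implementation.
    from dataclasses import dataclass, replace

    @dataclass
    class Op:
        opcode: str
        rd: int    # destination register number
        rs1: int   # source register numbers
        rs2: int

    def sv_issue(template: Op, VL: int, issue):
        # the Program Counter is *paused*: the single instruction at
        # the PC acts as a template for VL element operations, the
        # operand register numbers incrementing on each iteration.
        for i in range(VL):
            issue(replace(template, rd=template.rd + i,
                                    rs1=template.rs1 + i,
                                    rs2=template.rs2 + i))

    # e.g. one "fadd f8, f16, f24" with VL=4 issues four sequential
    # scalar fadds (f8,f16,f24 ... f11,f19,f27) into the OoO engine.
    sv_issue(Op("fadd", 8, 16, 24), 4, print)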

> > the DIV pipeline will be... 8 (maybe more), MUL will be 2, ADD (etc.) will be 1.
> > * fetch: 1.
> > * decode: 1.
> > * issue and reg-read: 1 (minimum).
> > * execute: between 1 and 8 (maybe more)
> > * write: 1 (minimum)
>
> So this is just for the preliminary tapeout?

yes, in 180nm.  which can only achieve a max of around 300MHz.  if we
had a PLL.  which we do not.  because we are not signing Foundry NDAs.
and the Foundry will only give us the PLL block they normally hand
out.... under NDA.

> I would expect an OOO
> core targeting 1GHz+ to have more pipeline stages than this (10+). The
> A72, Skylake, Zen, and BOOM all seem to target 10+ stages.

yep.  and these are all (except BOOM) USD 100 million custom silicon designs.

> > basically this is why we're doing OoO because coordinating all that in
> > an in-order system would be absolute hell.
>
> I'm a bit confused now. In-order execution with variable-latency
> functional units can be achieved with a scoreboard,

which we have.  using an augmented variant of the 6600, thanks to Mitch Alsup.
https://libre-soc.org/3d_gpu/architecture/6600scoreboard/

> and the pipeline
> stages you describe reminds me of a in-order core with OOO write-back.

pipelines can still be used behind Function Unit "front-ends".  Mitch Alsup
describes them as "Concurrent Computation Units".  basically, if the pipeline
is N deep, you put N-or-greater "Reservation-Station"-style operand latches
on the front of the Function Units.  that gives you the ability to track the
pipeline results, match them *back* up with the FU index, and voila: what is
otherwise considered the exclusive domain of "in-order pipeline" terminology
now fits into OoO.
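
purely as illustration, here is a rough python model of that
arrangement (the class and method names are mine, not Mitch's; the
result is computed at issue-time and delayed, just to model latency):

    # rough model of an N-deep pipeline fronted by Function Unit
    # operand latches ("Concurrent Computation Unit" style).
    class ConcurrentComputationUnit:
        def __init__(self, depth, fn):
            self.fn = fn
            self.latches = {}           # fu_index -> operands (RS-style)
            self.pipe = [None] * depth  # each stage: (fu_index, value)

        def capture(self, fu_index, operands):
            # one operand latch per Function Unit "front-end"
            self.latches[fu_index] = operands

        def tick(self):
            # advance one stage.  the FU index travels with the data,
            # so the result can be matched *back* up with its FU.
            done = self.pipe.pop()
            if self.latches:
                fu_index, ops = self.latches.popitem()
                self.pipe.insert(0, (fu_index, self.fn(*ops)))
            else:
                self.pipe.insert(0, None)
            return done  # (fu_index, result) when one emerges, else None

    # e.g. a 3-deep multiply pipeline shared by many Function Units:
    ccu = ConcurrentComputationUnit(3, lambda a, b: a * b)
    ccu.capture(fu_index=5, operands=(6, 7))
    results = [ccu.tick() for _ in range(4)]  # results[-1] == (5, 42)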

> Is this what you are doing?

no.

> Or are you pursuing true OOO execution

yes.

> with register-renaming?

no / misleading question.

we are using the 6600 design, from 1965.  this has "implicit"
register-renaming automatically built into the design: the Function
Unit operand latches.  they are best termed "nameless registers"
rather than "register-renaming".

if you are familiar with the Tomasulo Algorithm: if you make the
number of Reservation Station Rows equal to 1, then the RS latches are
directly equivalent to Function Unit operand latches, and both are in
effect "register renaming"... only the better terminology is "nameless
registers" or "in-flight registers / results".

if you have read Patterson's book and consider it to be the last word
on scoreboards: he has completely and utterly misinformed the general
public and the academic community regarding the 6600 and scoreboards
in general.  Mitch Alsup had to write two additional chapters, as an
addendum to James Thornton's book, "Design of a Computer", in order to
correct that misinformation.

> > it's down ultimately to how wide the register paths are, between the
> > Regfile and the Function Units.  if we have multiple Register Data
> > Buses, and can implement the multi-issue part in time, it'll be coded
> > in a general-purpose enough way that we could consider 4-issue.
> > however... without the register data paths, there's no point.

>
> Fair enough. Balancing the throughput of all the components in a OOO
> core is tricky, and register read can certainly be a bottleneck.

yes.  see https://groups.google.com/d/msg/comp.arch/qeMsE7UxvlI/AmwsomDoAQAJ
i can't find the diagram i drew at the moment... regfile i think...
ah yes here you go:
https://libre-soc.org/3d_gpu/architecture/regfile/

that's a global cyclic buffer at the regfile (acts as an Operand
Forwarding Bus) and
*local* cyclic buffers at the Function Units.

> > We had some unfortunate run-ins with Berkeley and RISCV - and we decided to go with POWER because we have access to some important relations at IBM
>
> Have you looked at the open RISC-V micro-architectures though?

yes.  they are nowhere near the level of comprehensiveness needed.
plus, it would be incredibly irresponsible of us to bring to market a
USD 100 million and above *libre* design in which the modifications to
the compiler, the toolchain, everything, all need to be upstream.


> Even if
> this core runs POWER, it should be not too difficult to port design
> ideas from existing cores, to avoid reinventing the wheel everywhere.

because of the way that SimpleV works, and because we need to
drastically extend the target ISA (POWER9) with opcodes that are
everyday common-usage in GPUs (Texture Ops, SIN, COS, RGB2YUV), the
changes required are so fundamental that, whilst on the face of it the
idea sounds "perfectly reasonable", starting from someone else's
design would throw up massive barriers.

example: we have created PartitionedSignal - a class that doesn't just
"perform add operations": it takes an *additional* (dynamic)
side-argument - a mask - which *dynamically* partitions what would
otherwise be a 64-bit operation into *any* combination of 8, 16, 24,
32, 40, 48 and 56-bit operations.
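
here is a purely-illustrative pure-python model of just the *add* case
(the real PartitionedSignal is nmigen HDL: this is emphatically not
it):

    # software model of a dynamically-partitioned 64-bit add.  a 7-bit
    # mask sits at the byte boundaries: where a mask bit is 1, carry
    # propagation is blocked, splitting one 64-bit add into several
    # smaller, fully-independent adds.
    def partitioned_add(a, b, mask):
        result = 0
        carry = 0
        for byte in range(8):
            lane_a = (a >> (byte * 8)) & 0xFF
            lane_b = (b >> (byte * 8)) & 0xFF
            s = lane_a + lane_b + carry
            result |= (s & 0xFF) << (byte * 8)
            carry = s >> 8
            # partition bit set?  block the carry at this boundary.
            if byte < 7 and (mask >> byte) & 1:
                carry = 0
        return result

    # mask=0: one ordinary 64-bit add.  mask bit 3 set: the carry is
    # blocked at the 32-bit boundary, giving two independent 32-bit adds.
    assert partitioned_add(0xFFFFFFFF, 1, 0) == 0x100000000
    assert partitioned_add(0xFFFFFFFF, 1, 0b0001000) == 0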

this class, because we are using python, can be handed in as a "kls"
parameter, in an Object-Orientated fashion, to *replace* nmigen
Signal.  thus we may construct code that on first usage is a simple
scalar pipeline, then convert that EXACT SAME CODE into a dynamic SIMD
variant purely with a one-line change.
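
a sketch of that substitution pattern (the function and its innards
are made up for illustration; only Signal and PartitionedSignal are
real):

    # the pipeline stage is written once, against whatever
    # Signal-compatible class ("kls") is handed in.
    def make_adder_stage(kls, width=64):
        a = kls(width)
        b = kls(width)
        o = kls(width)
        # ... combinatorial logic written once: o driven from a + b ...
        return a, b, o

    # scalar pipeline:
    #     make_adder_stage(Signal)
    # dynamic-SIMD pipeline - the one-line change:
    #     make_adder_stage(PartitionedSignal)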

these kinds of techniques are normally the exclusive domain of
experienced Software Engineering teams.  for a *hardware* team to
deploy them?  this is unheard-of, and if attempted with even Chisel3 /
Scala, would have the team running away screaming within about 6
months.  a team asked to do that in verilog or VHDL would flat-out say
"no".  or ask for 50 extra engineers.  which, for IBM or Intel or ARM,
is not unreasonable, but for us it is.

hence we started from scratch, in order to be able to deploy these
kinds of Software Engineering techniques.

l.


