[libre-riscv-dev] Introduction and Questions

Jeremy Singher thejsingher at gmail.com
Fri May 15 23:15:30 BST 2020


> it's basically a Cray Vector Length (VL) except that VL, when set,
> actually *PAUSES* the Program Counter,
> and issues MULTIPLE instructions - using the one at the PC as a
> "template" - but incrementing the operand register numbers
> sequentially on each iteration of the loop from 0 to VL-1.

So if you decode a vadd instruction, with VL=4, and a 1-wide pipeline,
the design would stall instruction fetch, and issue 4 "scalar" add
micro-ops with incrementing register indices? That seems like it
should work, although you would lose the ability to exploit vector
chaining, since pausing the PC would prevent additional instructions
after the vector instruction from getting issued. I looked at SimpleV a
while ago, and to me it seemed like the main difference versus a
conventional vector architecture was that SimpleV merged the vector and
integer register files.
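
To check my understanding, a rough Python model of that fission step
(purely illustrative - not claiming this is your actual decode logic)
would be something like:

    # SimpleV-style "hardware for-loop": one vector op with VL=4 expands
    # into 4 scalar micro-ops with incrementing register numbers, issued
    # while the PC stays paused on the original instruction.
    def expand_vector_op(opcode, rd, rs1, rs2, vl):
        return [(opcode, rd + i, rs1 + i, rs2 + i) for i in range(vl)]

    expand_vector_op("add", 8, 16, 24, vl=4)
    # -> [('add', 8, 16, 24), ('add', 9, 17, 25),
    #     ('add', 10, 18, 26), ('add', 11, 19, 27)]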

> unfortunately, when the I-Cache is shared, we do give a damn.
> therefore we needed to reduce the instruction size, and came up with
> SimpleV (look it up).

Yep, this seems like a reasonable conclusion to draw.


> this has "implicit"
> register-renaming automatically built-in to the design.  these are the
> Function Unit operand latches.  they are best termed "nameless
> registers" rather than "register-renaming".

I think I understand what you are saying; my terminology for this is
that the core is "renaming" registers into what you call in-flight
"nameless registers". At the end of the day, the point is that
register data is stored not just in the logical register file, and that
the scheduler can aggressively execute instructions OOO when their
operands are available, instead of serializing execution in program
order.
How does the scoreboard handle a WAW hazard... two instructions
writing the same register (say, a long-latency divide to r3 followed
by a single-cycle add to r3)? Does the second instruction wait for
the first to finish?

> Mitch Alsup had to write two
> additional chapters, as an addendum to James Thornton's book, "Design
> of a Computer", in order to correct Patterson's misinformation.

Do you have a link to the addendum? I have a PDF of the book, but it
doesn't seem to include those chapters.



On Fri, May 15, 2020 at 2:54 PM Luke Kenneth Casson Leighton
<lkcl at lkcl.net> wrote:
>
> On Fri, May 15, 2020 at 9:50 PM Jeremy Singher <thejsingher at gmail.com> wrote:
> >
> > > a teeny tiny bit over what we initially planned, then :)
> >
> > Thanks Luke, these numbers are closer to what seems reasonable to me.
> > A simple superscalar core with 2 FPU functional units should be able
> > to achieve 4 GFLOPS, so a vector unit would need to go higher than
> > that to be worth the extra silicon.
>
> the hardware requirements are derived from the software ones.  we
> decided though to do a hybrid design rather than a separate CPU
> separate GPU approach... because... well... um... that's.... two...
> processors :)
>
> hybrid designs are extremely unusual.  there are only two that i have
> heard of: the ICubeCorp IC3128, and the Broadcom VideoCore IV (based
> around an ARC core; ARC - now bought by Synopsys - is known for
> providing a wealth of SIMD and other "additional" instructions,
> similar to how RISC-V is now set up, except completely secretive and
> under the full and total proprietary control of Synopsys).
>
> i've looked at Vector ISAs, and SIMD ISAs: i concluded that there's a
> better way to do a Vector ISA, keeping the *semantics* outlined in the
> article "SIMD Considered Harmful".
>
> > > by overloading "multi-issue".  a hardware for-loop at the instruction
> > > execution phase will push multiple sequential instructions into the
> > > OoO engine from a *single* instruction.
> >
> > I see, so a form of micro-op fission,
>
> yes, in effect.  this reduces I-Cache pressure, which in a *hybrid*
> processor is critically important.  in a separate-CPU, separate-GPU
> design you have two I-Caches running two separate executables, so you
> don't _give_ a damn.
>
> in a dedicated GPU, where opcodes are often 64-bit and more usually
> 128-bit (FOUR banks of 32-bit VLIW instructions), one 64-bit
> instruction may be doing 16x FP vector operations, so who _gives_ a
> damn about opcode size.
>
> unfortunately, when the I-Cache is shared, we do give a damn.
> therefore we needed to reduce the instruction size, and came up with
> SimpleV (look it up).
>
> > essentially you are decoding
> > complex vector (or "GPU") instructions into a sequential stream of
> > simple micro-ops? This seems like a pretty reasonable design point.
>
> mmm not quite.  look up SimpleV.  it's basically a Cray Vector Length
> (VL) except that VL, when set, actually *PAUSES* the Program Counter,
> and issues MULTIPLE instructions - using the one at the PC as a
> "template" - but incrementing the operand register numbers
> sequentially on each iteration of the loop from 0 to VL-1.
>
> > > the DIV pipeline will be... 8 (maybe more), MUL will be 2, ADD (etc.) will be 1.
> > > * fetch: 1.
> > > * decode: 1.
> > > * issue and reg-read: 1 (minimum).
> > > * execute: between 1 and 8 (maybe more)
> > > * write: 1 (minimum)
> >
> > So this is just for the preliminary tapeout?
>
> yes, in 180nm, which can only achieve a max of around 300MHz.  if we
> had a PLL.  which we do not.  because we are not signing Foundry NDAs.
> and the Foundry will only give us the PLL block they normally hand
> out.... under NDA.
>
> > I would expect an OOO
> > core targeting 1GHz+ to have more pipeline stages than this (10+). The
> > A72, Skylake, Zen, and BOOM all seem to target 10+ stages.
>
> yep.  and these are all (except BOOM) USD 100 million custom silicon designs.
>
> > > basically this is why we're doing OoO because coordinating all that in
> > > an in-order system would be absolute hell.
> >
> > I'm a bit confused now. In-order execution with variable-latency
> > functional units can be achieved with a scoreboard,
>
> which we have.  using an augmented variant of the 6600, thanks to Mitch Alsup.
> https://libre-soc.org/3d_gpu/architecture/6600scoreboard/
>
> > and the pipeline
> > stages you describe remind me of an in-order core with OOO write-back.
>
> pipelines can still be used behind Function Unit "front-ends".  Mitch Alsup
> describes them as "Concurrent Computation Units".  basically, if the pipeline
> is N deep, you put N-or-greater Function Unit "Reservation-Station" style
> operand latches on the front; that gives you the ability to track the pipeline
> results and match them *back* up with the FU index, and voila, you have
> turned what is otherwise considered the exclusive domain of "in-order pipeline"
> terminology into something that fits into OoO.
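>
> a rough sketch of that arrangement, in throwaway Python (names and
> structure purely illustrative, nothing like the actual HDL):
>
>     from collections import deque
>
>     N = 4                          # pipeline depth == number of FU latches
>     pipeline = deque([None] * N)   # each stage: (fu_index, operands) or None
>     fu_latches = [None] * N        # RS-style operand latches, one per FU
>
>     def issue(fu_idx, a, b):
>         fu_latches[fu_idx] = (a, b)   # operands captured at the FU front-end
>
>     def step():
>         # feed one ready FU into the pipeline, tagged with its own index
>         entry = None
>         for i, latch in enumerate(fu_latches):
>             if latch is not None:
>                 entry, fu_latches[i] = (i, latch), None
>                 break
>         done = pipeline.popleft()
>         pipeline.append(entry)
>         if done is not None:
>             i, (a, b) = done
>             return i, a + b        # result matched *back* to FU index i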
>
> > Is this what you are doing?
>
> no.
>
> > Or are you pursuing true OOO execution
>
> yes.
>
> > with register-renaming?
>
> no / misleading question.
>
> we are using the 6600 design, from 1965.  this has "implicit"
> register-renaming automatically built-in to the design.  these are the
> Function Unit operand latches.  they are best termed "nameless
> registers" rather than "register-renaming".
>
> if you are familiar with the Tomasulo Algorithm: if you make the
> number of Reservation Station Rows equal to 1, then the RS latches are
> directly equivalent to Function Unit operand latches, and both are in
> effect "register renaming"... only the better terminology is "nameless
> registers" or "in-flight registers / results".
>
> if you have read Patterson's book and consider it to be the last word
> on scoreboards: he has completely and utterly misinformed the general
> public and the academic community regarding the 6600 and scoreboards
> in general.  Mitch Alsup had to write two additional chapters, as an
> addendum to James Thornton's book, "Design of a Computer", in order to
> correct Patterson's misinformation.
>
> > > it's down ultimately to how wide the register paths are, between the
> > > Regfile and the Function Units.  if we have multiple Register Data
> > > Buses, and can implement the multi-issue part in time, it'll be coded
> > > in a general-purpose enough way that we could consider 4-issue.
> > > however... without the register data paths, there's no point.
>
> >
> > Fair enough. Balancing the throughput of all the components in a OOO
> > core is tricky, and register read can certainly be a bottleneck.
>
> yes.  see https://groups.google.com/d/msg/comp.arch/qeMsE7UxvlI/AmwsomDoAQAJ
> i can't find the diagram i drew at the moment... regfile i think...
> ah yes here you go:
> https://libre-soc.org/3d_gpu/architecture/regfile/
>
> that's a global cyclic buffer at the regfile (acts as an Operand
> Forwarding Bus) and
> *local* cyclic buffers at the Function Units.
>
> > > We had some unfortunate run-ins with Berkeley and RISC-V - and we decided to go with POWER because we have access to some important relations at IBM
> >
> > Have you looked at the open RISC-V micro-architectures though?
>
> yes.  they are nowhere near the level of comprehensiveness needed.
> plus, it would be incredibly irresponsible of us to bring to market a
> USD 100 million and above *libre* design where the modifications to
> the compiler, toolchain, everything, all needed to be upstream.
>
>
> > Even if
> > this core runs POWER, it should be not too difficult to port design
> > ideas from existing cores, to avoid reinventing the wheel everywhere.
>
> because of the way that SimpleV works, and because we need to
> drastically extend the target ISA (POWER9) with opcodes that are
> everyday common-usage in GPUs (Texture ops, SIN, COS, RGB2YUV), the
> changes required are so fundamental that, whilst on the face of it
> the idea sounds "perfectly reasonable", starting from someone else's
> design would produce massive barriers.
>
> example: we have created PartitionedSignal - a class that does not
> just "perform add operations": it takes an *additional* (dynamic)
> side-argument - a mask - which *dynamically* partitions what would
> otherwise be a 64-bit operation into *any* combination of 8, 16, 24,
> 32, 40, 48 and 56-bit operations.
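>
> as a rough software model of the idea (plain Python, not the actual
> PartitionedSignal implementation):
>
>     def partitioned_add(a, b, mask):
>         # 64-bit add with a 7-bit partition mask: setting bit i breaks
>         # the carry chain between byte i and byte i+1, so the same
>         # adder does e.g. 8x 8-bit, 2x 32-bit or 1x 64-bit operations.
>         result, carry = 0, 0
>         for i in range(8):                   # one byte lane per step
>             s = ((a >> (8*i)) & 0xFF) + ((b >> (8*i)) & 0xFF) + carry
>             result |= (s & 0xFF) << (8*i)
>             carry = (s >> 8) & 1
>             if i < 7 and (mask >> i) & 1:    # partition "wall"
>                 carry = 0
>         return result
>
>     partitioned_add(0xFF, 0x01, 0b0000000)   # one 64-bit add -> 0x100
>     partitioned_add(0xFF, 0x01, 0b1111111)   # eight 8-bit adds -> 0x00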
>
> this class, because we are using Python, can be handed in as a kls in
> an Object-Oriented fashion, to *replace* nmigen Signal, and thus we
> may construct code that on first usage is a simple pipeline; we may
> then convert that EXACT SAME CODE into a dynamic SIMD variant purely
> with a one-line change.
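>
> the kls pattern, roughly (make_adder here is a hypothetical example,
> not our actual pipeline API):
>
>     from nmigen import Module, Signal
>
>     def make_adder(width, kls=Signal):
>         m = Module()
>         a, b, o = kls(width), kls(width), kls(width)
>         m.d.comb += o.eq(a + b)              # same code, either class
>         return m
>
>     scalar = make_adder(64)                  # plain 64-bit pipeline stage
>     # simd = make_adder(64, kls=PartitionedSignal)  # the one-line change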
>
> these kinds of techniques are normally the exclusive domain of
> experienced Software Engineering teams.  for a *hardware* team to
> deploy them?  this is unheard-of, and if attempted with even Chisel3 /
> Scala, would have the team running away screaming within about 6
> months.  a team that was asked to do that in Verilog or VHDL would
> flat-out say "no".  or ask for 50 extra engineers.  which, for IBM or
> Intel or ARM, is not unreasonable, but for us it is.
>
> hence we started from scratch, in order to be able to deploy these
> kinds of Software Engineering techniques.
>
> l.
>
> _______________________________________________
> libre-riscv-dev mailing list
> libre-riscv-dev at lists.libre-riscv.org
> http://lists.libre-riscv.org/mailman/listinfo/libre-riscv-dev


