[libre-riscv-dev] Introduction and Questions

Sat May 16 01:40:56 BST 2020

On Sat, May 16, 2020 at 12:49 AM Jeremy Singher <thejsingher at gmail.com> wrote:
>
> > not micro-ops, *actual* scalar add ops.
>
> I think we just have different terminology for this. To me, a micro-op
> is the smallest unit of compute that is passed through a pipeline, and
> a micro-op could indicate that the core should perform a scalar add
> op. I'm happy with your explanation, though, thank you.

:)  allow me to explain for the benefit of future readers.  these discussions
are archived and thus the list is an informational resource for others.

we do intend to do micro-ops.  this is because we will be sharing the MUL and
DIV *integer* pipeline(s) with the FP MUL and FP DIV pipelines.

apart from that, there is a direct, one-to-one correspondence
between "POWER9 instruction" and what is termed "micro-ops".

that said, microwatt did some really clever munging, which we followed exactly.
so i saaay "direct", but... technically... now that i think about it,
you're correct.

an example:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/logical/main_stage.py;hb=HEAD

note there, OP_XOR, OP_OR, OP_AND, there's no "OP_NOR" or "OP_NAND"?
this is because the tables (see https://libre-soc.org/openpower/isatables/ which
we extracted from microwatt decode1.vhdl) have *pre-analysed* the operations
and mapped them down to:

* a pre-processing stage (bit-inversion of RA, selection of Carry=1/0/Carry_Reg)
* the main processing stage
* a post-processing stage (bit-inversion of output, setting of
  Condition Register CR0)

we *believe* this is a reflection of the intent of the original
designers of POWER, and the people who designed microwatt have been
tracking POWER for 20 years.  many of them work for IBM Research.

so these are kiiinda micro-op'd... except actually what we did was: split those
3 phases into 3 separate combinatorial blocks which are re-used in multiple
pipeline stages.
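
to illustrate the three phases, here is a minimal python sketch.  it is
*not* the libre-soc or microwatt code: the table, the flag names
(invert_a, invert_out), the "and_inv_a" entry and the execute() helper are
all hypothetical, and CR0 / carry handling is left out.  it only shows how
nand, nor and eqv can fall out of OP_AND, OP_OR and OP_XOR plus an inversion
flag.

```python
MASK = (1 << 64) - 1  # model 64-bit registers

# hypothetical decode table (an assumption, not the real isatables):
# mnemonic -> (main-stage op, invert input a, invert output)
TABLE = {
    "and":       ("OP_AND", False, False),
    "nand":      ("OP_AND", False, True),
    "or":        ("OP_OR",  False, False),
    "nor":       ("OP_OR",  False, True),
    "xor":       ("OP_XOR", False, False),
    "eqv":       ("OP_XOR", False, True),
    "and_inv_a": ("OP_AND", True,  False),  # made-up name, shows input inversion
}

# the main stage only ever sees three operations
MAIN = {
    "OP_AND": lambda a, b: a & b,
    "OP_OR":  lambda a, b: a | b,
    "OP_XOR": lambda a, b: a ^ b,
}

def execute(mnemonic, a, b):
    op, invert_a, invert_out = TABLE[mnemonic]
    if invert_a:                      # pre-processing stage
        a = ~a & MASK
    result = MAIN[op](a, b) & MASK    # main processing stage
    if invert_out:                    # post-processing stage
        result = ~result & MASK
    return result
```

so execute("nand", a, b) goes through exactly the same main stage as
execute("and", a, b): only the post-processing flag differs.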

> > it's basically c macros... applied to hardware.
>
> Good analogy, thanks. The one concern I have is that this "vector op
> fission" approach will pump the pipeline with tons and tons of
> scalar-ops,

i know, right? :)

> while a traditional vector execution engine would not
> deconstruct the vector instruction into many scalar instructions.

yes.  the only real place where there is cause for concern is on losing the
element-stride information from sequential LD/ST operations: see below.

> Does the vector pipeline use ALL the same execution resources as the
> scalar pipeline? The same issue queues, functional units, read ports,
> writeback paths?

yyyup :)

the only "problem" comes in the address-matching when looking to
merge a batch of LDs (or STs) into a single L1 cache-line read.
if you lose the fact that the addresses were originally element-strided
and thus sequential, you get hit by a massive power surge of over
1000 XOR gates to get that information back, if you have 8 LD/ST
CompUnits.

therefore the intention is to "mark" addresses as being sequential, if
they were auto-generated by an element-strided Vector-LD/ST, and
consequently, very few bits need to be examined to detect if the
address rolls over to a new L1 cache-line.
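
a rough python sketch of the difference, under stated assumptions: a
64-byte L1 cache line, element widths no larger than a line, and made-up
helper names (same_line_generic, crosses_line_sequential, group_by_line).
it models the idea only, not the actual LD/ST hardware.

```python
LINE_BYTES = 64      # assumed L1 cache-line size
OFFSET_BITS = 6      # log2(LINE_BYTES)

def same_line_generic(a, b):
    # no knowledge of how a and b were generated: every address bit
    # above the line offset has to be compared (one XOR per bit)
    return ((a ^ b) >> OFFSET_BITS) == 0

def crosses_line_sequential(prev_addr, elwidth):
    # addresses marked as sequential: the next one is prev_addr + elwidth,
    # so only the low OFFSET_BITS of prev_addr need to be examined
    return (prev_addr & (LINE_BYTES - 1)) + elwidth >= LINE_BYTES

def group_by_line(base, elwidth, count):
    # split an element-strided run of addresses into per-cache-line batches
    groups = [[base]]
    addr = base
    for _ in range(count - 1):
        nxt = addr + elwidth
        if crosses_line_sequential(addr, elwidth):
            groups.append([nxt])
        else:
            groups[-1].append(nxt)
        addr = nxt
    return groups
```

for example, group_by_line(48, 8, 4) puts addresses 48 and 56 in one batch
and 64 and 72 in the next, without ever comparing the upper address bits.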

this is *literally* the only "unplanned" design optimisation that needs
to be added at the hardware level, just because of the decision to
jam as many scalar (element) operations into

> > > How does the scoreboard handle a WAW hazard... two instructions
> > > writing the same register? Does the second instruction wait for the
> > > first to finish?
>
> > sigh over a year ago i understood this enough to be able to answer.
> > what we will have to do is simply ignore the optimisation opportunity.
>
> Hmmm, ignoring WAW will significantly limit the ability of the core to
> exploit ILP, which is one of the fundamental reasons for going OOO.

we have a hell of a lot to get done in about 14 weeks, which is the
cutoff point at which we have to begin the actual layout.  our first primary
goal is to meet the Oct 2020 tapeout with "at least something functional".

after that, we can return to examining optimisations.  i have a way to
spot WAW, as well as a chain-elimination of all ongoing cascading
dependencies - it will just take me about an entire month to remember
it, and 3 months to implement it, and we don't have time to do that...
right now, this very minute.

for a 50 MHz proof-of-concept ASIC, i'm not going to worry about it.

> > no it does not.  you will need to state that you are happy to give
> > credit if you use the book chapters, and that if you pass them on you
> > will likewise ask the recipient to guarantee that they also will give
> > credit to Mitch Alsup.
>
> Of course, I'm glad to have the conditions of use AOT.

appreciated.  they are... absolutely fascinating, and extremely
informative, as well as representing an advancement and significant
simplification on how to do OoO multi-issue designs.

it does have to be said that many Architects _have_ rediscovered the knowledge
from the original 6600, however the majority of publicly-available designs
use *binary* Q-Tables, and it is extremely difficult to do multi-issue on top
of binary-encoded Q-Tables.

one of the (many) insights that Mitch provided was that if you convert
the Q-Tables to a *unary* array, you can perform transitive RaW/WaR/WaW
cascade bit-setting of the register vector numbers across multiple
instructions in any given multi-issue batch, and thus, with simple
single-bit OR and AND gates, throw multiple instructions into the
Dependency Matrices and have their order still be correctly preserved
in the Dependency Matrices DAGs.
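
a very simplified python sketch of why unary encoding helps.  this is an
illustration only, not the scheme from Mitch's chapters and not the
libre-soc Dependency Matrices: the unary() and batch_hazards() helpers and
the (dest_regs, src_regs) batch format are assumptions made up for the
example.  each register number becomes one bit, so hazards against earlier
instructions in the same batch reduce to AND, and accumulating the batch
state reduces to OR.

```python
def unary(regs):
    # one bit per register number: register 3 -> 0b1000
    mask = 0
    for r in regs:
        mask |= 1 << r
    return mask

def batch_hazards(batch):
    """batch: list of (dest_regs, src_regs) tuples in program order.

    returns, per instruction, the unary vectors of registers that have a
    RaW, WaR or WaW hazard against earlier instructions in the batch.
    """
    pending_writes = 0   # OR of all earlier destination vectors
    pending_reads = 0    # OR of all earlier source vectors
    out = []
    for dest, srcs in batch:
        d, s = unary(dest), unary(srcs)
        out.append({
            "raw": s & pending_writes,   # reads something written earlier
            "war": d & pending_reads,    # overwrites something read earlier
            "waw": d & pending_writes,   # overwrites something written earlier
        })
        pending_writes |= d              # cascade into later instructions
        pending_reads |= s
    return out
```

with a binary-encoded register number per table entry there is no such
bitwise shortcut: each comparison needs a full decode or equality check.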

in addition to that, the chapters describe how to implement Precise Exceptions
on top of a 6600 design.  it's real simple.  i leave it to you to read up on.

the only thing you need to be aware of is that the design concept described in
Chapter 11 is faulty (the one that removes the FU-FU Matrix and replaces it
with a big NOR gate).  it unfortunately took Mitch and me about 3-4 months of
discussion (and implementation) to detect this error.

the description in chapter 10 however is perfectly functional.

*be aware* also that the gate-level diagrams for the FU-Regs, FU-FU
Matrices, and for the Computation Units *require* that certain operations
take place on the RISING edge of CLK, whilst others take place on the
FALLING edge of CLK.

if this is not borne in mind, the diagrams look faulty and will not work if
implemented as-is on a "normal" (rising edge) modern HDL clock design.
i converted the code in libre-soc to work *solely* on the RISING edge.

> Its incredibly
> annoying when one is provided some information, only for the provider
> of the info to come back months later and tell you "Remember that
> document I sent you, yeah, make sure you never reference it in any
> published work ever".

:)

l.