[libre-riscv-dev] [Bug 132] SIMD-like nmigen signal for partitioning

bugzilla-daemon at libre-riscv.org bugzilla-daemon at libre-riscv.org
Thu Aug 15 08:17:10 BST 2019


--- Comment #24 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #22)
> > 
> > If the index cannot be cleared because there are three extra clock delays
> > until it clears out the end of the pipeline, that is three Reservation
> > Stations that cannot be used.
> The index can be cleared right away and the Reservation Stations reused. All
> that is needed is to have a muxid of 0 (or some other appropriate value or a
> muxid_valid signal) mean that there is no instruction there so the output
> should be ignored.

The unary version of the mask is exactly that "muxid_valid" signal, and no, it
cannot be used in the way that you envisage, because the muxid and the masks
travel in sync with the data.

They do not "emerge" from the pipeline the instant that the [global] stop mask
signal is raised.
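To see why, here is a minimal plain-Python sketch (not the actual nmigen code; the class and entry layout are made up for illustration) of mask bits travelling in lock-step with the data:

```python
class PipeSim:
    """Toy 3-deep pipeline: each slot holds (data, unary mask bit)."""
    def __init__(self, depth=3):
        self.slots = [None] * depth  # None = empty slot

    def clock(self, new_entry, stop_mask=0):
        """Advance one cycle; stop_mask clears matching in-flight masks."""
        out = self.slots[-1]
        # shift every entry one stage down the pipe
        self.slots = [new_entry] + self.slots[:-1]
        # cancellation clears the mask bit of in-flight entries, but the
        # data itself still takes the remaining cycles to emerge
        self.slots = [None if e is None else (e[0], e[1] & ~stop_mask)
                      for e in self.slots]
        return out

pipe = PipeSim()
pipe.clock(("op-A", 0b001))        # issue op-A with unary muxid 0b001
pipe.clock(("op-B", 0b010))
pipe.clock(None, stop_mask=0b001)  # cancel op-A while still in flight
result = pipe.clock(None)          # op-A's data still emerges, mask cleared
```

Note that op-A's data still takes the remaining cycles to drain out of the end: only its mask bit is zeroed, which is exactly why the entry cannot be treated as "gone" the moment the stop mask is raised.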

What would be needed instead would be to use the scheme deployed by IIT Madras.
They had a muxid that was HIGHER than the number of RSs, and did "counting".
Each issued operation got a unique index.

The maximum count is twice the length of the pipeline and is a rolling count.

By having a count that is twice the pipeline depth it becomes possible to
distinguish cancelled operations from uncancelled ones.
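A plain-Python sketch of that rolling count (the exact comparison IIT Madras used is not given here; this is one plausible encoding):

```python
DEPTH = 4           # hypothetical pipeline depth
MODULO = 2 * DEPTH  # count range: twice the pipeline depth, rolling

def is_live(muxid, cancel_point):
    """True if muxid was issued at or after cancel_point.

    At most DEPTH ops are in flight at once, so a modulo-2*DEPTH
    distance below DEPTH unambiguously means "newer"; a distance of
    DEPTH or more can only belong to an older (cancelled) operation.
    """
    return (muxid - cancel_point) % MODULO < DEPTH

# ids wrap around: 5 was issued before cancel point 6; 0 and 1 wrapped
# around after it and therefore survive
survivors = [i for i in (5, 6, 7, 0, 1) if is_live(i, 6)]
```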

The problem is: it's a *binary* system. Adding multi issue on top of a *binary*
counter is hell.

We are critically relying on multi issue capability.

Multi issue on a unary system is, believe it or not, trivial.

Mitch explained that all you do is transitively accumulate read and write
dependency hazards - multiple bits in the unary masks/regmaps - and the DMs
take care of the rest.
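In rough plain-Python terms (a loose paraphrase of the idea, not Mitch's actual formulation): with unary regmaps, each register is one bit, so accumulating dependency hazards across issued ops is just OR-ing bitmasks together.

```python
def accumulate_hazards(ops):
    """ops: list of (read_mask, write_mask) unary regmaps, in issue order.

    Returns each op's accumulated hazard mask: every register that any
    *earlier* op reads or writes (the transitively-accumulated
    dependency set).  No comparators, no binary decode: just OR.
    """
    hazards, acc = [], 0
    for rd, wr in ops:
        hazards.append(acc)   # this op must wait on everything so far
        acc |= rd | wr        # transitively accumulate the hazards
    return hazards

# three ops: op0 writes r1; op1 reads r1, writes r2; op2 reads r2
h = accumulate_hazards([(0b000, 0b010), (0b010, 0b100), (0b100, 0b000)])
```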

Incredibly, that's *it*!  We get to take on Intel and ARM, thanks to Mitch's
insight.

If we deploy binary counters, we literally destroy the underpinnings of the
entire architecture.

> Since the Reservation Stations can be reused right away,

No, they can't. Not without having the cancellation logic and the global stop
mask in place.

I did think of a solution.

The Pipeline API was designed to be sync *and* comb based.

The Stage API is DEFINITELY intended to be combinatorial based ONLY.

The Pipeline API was NEVER designed to allow Stages to be sync based, and I
strongly disagree that it should be extended to permit such.

That said, there is a way it can be done. Bear with me, because whilst it can
be done, there is still no point doing it.

The way it can be done without completely trashing the Pipeline API and the
entire design strategy behind it is to create "fake" data that doesn't actually
get set.

Pipeline Stage 1 (a ControlBase derivative) would be given an object that we
KNEW in advance was hardwired to pass its data to Pipeline Stage 3.

The stage instance conforms to the Stage API, its ospec however returns an
EMPTY OBJECT that has an eq function that does ABSOLUTELY NOTHING.

You can probably see where that is going.

So whilst the ControlBase instances (MaskCancellables) handle the control
logic, the data goes through the pipeline and emerges back at the 3rd stage.
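As a plain-Python sketch of that shape (illustrative names only, not the real nmigen classes):

```python
class EmptyData:
    """ospec result that carries nothing: eq() does absolutely nothing."""
    def eq(self, _other):
        return []   # no assignments generated, ever

class FakeStage:
    """Conforms to the Stage API shape, but no data is actually set:
    the real data is hardwired straight past, to a later stage."""
    def ispec(self):
        return EmptyData()
    def ospec(self):
        return EmptyData()
    def process(self, i):
        return self.ospec()   # output is the empty, never-assigned object

stage = FakeStage()
out = stage.process(stage.ispec())
assignments = out.eq("anything at all")   # nothing happens
```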

There are several problems with this approach.

(1) Deploying the FIFO based Pipeline handlers is out of the question.

(2) With the data going through an "unmanaged" line, anything that requires
stalling or other delay mechanisms (ready being DEASSERTED) is fundamentally
broken.

(3) early-in and early-out are also eliminated from consideration.

To "fix" these problems and regain the functionality, the ENTIRE PIPELINE API

A duplicated early-in and early-out system must be added to the multiplier
code.

A duplicated FIFO system must be added to the multiplier code.

A ready/valid signalling system must be added to the multiplier code.

The simplest and quickest way to achieve that: conform to the *already written*
Pipeline and Stage APIs.

> the only part that
> is underutilized is the canceled pipeline slots, which is unavoidable, since
> those slots can't be reused due to instructions needing to go through the
> pipeline stages in sequence.
> > The multiply code needs the same code structure as div_core.
> I agree, however, I think that the div_core is unnecessarily complicated by
> all the extra wrapping for each stage and that it should be more like the
> multiplier and have the data registers internally.

No. The suggestion is very alarming, for several reasons.

Firstly, the graphviz images are *really* clean, right now. I highly recommend
taking a look at eg the fprsqrt16 ILANG, walking through from "show top", then
alu, then pipediv3.

At pipediv3 you see the combinatorial chain of divpipecorecalculate5 (worth
also looking at), divpipecorefinal, and my ineptly named "div1" which I will
fix some time.

Above that you have a suite of "IO Management" signals that handle ready/valid
and mask/stop.

Those "handle" the registers and in NO WAY interact with the actual data.

(1) When we get to doing the INT/FP thing, div1 will be on a bypass path,
involving extra multiplexing.

(2) carrying on from the above, div1 cleanly handles the INT to FP conversion.
div0 likewise.

These modules are NOT superfluous to requirements.

They provide CLEAN and SIMPLE subdivision that makes the design easy and clear
to understand.

(3) The subdivision into simple modules has a knock-on simplification effect on
the layout / routing.

With wires being specifically routed through blocks that handle them, the job
of doing the actual layout becomes much easier.

Running a global autorouter requires insane computational resources and
produces a hellish rat's nest that is indistinguishable from TV "snow".

(4) Major structural redesigns, whilst they are something that is going to be
needed, *really* should not be attempted unless absolutely necessary.

"Plan for 3 revisions because that's what you will be doing anyway".

1st version: you don't know what you want, and you don't know how to do it.

2nd version: by previously implementing something that *fails* to have the
features you want, you now implement what you *do* want, but still have no idea
how to do it.

3rd version: you already knew what you wanted, but by failing to know how to do
it in the 2nd version, you now *know* how to do it better, so now you do it.

Agile is supposed to blur those lines.

I am barely keeping track, and, more than that, doing major redesigns - moving
to a 2nd or 3rd revision - is going to cause us to fail to meet the milestones
within the budget that we have been given.

(5) We spent 3 months on discussion of the Pipeline API: it is now just about
within the threshold of understandability. Adding mask/stop is making a huge
mess of the code already, and I *REALLY* do not want to complicate it further
with extra "features" in what is a 1st revision concept.

*next* funding round.
Not this one.

> To allow generic registers, the constructor could have a function passed in
> that creates a register for a particular stage. That would allow stuff like
> passing in a function that creates registers for every other stage (to
> handle 2 computational stages per pipeline stage) or handles creating the
> muxes and registers needed for inserting registers only when the processor
> is in high-speed mode (as discussed earlier).

I loved that idea, however liking it and implementing it are two radically
different things.

It needs to go on the "enhancements" list, shelved for another funding round.

Unfortunately.  That said, it *might* be possible to use the Pipeline/Stage API
as-is, with very little in the way of modification.

I just do not think it is a wise thing to explore _right now_ when we have
limited time, people and funds.
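For the record, the rough shape of the idea (a hypothetical plain-Python sketch, nothing like a committed design) is that the caller, not the pipeline, decides per stage boundary whether a register goes in:

```python
def pipeline_latency(n_stages, register_factory):
    """Cycles through the pipe: one per stage boundary for which the
    caller-supplied factory chooses to create a register."""
    return sum(1 for i in range(n_stages)
               if register_factory(i) is not None)

def every_other(i):
    """Register only odd boundaries: 2 computational stages per
    pipeline stage, halving the register count."""
    return "reg" if i % 2 == 1 else None

latency = pipeline_latency(8, every_other)
```

The same factory hook could, in principle, return the mux-plus-register combination for the high-speed-mode idea discussed earlier.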

> The whole idea is that the computation code should be simple and easy to
> read without all the control signals being required to be in the same Module.

Yes! This is what the Pipeline API and Stage API do! *Please*, look at the
graphviz diagram for divpipecalculate5 and you will see that it is precisely
and exactly that!

> That might also have benefits if the synthesizer lays out logic according to
> which module it's in, allowing all the control pipeline to be moved away
> from the data pipeline when the data pipeline doesn't need to know what the
> control pipeline is doing at every stage.

That is *precisely* and *exactly* what we have *right now* and the multiply
unit *does not conform to it*.

> Note that the multiplier won't specify how many pipeline stages it can take,
> since that number can vary anywhere from 1 (entirely combinatorial) to about
> 18 and it is totally up to the instantiating module.

We cannot do early-in/early-out, we cannot add stalls (for in-order
implementors); I even had an FSM system working at one point, *all using the
Stage API and reusing the exact same combinatorial modules*.

All of those options and many more are DESTROYED by not having the multiplier
code conform to the Stage API.
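That reuse is easy to see in miniature (a toy plain-Python analogy, not the actual code): the same combinatorial step function can be driven by an unrolled pipeline *or* by an FSM clocking it once per cycle.

```python
def add_step(acc, operand):
    """One combinatorial step, written exactly once."""
    return acc + operand

def pipelined_sum(vals):
    """Pipeline-style: the step is unrolled, one instance per stage."""
    acc = 0
    for v in vals:
        acc = add_step(acc, v)
    return acc

class FSMSum:
    """FSM-style: one step per clock, driving the *same* function."""
    def __init__(self, vals):
        self.vals, self.acc, self.i = vals, 0, 0
    def clock(self):
        if self.i < len(self.vals):
            self.acc = add_step(self.acc, self.vals[self.i])
            self.i += 1
        return self.i == len(self.vals)   # done?

fsm = FSMSum([1, 2, 3])
while not fsm.clock():
    pass
# both strategies compute the same result from the same step function
```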

Please understand that it is critical that the code conform to the Stage API,
and that means that the use of sync must be prohibited, with sync *only*
permitted to be deployed by a class that derives from ControlBase.

Unit tests doing sync is absolutely fine. I noticed in test_core.py that the
code you put together is a near-duplication of ControlBase.connect, and that's
absolutely fine.

However for production purposes, the consequences of not conforming to the
Pipeline API are really severely limiting, cause far more complications than
appear at first glance, and could jeopardise our chances of success.

I've tried to outline as many of those things as I can, above. But please,
*look* at the graphviz diagrams, it is critically important that you understand
how the PipelineAPI works.

I tried to encourage you to do the FPDIV code so that you would begin to
understand how it works: you did not take that opportunity, and so do not yet
fully understand it or accept it.  This is causing problems which we do not
need.
