[libre-riscv-dev] building a simple barrel processor

Luke Kenneth Casson Leighton lkcl at lkcl.net
Fri Mar 8 10:22:29 GMT 2019


On Fri, Mar 8, 2019 at 9:03 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
> >  we'd therefore need to completely change the design strategy to a
> > dual (split) CPU + GPU,
>
> add forwarding and skip idle harts (defined as harts executing wfi), could
> have the low registers have more ports or maybe a 4-8 reg-per-hart 3r1w
> cache.
>
> could alternatively have only first hart in each core have a fast mode,
> linux can handle that thanks to ARM's bigLITTLE support in the scheduler
> (as of 5.0).

 it's getting complicated, already, isn't it?  and, whatever it is: at
its core, the design is single-issue in-order pipelining, which is
already a red flag.  on top of the already-escalating complexity, we
have to fit SV (as a non-SIMD architecture), *and* fit subdivisions of
the register file and the ALUs down to the 8-bit variable-length
vectorisation level.

 i've spent something like 5 months of research and planning on the
multi-issue OoO design, including weeks of research into a strategy
that will allow us to handle instruction issue down to the 8-bit
vector level, *without* interfering with the use of the exact same
register file for 16, 32 and 64-bit: they co-exist.  that's incredibly
unusual.

in-order pipelines with SIMD-like underlying instructions were already
rejected at a very early part of the decision tree.

virtually *none* of the planning and research into the multi-issue OoO
design will transfer over to a single-issue in-order pipeline design.

if the barrel processor had any significant advantages (aside from the
uniformity) i'd be jumping at it with enthusiasm.  if we had known of
its existence six months ago, i would have welcomed a comprehensive
and full analysis.

as things stand, however, i am not seeing any significant advantages:
i am instead seeing primarily disadvantages, chief among them having
to abandon literally months of design and research effort on a much
more flexible, comprehensive, and expandable design, and start almost
entirely from scratch.

examples of what constitutes a better design include the Q-Table
"History" addition, which is an innovation even above and beyond what
Intel, ARM and AMD have ever come up with.  the Q-Table "History"
allows precise nameless register renaming, where the removal of
register names provides an opportunity to skip register writes
entirely [operand forwarding on steroids].
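
to make the "skip register writes" part concrete, here is a deliberately
over-simplified toy model (python) of the *general principle* only: if a
later in-flight instruction overwrites the same destination register, and
every intermediate reader can be fed by operand forwarding, the earlier
result never needs to touch the register file at all.  to be clear, this
is my own illustration, not the actual Q-Table / scoreboard "History"
logic, and it deliberately ignores precise-exception recovery (which is
exactly what the "History" addition takes care of).

# toy model of register-write elision via operand forwarding.
# NOT the Q-Table / "History" mechanism itself: just the principle.
from dataclasses import dataclass

@dataclass
class InFlightOp:
    fu: str       # function unit holding the result (its "name")
    dest: int     # architectural destination register
    srcs: tuple   # architectural source registers

def elidable_writes(window):
    """function units whose register-file write can be skipped: a
    later in-flight op overwrites the same destination, so any
    intermediate reader takes the value by operand forwarding and
    the register file never sees it.  (precise-exception recovery
    is deliberately not modelled here.)"""
    skip = set()
    for i, op in enumerate(window):
        if any(later.dest == op.dest for later in window[i + 1:]):
            skip.add(op.fu)
    return skip

window = [
    InFlightOp(fu="MUL0", dest=3, srcs=(1, 2)),
    InFlightOp(fu="ADD0", dest=5, srcs=(3, 4)),  # r3 via forwarding
    InFlightOp(fu="ADD1", dest=3, srcs=(6, 7)),  # overwrites r3
]
print(elidable_writes(window))  # -> {'MUL0'}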

the normal methods by which the same end-result is achieved are to either:

* have a complete periodic [snapshot] "Historic State", to which the
entire register and CSR state is "rolled back" (see the sketch after
this list).
* destroy the ENTIRETY of the current Function Unit Reservation State,
roll back dozens of instructions, wait for the processor to stabilise,
then proceed in SINGLE issue mode very slowly, switching off
operand-forwarding and other critical power-reducing and performance
optimisations.
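
for comparison, a minimal sketch (python, names and layout entirely
placeholder) of what the first of those alternatives - the periodic
full-state snapshot - boils down to:

# minimal sketch of periodic-snapshot rollback (first bullet above).
# structure and names are illustrative placeholders, not a real design.
class SnapshotState:
    def __init__(self, n_regs=32):
        self.regs = [0] * n_regs
        self.csrs = {}
        self._snapshots = []       # list of (pc, regs copy, csrs copy)

    def take_snapshot(self, pc):
        """periodically save a full copy of the architectural state."""
        self._snapshots.append((pc, list(self.regs), dict(self.csrs)))

    def roll_back(self):
        """on mis-speculation or an exception, throw away *everything*
        done since the last snapshot and restore the saved state."""
        pc, regs, csrs = self._snapshots.pop()
        self.regs, self.csrs = regs, csrs
        return pc                  # resume (slowly) from here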

such a feature would be flat-out impossible to add to a single-issue
in-order pipelined design, as the whole concept is critically
dependent on the dynamic analysis of multiple in-flight instructions,
based on the allocation to Reservation Stations / Function Units.  a
single-issue in-order pipeline *has* no in-flight instructions, and
*has* no Reservation Stations or Function Units.

what i'm trying to get across here is: by comparison, a barrel
processor is a huge technological step backwards, and is, i feel, a
completely wrong fit for use as a hybrid CPU-GPU, and would be complex
and require abandoning half a year's research if used as a dedicated
GPU.

by contrast, even with the low clock rate, a barrel processor *is*
perfect for bit-banged soft implementations of IO peripherals, and
that's traditionally exactly what it's been used for.


also, i remember now: i discussed / evaluated the idea of
single-porting the register file with mitch alsup.  he suggested using
reduced-ported
register files and extending the pipeline to read op1 as a first
stage, op2 as a second stage, op3 as a third, then have a 4-stage FMAC
and finally a write stage, for a total of 8 stages (instead of the
normal 6, where ops 1/2/3 are potentially done in a single stage).

the problem was: to get the equivalent performance, you needed *FOUR*
times the data.  i.e. if you wanted to do 4-vectors, the concept
forces you to do *four* 4-vectors at once.
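
as a rough illustration, here is a back-of-envelope model (python) of
the 8-stage arrangement described above: three single-operand read
stages, a 4-stage FMAC, and a write stage.  the stage names and the
occupancy print-out are my own shorthand; the "four 4-vectors" figure
comes from the discussion with mitch, it is not derived by this toy.

# back-of-envelope model of the reduced-port 8-stage pipeline.
# visualises how deep the pipe is, and therefore how much independent
# work has to be in flight to keep it full.
STAGES = ["rd-op1", "rd-op2", "rd-op3",
          "fmac-1", "fmac-2", "fmac-3", "fmac-4", "write"]

def occupancy(n_ops, n_cycles=12):
    """print which op sits in which stage, cycle by cycle, assuming one
    new *independent* op enters the pipeline every cycle."""
    for cycle in range(n_cycles):
        cells = []
        for s, name in enumerate(STAGES):
            i = cycle - s          # index of the op in stage s
            cells.append(f"{name}:op{i}" if 0 <= i < n_ops else f"{name}:---")
        print(f"cycle {cycle:2}:  " + "  ".join(cells))

# a result is only written back 8 cycles after its first read stage, so
# anything dependent on it stalls that long: keeping the FMAC busy means
# interleaving several independent vectors' worth of work - which, per
# the discussion with mitch, worked out to *four* 4-vectors at once.
occupancy(n_ops=8)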

> > and have the kazan codebase modified to
> > include an IPC/RPC mechanism that was capable of packaging up all API
> > calls, shipping them over to the GPU and having it execute things
> > there.
> >
> We don't need IPC/RPC. we can still share all the memory and be inside the
> same process and use all the standard inter-thread synchronization
> mechanisms. sharing memory like that happens on most mobile gpus anyway.
> I'm implementing inter-thread communication anyway since we want the gpu
> work to not be stuck on a single core.

 interesting.  okay.

l.


