[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Wed Jan 23 04:13:06 GMT 2019

On Tue, Jan 22, 2019 at 2:00 PM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

>  one option worth investigating: use the lower-numbered registers
> (x0-x31) to indicate implicitly that they are to be scalar.  i.e. when
> the extension prefix (for either src or dest) is 0b00, it indicates
> "these are scalar not vector".
>
See my previous reply (sent a few minutes ago).
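
For concreteness, here is a minimal sketch of the implicit-scalar rule quoted
above, as I read it. It assumes a 2-bit extension prefix that supplies the
upper bits of a 7-bit register number; the function name and the exact bit
layout are made up for illustration and are not taken from the SV spec or
from spike-sv.

```python
def decode_reg(prefix2, reg5):
    """Return (register number, is_scalar) for one operand.

    Assumption: the 2-bit prefix forms bits 6:5 of the register number.
    """
    if not 0 <= prefix2 < 4:
        raise ValueError("prefix must fit in 2 bits")
    if not 0 <= reg5 < 32:
        raise ValueError("register field must fit in 5 bits")
    regnum = (prefix2 << 5) | reg5
    is_scalar = prefix2 == 0b00  # prefix 0b00 selects x0-x31: implicitly scalar
    return regnum, is_scalar
```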

> > > > but may need to change the encoding for
> > > > the more common operations to accommodate vector-scalar modes for
> > > > power-efficiency and lower register pressure.
> > >
> > >
> > > Changing the encoding has huge software implications.
> > >
> > Yeah, but they are easily manageable at this stage of the process since we
> > haven't started the instruction decoder or compiler. Since we are only
> > changing how the prefixed instructions are encoded, it shouldn't change
> > much with spike, since we will still need to add decoding the prefixed
> > instructions anyway.
>
>  i'm not sure if you're aware how i implemented spike-sv, or how
> crucial the strategic goal of not modifying RV base is, and how that
> influenced how SV was designed.
>
Yeah, I'm aware of how you implemented spike-sv. The benefit of not adding
extra instructions is that binutils doesn't need to be changed at all. The
compilers will still need a lot of work to support the CSR-only system
(assuming they are to take advantage of vectorization).

I'm proposing the prefix system because I think that the CSR-only system is
insufficient for GPU applications since the CSRs will need to be changed
every few instructions.
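
To make the trade-off concrete, here is a rough back-of-the-envelope model.
Everything in it is an assumption for illustration: one 4-byte CSR write per
configuration change, 6-byte prefixed instructions, and a made-up instruction
sequence. It is not a measurement of SV or of any real shader.

```python
def csr_only_bytes(ops, insn_bytes=4, csr_write_bytes=4):
    """Code size when a CSR write is emitted on every vector-config change.

    `ops` is a list of config labels; None marks a plain scalar instruction,
    which is assumed to need no CSR write.
    """
    total = 0
    current = None
    for cfg in ops:
        if cfg is not None and cfg != current:
            total += csr_write_bytes  # assumed: one instruction per change
            current = cfg
        total += insn_bytes
    return total


def prefixed_bytes(ops, insn_bytes=4, prefixed_insn_bytes=6):
    """Code size when each vector instruction carries its own prefix."""
    return sum(insn_bytes if cfg is None else prefixed_insn_bytes
               for cfg in ops)


# A made-up sequence that keeps switching between vec4 and vec3 work.
ops = ["vec4", "vec3", "vec4", None, "vec3", "vec4"]
print(csr_only_bytes(ops), prefixed_bytes(ops))
```

Which side wins depends entirely on how often the configuration changes and
on the assumed sizes: in this model, a long run with a stable configuration
comes out smaller in the CSR-only form.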

Note that I am not proposing that the unprefixed instructions change their
meaning; otherwise we could just design a whole new ISA however we like
and add a RISC-V compatibility mode.

What I am proposing is that the prefixed versions of jal (and similar)
could be reallocated. Since jal is basically impossible to vectorize, there
isn't any reason to have a prefixed version of it, so we could use that
encoding for something else instead of wasting the space.
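
As a sketch of what that reallocation could look like in a decoder: the
major opcode values below are the standard RISC-V ones, but the
`REALLOCATED` table, the `classify` helper, and the idea that the prefix is
already known as a boolean are all hypothetical.

```python
# Standard RISC-V major opcodes (instruction bits 6:0).
OPCODE_OP = 0b0110011
OPCODE_JAL = 0b1101111

# Hypothetical: prefixed encodings that are handed over to other uses.
REALLOCATED = {OPCODE_JAL: "vector-scalar op"}


def classify(insn32, prefixed):
    """Classify a 32-bit instruction word by major opcode and prefix."""
    opcode = insn32 & 0x7F
    if not prefixed:
        return "scalar"  # the unprefixed meaning is left untouched
    if opcode in REALLOCATED:
        return REALLOCATED[opcode]  # prefixed jal slot reused
    return "vector"


JAL_X0_0 = 0x0000006F      # jal x0, 0
ADD_X1_X2_X3 = 0x003100B3  # add x1, x2, x3
```

The point is only that the unprefixed path never consults the table, so
plain RV code decodes exactly as before.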

>
>  in spike-sv there *are* no changes to the decode phase.  at all.
> aside from element width over-rides (which are done in a "global"
> overview fashion) there's absolutely no changes to the meaning of the
> spike emulated-instructions compared to their scalar variants, either,
> with one exception that's handled in an "overview" (modular-like)
> fashion, and that's branch-compare operations.
>
>  that means that in turn there are absolutely no changes - whatsoever
> - to binutils.  absolutely none.
>
>  the modifications to add element-width overrides were done through
> turning various critical strategic macros (zero and sign extension in
> particular) into functions, that were added to a c++ class, that then
> "redirected" their arguments through a processing phase.
>
>  i *did not touch* the ALU side of spike.
>
> i *did not alter* the decode phase *at all*.
>
> consequently i was able to complete spike-sv within i think it was
> around 6-8 weeks.
>
The effort required to implement spike-sv doesn't reflect the effort
required for the compiler, which I'm estimating as at least 3-6 months for
LLVM alone, with prefixed instructions instead of CSR-only reducing the
implementation time by 2-3 weeks. LLVM has an integrated assembler, so we
shouldn't need to modify binutils to get LLVM to work; we can leave that
for later.

If I were to sort the software project portions by how much time I think I
would need to complete them:
With CSR-only:
binutils: 0 days (no changes needed)
spike-sv: maybe a week of additional time to fix bugs, etc. using your code
as a base.
Linux kernel: 2-3 weeks
LLVM: 4-7 months
GCC: 5-7 months (unless we get someone more familiar with GCC's internals)

With prefixes only:
Linux kernel: 2-3 weeks
binutils: 3-4 weeks
spike-sv: 3-4 weeks
LLVM: 3-6 months
GCC: 5-7 months

With both:
Linux kernel: 2-3 weeks
binutils: 3-4 weeks
spike-sv: 3-4 weeks
LLVM: 4-8 months
GCC: 6-8 months

Note that we can leave GCC for later.
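
A rough tally of the estimates above, leaving GCC out. This assumes the
pieces are done one after another by one person and that a month is 4
weeks; both are simplifications, and "maybe a week" is entered as exactly 1.

```python
WEEKS_PER_MONTH = 4  # assumption, used to convert the month estimates

# (low, high) estimates in weeks, copied from the lists above, without GCC.
estimates = {
    "CSR-only": {
        "binutils": (0, 0),
        "spike-sv": (1, 1),
        "Linux kernel": (2, 3),
        "LLVM": (4 * WEEKS_PER_MONTH, 7 * WEEKS_PER_MONTH),
    },
    "prefixes only": {
        "Linux kernel": (2, 3),
        "binutils": (3, 4),
        "spike-sv": (3, 4),
        "LLVM": (3 * WEEKS_PER_MONTH, 6 * WEEKS_PER_MONTH),
    },
    "both": {
        "Linux kernel": (2, 3),
        "binutils": (3, 4),
        "spike-sv": (3, 4),
        "LLVM": (4 * WEEKS_PER_MONTH, 8 * WEEKS_PER_MONTH),
    },
}


def total_weeks(parts):
    """Sum the low and high ends of every (low, high) estimate."""
    low = sum(lo for lo, _ in parts.values())
    high = sum(hi for _, hi in parts.values())
    return low, high


for name, parts in estimates.items():
    low, high = total_weeks(parts)
    print(f"{name}: {low}-{high} weeks")
```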

To implement the Linux kernel code, we will want at least binutils
implemented so we can write assembly to save/restore state, or we can try
to compile the kernel using clang (which may take just as long to get
working as implementing binutils).

> > > > We could use the prefixed jal encoding as a different opcode for
> > > > vector/scalar as jal is useless when vectorized.
> > >
> > >
> > > >
> > > There are a ton of non-vectoriseable ops, the problem is that there are
> > > nowhere near enough.
> > >
> > > I like the idea, the problem is that the entire vectoriseable opcode
> > > space needs to be fitted into the overloaded space.
> > >
> > It all still fits, we're just reassigning the non-vectorizeable portions.
>
>  the reassignment *is* a huge step in and of itself (which has me
> concerned as to the cost of development of the associated
> modifications to llvm, gcc and binutils), and i'm not sure if we're
> understanding correctly.

>  copies of the entirety of the RV opcodes - those which are to remain
> scalar - need to be made.
>
We would just use the unprefixed versions when we needed scalar operations,
or we could set VL to 1 or use the vlp4 scalar option.

>
>  so how can the entirety of the RV opcode space - around a hundred
> instructions - fit into a few (reassigned) opcodes?
>
>  or, were you envisioning only doing a few opcodes?  or some new ones?
>
> > You can look at it this way: we can always run any op using broadcast then
> > the vector-vector version, so if we make the most common ops have a
> > vector-scalar mode, then that is similar to C in that it saves space, time,
> > energy, etc. but it doesn't make the underlying vectorized operation more
> > or less possible, since you can always use broadcast instructions to
> > convert any scalar inputs to vectors then run the vector-vector version.
>
>  ok so a 2-step process?
>
Yup.

>
> > >
> > > Plus all future opcodes.
> > >
> > all future opcodes that use the OP, OP-32, or FP opcodes will have
> > vector-scalar versions (includes at least part of the B extension), all
> > others (assuming we don't reassign more) will have only vector-vector
> > versions and will need a broadcast for scalar args.
>
>  do you mean, we have to make a broadcasting opcode which takes scalar
> ops and broadcasts them to vector destination regs?
>
I think having a broadcast op is necessary anyway. If we can encode any
register argument as scalar instead of vector, we can just use a mv with
a vector dest and a scalar src.
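
A tiny model of the two-step process, with Python lists standing in for
vector registers. The helper names are made up; the only claim is that a
vector-scalar op gives the same result as a broadcast followed by the
vector-vector op.

```python
def broadcast(scalar, vl):
    """Model of a mv with a vector dest and a scalar src: VL copies."""
    return [scalar] * vl


def add_vv(a, b):
    """Element-wise vector-vector add."""
    if len(a) != len(b):
        raise ValueError("vector lengths differ")
    return [x + y for x, y in zip(a, b)]


def add_vs(a, s):
    """Vector-scalar add: the one-instruction form for the common case."""
    return [x + s for x in a]


v = [1, 2, 3, 4]
# The two-step form and the vector-scalar form agree.
assert add_vs(v, 10) == add_vv(v, broadcast(10, len(v)))
```

So the vector-scalar encoding is purely a saving (one instruction and one
temporary vector register fewer), not extra capability.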

>
>  i think the idea of setting x0-x31 as implicitly being scalar (as
> source or dest, i.e. when the extension prefix = 0b00) would achieve
> the same thing.
>
> > > So by overriding the jal and other space, the implications are as follows:
> > >
> > > * the rule about not modifying opcodes has been discarded, which implies a
> > > massive amount of compiler work
> > >
> > not much more than would be otherwise needed since (from what I recall) all
> > the RISCV instruction selection code only works with scalars right now
> > anyway.
>
>  if there are _any_ it automatically means maintaining a temporary
> fork of gcc, binutils and llvm.  that means that resources have to be
> committed to convincing upstream developers to accept the patches.
>
I had been planning the whole time on having a temporary fork of gcc,
binutils, llvm, linux, and gdb while we work on getting the changes
upstream.

>
>  that in turn requires a full audit and review process, and if they
> don't like them, and it's too late, resources have to be committed
> instead to a *permanent* hard fork of gcc, binutils and llvm.
>
That's what happens anytime you add something (anything) to the ISA. I'll
try my best to avoid permanent forks.

Jacob