[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Luke Kenneth Casson Leighton lkcl at lkcl.net
Wed Jan 23 08:53:43 GMT 2019



On Wed, Jan 23, 2019 at 4:13 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> >  i'm not sure if you're aware how i implemented spike-sv, or how
> > crucial the strategic goal of not modifying RV base is, and how that
> > influenced how SV was designed.
> >
> Yeah, I'm aware how you implemented spike-sv. The benefit of not adding
> extra instructions is that binutils doesn't need to be changed at all. The
> compilers will still need a lot of work to work with the CSR-only system
> (assuming they need to take advantage of vectorization).
>
> I'm proposing the prefix system because I think that the CSR-only system is
> insufficient for GPU applications since the CSRs will need to be changed
> every few instructions.

 really? destination vector-registers will become source registers for
the following instruction.  and it's possible to set 4 at a time, plus
it's stack-based with a "restricted window" that can mask out parts of
the "stack".

 what i like about the prefix concept is that yes there may be
circumstances where it's more efficient to not have the push-CSR,
instruction, pop-CSR.... yet at the same time there will also be
circumstances where it's more efficient to have push-CSR, instruction
instruction instruction pop-CSR.
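
 to make that trade-off concrete, here is a rough code-size sketch in
python.  the sizes are assumptions for illustration only, not taken
from any spec: a prefixed instruction is taken as 48 bits, a plain one
as 32, and push-CSR / pop-CSR are each assumed to cost one 32-bit
instruction.  `csr_cost` and `prefix_cost` are hypothetical names.

```python
# Assumed sizes in bits (illustrative only, not from any specification).
PLAIN_BITS = 32         # ordinary RV instruction
PREFIXED_BITS = 48      # assumed size of one prefixed instruction
CSR_PUSH_BITS = 32      # assumed cost of one "push-CSR" instruction
CSR_POP_BITS = 32       # assumed cost of one "pop-CSR" instruction


def csr_cost(n):
    """Bits for: push-CSR, n plain instructions, pop-CSR."""
    return CSR_PUSH_BITS + n * PLAIN_BITS + CSR_POP_BITS


def prefix_cost(n):
    """Bits for: n prefixed instructions, with no CSR setup."""
    return n * PREFIXED_BITS


for n in range(1, 7):
    p, c = prefix_cost(n), csr_cost(n)
    if p < c:
        winner = "prefix"
    elif p > c:
        winner = "csr"
    else:
        winner = "tie"
    print(n, p, c, winner)
```

 under these assumed sizes the two approaches tie at 4 instructions:
shorter runs favour the prefix, longer runs favour the CSR setup.
different real encodings would move the break-even point.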

> Note that I am Not proposing that the unprefixed instructions change their
> meaning,

 i know that.  it concerns me that the *prefixed* instructions have
their meaning changed.

> What I am proposing is that the prefixed versions of jal (and similar)
> could be reallocated.

 ... could be: i'd like to explore alternatives that avoid the need to
do so, though.

> Since there isn't any reason to have prefixed
> versions of it, being basically impossible to vectorize, we could use it
> for something else instead of wasting that encoding space.

 yehyeh this aspect is what i like.

> > consequently i was able to complete spike-sv within i think it was
> > around 6-8 weeks.
> >
> The effort required for implementing spike-sv doesn't reflect the effort
> required for the compiler,

 it's really really important to get the simulator done, as it gives
firm confidence to be able to move to the next phase.  plus it ensures
that the *entire* concept is walked, documented 100%, and unit tested
100%.

 *before* spending the time and money committing to the much more
expensive compiler modifications.

 plus, it gives a clear indicator - scaled down - of how much time and
effort is going to have to be put into the compiler (and the
hardware).

> which I'm estimating as at least 3-6 mo for LLVM
> only, with having the prefixed instructions instead of CSR-only reducing
> the time to implement by 2-3 weeks. LLVM has an integrated assembler, so we
> shouldn't need to modify binutils to get LLVM to work, we can leave that
> for later.
>
> If I were to sort the software project portions by how much time I think I
> would need to complete them:
> With CSR-only:
> binutils: 0 days (no changes needed)
> spike-sv: maybe a week of additional time to fix bugs, etc. using your code
> as a base.
> Linux kernel: 2-3 weeks
> LLVM: 4-7 months
> GCC: 5-7 months (unless we get someone more familiar with GCC's internals)
>
> With prefixes only:

> binutils: 3-4 weeks
> spike-sv: 3-4 weeks

 NO chance.  absolutely no chance.  there's documentation, unit tests,
discussion *and* binutils and spike-sv needed, not just binutils and
spike-sv.  we have to add a 48-bit prefix system (spike only does 16
and 32 at the moment), there's a new decode engine needed, new opcodes
- it's a *lot*.

 i'm estimating 4-6 MONTHS for that effort, not 1.5 to 2!

 SV initially took 2-3 months of discussion and documentation on
isa-dev.  spike-sv took just over 2 months (roughly 10 weeks: sep 24th -
nov 29th) including the unit tests and updating the documentation.
daniel helped, so it was about 1-2 more man-weeks than that.

 so basically, if the simulator effort (at least) doubles, that gives
us an indicator that the hardware and the LLVM compiler work would
approximately double as well, when compared to *not* modifying the
encoding [and so on].

 SV has been very, very carefully designed to simply add an extra
"dimension" to pre-existing RV instructions.  *NOT* to change the
meaning and encoding.

 this wasn't done just because it's fun and elegant to do so: it was
done *specifically* to save a huge amount of time and development effort
in both the software and the hardware.  several things that i really
wanted to add had to be ripped out in order to keep strictly to that
design strategy.

 even things like the broadcast (VSPLAT), they're done by issuing
multiple *scalar* RV operations into the instruction issue.  multiple
*unmodified* scalar RV operations.

 apart from the elwidth over-riding, external tweaking on
Branch-compare and LD/ST, SV could quite literally be implemented as a
hardware for-loop around an unmodified RV scalar processor.
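
 a minimal python sketch of that "for-loop around a scalar core" idea
follows.  it is an illustration only, not spike-sv code: `sv_issue`,
`scalar_add`, `element` and the `is_vec` dict (standing in for the CSR
register tagging) are hypothetical, and elwidth, predication, branches
and LD/ST are all left out.

```python
def scalar_add(regs, rd, rs1, rs2):
    """Plain scalar add, standing in for an unmodified RV operation."""
    regs[rd] = regs[rs1] + regs[rs2]


def element(reg, i, is_vec):
    """Register number for element i: vectors step, scalars stay put."""
    return reg + i if is_vec.get(reg, False) else reg


def sv_issue(scalar_op, regs, vl, rd, rs1, rs2, is_vec):
    """Issue the same scalar op up to vl times (hypothetical helper)."""
    for i in range(vl):
        scalar_op(regs,
                  element(rd, i, is_vec),
                  element(rs1, i, is_vec),
                  element(rs2, i, is_vec))
        if not is_vec.get(rd, False):
            break  # assumption: a scalar destination needs only one op


# vector-scalar add: x8..x11 = x16..x19 + x5
regs = list(range(32))          # regs[n] == n, so x0 == 0
sv_issue(scalar_add, regs, 4, 8, 16, 5, {8: True, 16: True})
print(regs[8:12])

# broadcast (splat) of x5 into x8..x11: "add x8, x5, x0" once per element
regs2 = list(range(32))
sv_issue(scalar_add, regs2, 4, 8, 5, 0, {8: True})
print(regs2[8:12])
```

 the second call shows the broadcast case: nothing but the same
unmodified scalar add is issued, once per element.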


> Note that we can leave GCC for later.

 yes.  fortunately.

> To implement the Linux kernel code, we will want at least binutils
> implemented so we can write assembly to save/restore state, or we can try
> and compile the kernel using clang (may take just as long to get working as
> implementing binutils).

 if it's interoperable (which it will be) i'm inclined to leave the
linux kernel for now, as a low-priority task.  and if there are only
CSRs needed to save the state, then binutils modification is not
needed.

 this is why i would like the prefix system to "shoe-horn" into the
existing SV (over-ride it).  i.e. the prefix system maps *onto* the
CSRs as a pure "override" of individual CSRs.

> >  copies of the entirety of the RV opcodes - those which are to remain
> > scalar - need to be made.
> >
> We would just use the unprefixed versions when we needed scalar operations,

 ... which means that scalar operations on the top regs, plus elwidth
overrides are no longer possible (etc.).  which defeats the object of
the exercise.

> or we could set VL to 1 or use the vlp4 scalar option.

 so i went back to the original message (took me a while to find it),
and the original idea that was raised was to use the
"un-vectoriseable" instructions (jal etc.) to do vector-scalar
operations, not scalar-scalar operations.

 so that implies that the ENTIRETY of the RV opcode space has to be
mirrored / re-mapped / redesigned to fit into the "un-vectoriseable"
[prefixed] space.

 as in: we have to fit 100 or so vector-scalar *and* 100 or so
scalar-vector opcodes into like... 5?  8?

 so that's why i recommended the use of 0b00 within the prefix to
specify scalar, and we discussed in the previous message to do a
slightly different encoding (which i liked), bit 0 indicating "scalar"
and to use x0-x63, and bit 1 indicating "vector" and to do x0-x127
with bit 0 always set to 0.

 then the over-ride is no longer needed.

> >  so how can the entirety of the RV opcode space - around a hundred
> > instructions - fit into a few (reassigned) opcodes?
> >
> >  or, were you envisioning only doing a few opcodes?  or some new ones?
> >
> > > You can look at it this way: we can always run any op using broadcast then
> > > the vector-vector version, so if we make the most common ops have a
> > > vector-scalar mode, then that is similar to C in that it saves space, time,
> > > energy, etc. but it doesn't make the underlying vectorized operation more
> > > or less possible, since you can always use broadcast instructions to
> > > convert any scalar inputs to vectors then run the vector-vector version.
> >
> >  ok so a 2-step process?
> >
> Yup.
>
> >
> > > >
> > > > Plus all future opcodes.
> > > >
> > > all future opcodes that use the OP, OP-32, or FP opcodes will have
> > > vector-scalar versions (includes at least part of the B extension), all
> > > others (assuming we don't reassign more) will have only vector-vector
> > > versions and will need a broadcast for scalar args.
> >
> >  do you mean, we have to make a broadcasting opcode which takes scalar
> > ops and broadcasts them to vector destination regs?
> >
> I think having a broadcast op is necessary anyway. If we can code for any
> register argument being scalar instead of vector, we can just use a mv with
> the dest vector and the src scalar.

 with the CSR-based SV, yes, this is what would be done: a vectorised
MV is twin-predicated, so it covers the entire range: VINSERT, VREDUCE,
VSPLAT, VGATHER, VSCATTER - everything.
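
 a simplified python sketch of a twin-predicated MV is below, to show
how one loop can act as splat, gather or scatter depending on which
operand is a vector and which predicate bits are set.  it is a sketch
under assumptions, not the spike-sv implementation: `twin_pred_mv` and
its parameters are hypothetical, predicates are plain integer bitmasks,
and zeroing and elwidth are ignored.

```python
def twin_pred_mv(regs, vl, rd, rs, rd_is_vec, rs_is_vec, pd, ps):
    """Copy elements from rs to rd, skipping masked-out elements on
    the source (ps) and the destination (pd) independently."""
    i = j = 0
    while i < vl and j < vl:
        if rs_is_vec:
            while i < vl and not (ps >> i) & 1:
                i += 1          # skip masked-out source elements
        if rd_is_vec:
            while j < vl and not (pd >> j) & 1:
                j += 1          # skip masked-out destination elements
        if i >= vl or j >= vl:
            break
        regs[rd + j] = regs[rs + i]
        if rs_is_vec:
            i += 1
        if rd_is_vec:
            j += 1
        else:
            break               # assumption: scalar destination, one copy


# splat: scalar x5 copied into every element of x8..x11
splat = list(range(32))
twin_pred_mv(splat, 4, 8, 5, True, False, 0b1111, 0b1111)
print(splat[8:12])

# gather: only source elements 1 and 3 are active, packed into x8, x9
gather = list(range(32))
twin_pred_mv(gather, 4, 8, 16, True, True, 0b1111, 0b1010)
print(gather[8:12])

# scatter: source elements 0, 1 land in destination elements 1 and 3
scatter = list(range(32))
twin_pred_mv(scatter, 4, 8, 16, True, True, 0b1010, 0b1111)
print(scatter[8:12])
```

 the same function body handles all three cases; only the vector tags
and the two masks change.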

 twin-predication will be a pain in the ass to implement in hardware:
in a first implementation, some of the modes will need to be done as a
single-issue hardware for-loop, they're that obtuse (particularly the
non-zeroing twin-predicate one).

 so the envisaged broadcast (VSPLAT) op already exists... *without*
needing MV to be modified.

> That's what happens anytime you add something (anything) to the ISA. I'll
> try my best to avoid permanent forks.

 as this is a libre project, there will be pressure on upstream to
take patches, that won't come from us, it will come from the
end-users, clamouring for upstream support.  debian certainly will not
host two copies of gcc or two copies of LLVM, so when there are
literally thousands to tens of thousands of developers using this
family of processors, any resistance will be overcome.

l.


