[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Jacob Lifshay programmerjake at gmail.com
Thu Jan 24 06:54:25 GMT 2019


One other thing we will want to keep in mind is the dynamic linker and the
static linkers: we need to ensure that any relocations we emit for our new
instructions work properly, either by ensuring they are compatible with the
existing software (preferred) or by adding new relocation types.

Jacob

On Wed, Jan 23, 2019, 01:49 Jacob Lifshay <programmerjake at gmail.com> wrote:

> On Wed, Jan 23, 2019, 00:59 Luke Kenneth Casson Leighton <lkcl at lkcl.net
> wrote:
>
>> ---
>> crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
>>
>> On Wed, Jan 23, 2019 at 4:13 AM Jacob Lifshay <programmerjake at gmail.com>
>> wrote:
>>
>> > >  i'm not sure if you're aware how i implemented spike-sv, or how
>> > > crucial the strategic goal of not modifying RV base is, and how that
>> > > influenced how SV was designed.
>> > >
>> > Yeah, I'm aware how you implemented spike-sv. The benefit of not adding
>> > extra instructions is that binutils doesn't need to be changed at all.
>> > The compilers will still need a lot of work to work with the CSR-only system
>> > (assuming they need to take advantage of vectorization).
>> >
>> > I'm proposing the prefix system because I think that the CSR-only
>> > system is insufficient for GPU applications since the CSRs will need to
>> > be changed every few instructions.
>>
>>  really? destination vector-registers will become source registers for
>> the following instruction.  and it's possible to set 4 at a time, plus
>> it's stack-based with a "restricted window" that can mask out parts of
>> the "stack".
>>
>>  what i like about the prefix concept is that yes there may be
>> circumstances where it's more efficient to not have the push-CSR,
>> instruction, pop-CSR.... yet at the same time there will also be
>> circumstances where it's more efficient to have push-CSR, instruction
>> instruction instruction pop-CSR.
>>
>> > Note that I am Not proposing that the unprefixed instructions change
>> > their meaning,
>>
>>  i know that.  it concerns me that the *prefixed* instructions have
>> their meaning changed.
>>
>> > What I am proposing is that the prefixed versions of jal (and similar)
>> > could be reallocated.
>>
>>  ... could be: i'd like to explore alternatives that avoid the need to
>> do so, though.
>>
>> > Since there isn't any reason to have prefixed
>> > versions of it, being basically impossible to vectorize, we could use it
>> > for something else instead of wasting that encoding space.
>>
>>  yehyeh this aspect is what i like.
>>
>> > > consequently i was able to complete spike-sv within i think it was
>> > > around 6-8 weeks.
>> > >
>> > The effort required for implementing spike-sv doesn't reflect the effort
>> > required for the compiler,
>>
>>  it's really really important to get the simulator done, as it gives
>> firm confidence to be able to move to the next phase.  plus it ensures
>> that the *entire* concept is walked, documented 100%, and unit tested
>> 100%.
>>
>>  *before* spending the time and money committing to the much more
>> expensive compiler modifications.
>>
>>  plus, it gives a clear indicator - scaled down - of how much time and
>> effort is going to have to be put into the compiler (and the
>> hardware).
>>
>> > which I'm estimating as at least 3-6 mo for LLVM
>> > only, with having the prefixed instructions instead of CSR-only reducing
>> > the time to implement by 2-3 weeks. LLVM has an integrated assembler,
>> > so we shouldn't need to modify binutils to get LLVM to work, we can leave
>> > that for later.
>> >
>> > If I were to sort the software project portions by how much time I
>> > think I would need to complete them:
>> > With CSR-only:
>> > binutils: 0 days (no changes needed)
>> > spike-sv: maybe a week of additional time to fix bugs, etc. using your
>> > code as a base.
>> > Linux kernel: 2-3 weeks
>> > LLVM: 4-7 months
>> > GCC: 5-7 months (unless we get someone more familiar with GCC's
>> > internals)
>> >
>> > With prefixes only:
>>
>> > binutils: 3-4 weeks
>> > spike-sv: 3-4 weeks
>>
> These estimations are how long it would take me to write the code and some
> tests and docs from already mostly complete specs. they may be biased since
> when I was younger, I would often write code for 12+ hr/day.
>
>>
>>  NO chance.  absolutely no chance.  there's documentation, unit tests,
>> discussion *and* binutils and spike-sv needed, not just binutils and
>> spike-sv.  we have to add a 48-bit prefix system (spike only does 16
>> and 32 at the moment), there's a new decode engine needed, new opcodes
>> - it's a *lot*.
>>
> for decoding prefixes, we can mostly have them recursively call the
> function to decode an instruction. Spike already supports decoding 48 and
> 64-bit instructions.
>
>>
>>  i'm estimating 4-6 MONTHs for that effort, not 1.5 to 2!
>>
>>  SV initially took 2-3 months of discussion and documentation on
>> isa-dev.  spike-sv took 2 months (10 weeks: sep 24th - nov 29th)
>> including the unit tests and updating the documentation.  daniel
>> helped, so it was about 1-2 more man-weeks than that.
>>
>>  so basically, doubling the amount of time (at least), gives us an
>> indicator that when it comes to doing the hardware and the LLVM
>> compiler work, that would approximately be doubled as well, when
>> compared to *not* modifying the encoding [and so on].
>>
> Most of the LLVM work would be independent of how the instructions are
> actually encoded, so that part won't change.
> The part that makes the CSR design take longer is that there are new
> levels of register allocation needed to handle the rename table, LLVM is
> not set up to implement that kind of thing currently.
> The prefix scheme won't need multiple levels of register allocation since
> the registers are directly specified in the instructions.
>
>>
>>  SV has been very, very carefully designed to simply add an extra
>> "dimension" to pre-existing RV instructions.  *NOT* to change the
>> meaning and encoding.
>>
>>  this wasn't done just because it's fun and elegant to do so: it was
>> done *specifically* to save a huge amount of time and development effort
>> in both the software and the hardware.  several things that i really
>> wanted to add had to be ripped out in order to keep strictly to that
>> design strategy.
>>
>>  even things like the broadcast (VSPLAT), they're done by issuing
>> multiple *scalar* RV operations into the instruction issue.  multiple
>> *unmodified* scalar RV operations.
>>
>>  apart from the elwidth over-riding, external tweaking on
>> Branch-compare and LD/ST, SV could quite literally be implemented as a
>> hardware for-loop around an unmodified RV scalar processor.
>>
> Yup.
>
>>
>>
>> > Note that we can leave GCC for later.
>>
>>  yes.  fortunately.
>>
>> > To implement the Linux kernel code, we will want at least binutils
>> > implemented so we can write assembly to save/restore state, or we can
>> > try and compile the kernel using clang (may take just as long to get
>> > working as implementing binutils).
>>
>>  if it's interoperable (which it will be) i'm inclined to leave the
>> linux kernel for now, as a low-priority task.
>
> ok
>
>>   and if there are only
>> CSRs needed to save the state, then binutils modification is not
>> needed.
>
>
>>  this is why i would like the prefix system to "shoe-horn" into the
>> existing SV (over-ride it).  i.e. the prefix system maps *onto* the
>> CSRs as a pure "override" of individual CSRs.
>>
>> > >  copies of the entirety of the RV opcodes - those which are to remain
>> > > scalar - need to be made.
>> > >
>> > We would just use the unprefixed versions when we needed scalar
>> > operations,
>>
>>  ... which means that scalar operations on the top regs, plus elwidth
>> overrides are no longer possible (etc.).  which defeats the object of
>> the exercise.
>>
>> > or we could set VL to 1 or use the vlp4 scalar option.
>>
>>  so i went back to the original message (took me a while to find it),
>> and the original idea that was raised was to use the
>> "un-vectoriseable" instructions (jal etc.) to do vector-scalar
>> operations, not scalar-scalar operations.
>>
>>  so that implies that the ENTIRETY of the RV opcode space has to be
>> mirrored / re-mapped / redesigned to fit into the "un-vectoriseable"
>> [prefixed] space.
>>
>>  as in: we have to fit 100 or so vector-scalar *and* 100 or so
>> scalar-vector opcodes into like... 5?  8?
>>
>>  so that's why i recommended the use of 0b00 within the prefix to
>> specify scalar, and we discussed in the previous message to do a
>> slightly different encoding (which i liked), bit 0 indicating "scalar"
>> and to use x0-x63, and bit 1 indicating "vector" and to do x0-x127
>> with bit 0 always set to 0.
>>
> I agree that the register numbers to do scalar is better since we need the
> space for other things like predication.
>
>>
>>  then the need to over-ride becomes unnecessary.
>>
>> > >  so how can the entirety of the RV opcode space - around a hundred
>> > > instructions - fit into a few (reassigned) opcodes?
>> > >
>> > >  or, were you envisioning only doing a few opcodes?  or some new ones?
>> > >
>> > > > You can look at it this way: we can always run any op using
>> > > > broadcast then the vector-vector version, so if we make the most
>> > > > common ops have a vector-scalar mode, then that is similar to C in
>> > > > that it saves space, time, energy, etc. but it doesn't make the
>> > > > underlying vectorized operation more or less possible, since you can
>> > > > always use broadcast instructions to convert any scalar inputs to
>> > > > vectors then run the vector-vector version.
>> > >
>> > >  ok so a 2-step process?
>> > >
>> > Yup.
>> >
>> > >
>> > > > >
>> > > > > Plus all future opcodes.
>> > > > >
>> > > > all future opcodes that use the OP, OP-32, or FP opcodes will have
>> > > > vector-scalar versions (includes at least part of the B extension),
>> > > > all others (assuming we don't reassign more) will have only
>> > > > vector-vector versions and will need a broadcast for scalar args.
>> > >
>> > >  do you mean, we have to make a broadcasting opcode which takes scalar
>> > > ops and broadcasts them to vector destination regs?
>> > >
>> > I think having a broadcast op is necessary anyway. If we can code for
>> > any register argument being scalar instead of vector, we can just use a
>> > mv with the dest vector and the src scalar.
>>
>>  the CSR SV, yes, this is what would be done: a vectorised MV is
>> twin-predicated, so it covers the entire range: VINSERT, VREDUCE,
>> VSPLAT, VGATHER, VSCATTER - everything.
>>
> doesn't actually cover all of vgather/vscatter:
> try implementing:
> for(i = 0; i < VL; i++)
> {
>     x[i] = x[bitreverse(i)];
> }
> which can be done with a single vgather/vscatter but not with twin
> predication.
>
>>
>>  twin-predication will be a pain in the ass to implement in hardware:
>> some of the modes, a first implementation will need to be done as a
>> single-issue hardware for-loop, they're that obtuse (particularly the
>> non-zeroing twin-predicate one).
>>
>>  so the envisaged broadcast (VSPLAT) op already exists... *without*
>> needing MV to be modified.
>>
>> > That's what happens anytime you add something (anything) to the ISA.
>> > I'll try my best to avoid permanent forks.
>>
>>  as this is a libre project, there will be pressure on upstream to
>> take patches, that won't come from us, it will come from the
>> end-users, clamouring for upstream support.  debian certainly will not
>> host two copies of gcc or two copies of LLVM, so when there are
>> literally thousands to tens of thousands of developers using this
>> family of processors, any resistance will be overcome.
>>
> good to be optimistic.
>
> Jacob
>
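For reference, the bit-reversal permutation quoted above can be sketched in
plain C as a gather through an explicit index vector. This is only an
illustration of the semantics, not SV or RVV code: the names bitreverse,
vgather and bitreverse_permute are hypothetical, a separate destination
array is assumed so that no element is overwritten before it is read, and
VL is assumed to be a power of two no larger than 64. The index sequence it
produces (0, 4, 2, 6, 1, 5, 3, 7 for VL = 8) is not monotonic, which is
presumably why a pair of predicate masks on an in-order move cannot express it.

```c
#include <stddef.h>

/* Reverse the low `bits` bits of `i` (hypothetical helper). */
unsigned bitreverse(unsigned i, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        /* Shift the result left and append bit b of i. */
        r = (r << 1) | ((i >> b) & 1u);
    }
    return r;
}

/* General gather with arbitrary indices: dst[i] = src[idx[i]]. */
void vgather(int *dst, const int *src, const unsigned *idx, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        dst[i] = src[idx[i]];
    }
}

/*
 * dst[i] = src[bitreverse(i)] for vl = 1 << log2_vl elements.
 * Assumption: log2_vl <= 6, so the index vector fits in 64 entries.
 */
void bitreverse_permute(int *dst, const int *src, unsigned log2_vl)
{
    unsigned idx[64];
    if (log2_vl > 6) {
        return; /* outside the range this sketch supports */
    }
    size_t vl = (size_t)1 << log2_vl;
    for (size_t i = 0; i < vl; i++) {
        idx[i] = bitreverse((unsigned)i, log2_vl);
    }
    vgather(dst, src, idx, vl);
}
```

Calling bitreverse_permute(dst, src, 3) with src = {0, 1, ..., 7} fills dst
in the bit-reversed order given above.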

