[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Thu Jan 17 09:38:14 GMT 2019

On Thu, Jan 17, 2019 at 6:12 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Thu, Jan 10, 2019 at 12:34 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > On Thu, Jan 10, 2019 at 1:04 AM Jacob Lifshay <programmerjake at gmail.com>
> > wrote:
> >
> > > I think that adding 16-bit instruction prefixes will be useful to encode
> > > the high bits of the register numbers and extra bits for stuff like
> > > selecting vectorization settings since those will change rapidly enough
> > > that constantly writing to the rename table csrs may use more instruction
> > > bandwidth.
> >
> >  darn it, i was hoping that wouldn't happen.
> >
> >  an alternative is that RVV has a way to set multiple settings at once,
> >  using a pattern.  however SV is a bit more complicated.
> >
> >  another alternative is to have not just one set of CSR settings but
> >  multiple of them, and allow bank-switching.
> >
> >
> > > The encoding I was envisioning will change depending on the underlying
> > > instruction.
> > >
> > > One of the important parts is that a prefixed 16-bit instruction fits in
> > > the 32-bit custom space, a prefixed 32-bit instruction fits in the reserved
> > > 48-bit space, and a prefixed 48-bit instruction fits in the 64-bit space.
> > > This allows them to not conflict with other standard/custom instructions
> > > allowing any instruction to be prefixed.
> >
> >  yes, this concept was discussed (i think) some time last year.
> >  also, it means that Compressed (16-bit) instructions *also* get extended
> >  to only 32-bit, whilst still keeping the prefixes.
> >
> >  however for extending the 16-bit C opcodes, they will need 4 extra
> >  bits (per register) to extend to the full 128 regs.  we may end up using the
> >  entire 48-bit opcode space, although C opcodes have fewer operands.
> >
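For reference, the length classes mentioned above can be told apart from the low bits of the first 16-bit parcel. The following is a minimal sketch assuming the standard RISC-V instruction-length encoding; the function names are hypothetical, and `prefixed_length_bits` only restates the proposal that a 16-bit prefix moves an instruction into the next length class.

```python
def insn_length_bits(first_halfword: int) -> int:
    """Classify instruction length from the lowest 16 bits."""
    h = first_halfword & 0xFFFF
    if h & 0b11 != 0b11:
        return 16          # compressed: low two bits are not 0b11
    if h & 0b11100 != 0b11100:
        return 32          # low two bits 0b11, bits [4:2] not 0b111
    if h & 0b111111 == 0b011111:
        return 48          # bits [5:0] == 0b011111
    if h & 0b1111111 == 0b0111111:
        return 64          # bits [6:0] == 0b0111111
    raise ValueError("longer or reserved length encoding")


def prefixed_length_bits(base_bits: int) -> int:
    """Assumed rule from the proposal: prefix adds one 16-bit parcel."""
    if base_bits not in (16, 32, 48):
        raise ValueError("only 16/32/48-bit instructions are prefixed here")
    return base_bits + 16
```

So a prefixed C instruction would occupy 32 bits, a prefixed 32-bit instruction 48 bits, and a prefixed 48-bit instruction 64 bits.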
> Since C opcodes are only used for compressing commonly used instructions
> and we can use the full opcodes to access everything, we could have the
> prefixed C opcodes only specify some of the registers. Since the prefixed
> versions are going to always be vectorized, I think multiplying the
> register number by 4 or 8 is a good idea, allowing us to use the extra
> instruction bits to specify VL multipliers and other misc things.

 let me think through if that's always true (or likely to be always
true)... predicated scalar-ops: probably not common.  elwidth-adjusted
scalar ops: *maybe* used a lot...
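
To make the "multiply the register number by 4 or 8" idea concrete, here is a small sketch. The function name, the 128-register limit check and the error handling are assumptions for illustration only, not an agreed encoding.

```python
NUM_REGS = 128  # register file size discussed in this thread


def scaled_reg(field: int, scale: int) -> int:
    """Map a short register field to a full register number by scaling."""
    if scale not in (4, 8):
        raise ValueError("only x4 and x8 scaling are considered here")
    reg = field * scale
    if reg >= NUM_REGS:
        raise ValueError("scaled register number exceeds the register file")
    return reg
```

With a multiplier of 4, a 5-bit field reaches registers 0, 4, ..., 124; with a multiplier of 8, a 3-bit field reaches 0, 8, ..., 56, and a 4-bit field would reach up to 120.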

> >  oo, one idea is: on C, still use only 2 bits, and let it be the top 2
> > bits.  so it's xx xxx 00 where xx is the 2-bit bank, xxx is the 3-bit
> > reg num from the C instruction.
> >
> >  the same trick could hypothetically be applied to 32-bit, with say a
> > single 0 in the bottom of the reg num.  the justification: if using
> > this for vectorisation, the group of elements may be aligned on an
> > even boundary (LSB=0) and for C on a "modulo 4 = 0" boundary
> > (LSBs='b00)
> >
> > the only issue there is, how do you access the upper registers as scalars?
> >
> I think that accessing the upper registers as scalars will be uncommon
> enough that we can just set VL to 1 and use a vector instruction.

 VL=1 is equivalent to scalar.  i'm leaning towards multiplying by 4 or 8.
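
The "xx xxx 00" layout quoted above can be sketched as follows. This is only an illustration: the function name and argument names are hypothetical, and it assumes the 2-bit bank comes from the prefix and the 3-bit number from the C instruction.

```python
def c_bank_reg(bank: int, creg: int) -> int:
    """Build a 7-bit register number laid out as: bank(2) | creg(3) | 00."""
    if not 0 <= bank < 4:
        raise ValueError("bank must fit in 2 bits")
    if not 0 <= creg < 8:
        raise ValueError("creg must fit in 3 bits")
    # The two low bits are always zero, so every result is a multiple of 4.
    return (bank << 5) | (creg << 2)
```

Every register reachable this way sits on a "modulo 4 = 0" boundary, which is the same as multiplying the 5-bit value (bank, creg) by 4.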

> >  the isa-mux scheme may be used to enable / disable the 48/64 prefix
> > extension scheme, which would allow us to use the entire encoding
> > space.  when this bank-prefixing scheme is disabled, the underlying
> > 48/64-bit opcode space becomes "standard" again.
> >
> We need to ensure that we won't need to use the 48/64-bit "standard"
> instructions with SV for that to work. I think it will work better to have
> the same encoding represent the same instruction every time, allowing us to
> not need a pipeline flush each time we need the other instructions.

 it won't be needed at all.  it's as if 48-bit "standard" instructions
became 49 bit (or 50 bit, or 51 bit).  you don't flush the pipeline
just because of that: you just insert the (hidden) bits into the
decode phase, just as if they had been loaded from the I-Cache.

 there is absolutely no need - at all - to do a pipeline flush.  at all.
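
A rough sketch of what "inserting hidden bits into the decode phase" could look like, in software terms. The function name and the choice to place the hidden bits above the fetched instruction are assumptions; the point is only that the decoder sees one wider word.

```python
def widen_for_decode(insn: int, insn_bits: int,
                     hidden: int, hidden_bits: int) -> int:
    """Concatenate hidden mode bits above a fetched instruction word."""
    if insn >> insn_bits:
        raise ValueError("instruction does not fit in insn_bits")
    if hidden >> hidden_bits:
        raise ValueError("hidden value does not fit in hidden_bits")
    # The decoder then treats this as an (insn_bits + hidden_bits)-bit opcode.
    return (hidden << insn_bits) | insn
```

For example, a 48-bit instruction plus one hidden bit is presented to decode as a 49-bit word, exactly as if all 49 bits had come from the I-Cache.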

> This will also make the compiler/debugger much simpler.

 true... well, given that there aren't any 48 or 64 bit instructions,
i'm not sure if we really need to care.

> > * elwidth setting for FP is quite important.  it's the only way to get
> > FP16 for example, and it's the only way to have the top 32-bits of a
> > 64-bit FP register not be wasted (i.e. pack in 2 FP32 values).
> >
> You forgot that the standard FP instructions already have a 16/32/64/128
> bit selector field that we can use.

 oink??  section 13.2 V20181221-Public-Review-draft "a new supported
format is added to the format field of *MOST* instructions".

 that's new.

 the caveat is: *most* instructions.  not all.  going through the F D
and Q sections, "width" is specified in all... it's just that it's
only S D H and Q.  we can.... probably get away with that.
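
As a sketch of how that format field could stand in for elwidth: this assumes the 2-bit fmt field sits in bits [26:25] of OP-FP instructions with the S/D/H/Q values as read from the draft above. The function name is hypothetical, and FP loads/stores, which carry their width elsewhere, are not covered.

```python
OP_FP = 0b1010011  # major opcode of the FP computational instructions

# fmt field value -> element width in bits (S, D, H, Q)
FMT_WIDTH = {0b00: 32, 0b01: 64, 0b10: 16, 0b11: 128}


def fp_elwidth(insn: int) -> int:
    """Return the element width implied by the fmt field of an OP-FP insn."""
    if insn & 0x7F != OP_FP:
        raise ValueError("not an OP-FP instruction")
    return FMT_WIDTH[(insn >> 25) & 0b11]
```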

 does leave integer without an elwidth.... is that such a great loss?
mm.... i'm not so sure it is.

> > 00 means "use the standard 5-bit regs".  that's wasteful of precious
> > encoding space.  i'm reeeeasonably confident that we can think of a
> > use for that.
> >
> On the other hand, it's really useful to be able to encode everything else
> the prefix can do and use it with the standard 32 regs, allowing the
> compiler to treat all the regs the same for vector operations. I would hate
> to have to move data out of the lower 32 regs before we can use vector ops.

 yeah this occurred to me afterwards, as well.  scratch that idea :)

> > it occurs to me that multiple prefixes may be problematic for the
> > instruction decode phase.  it's starting to get into CISC territory.
> > how many prefixings would be needed (or permitted)?
> >
> I'm proposing that we only allow a single prefix and for the encoding space
> that would be multiple prefixes in a row, we reassign it to other
> operations we will need.

 you lost me :)  can you illustrate with an example?

> > oh!  hang on.... something else just occurred to me: by having the
> > above alternative prefix encodings, it's possible to strip off (and
> > use) the bits from the standard 16-bit and 32-bit encoding.  that
> > means an extra 2 bits for a 16-bit op, and a full 5 bits for a 32-bit
> > op.  in the 32-bit case that's actually enough to be able to specify a
> > predicate (0 meaning "no predicate").
> >
> Actually, 16-bit ops use all their bits, there are not any constant bits
> that we can reassign.

 yeah i realised this once i'd looked more closely at the instruction
listing tables: those 2 bits are used to specify which quadrant is to
be used.

 it occurred to me yesterday that perhaps using the reserved opcode
{15-13=0b100} {12 free bits} {1-0=0b00} may actually be a better
all-round option here, on the basis that we get 2 more bits than if
using a 48-bit instruction encoding, and if it's a "C" encoding we can
do a type of "macro-op fusion".

 or, just have a state machine which reads C opcodes, sets up some
"state" that is cleared after the next-instruction-but-one.
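
A minimal sketch of recognising that reserved 16-bit pattern and pulling out the free bits. The function name is hypothetical. The field is taken here as bits 12 down to 2, which by count is 11 bit positions, so treat the exact payload width as an assumption.

```python
def c_prefix_payload(halfword: int):
    """Return the free bits of a {15-13=0b100}...{1-0=0b00} parcel, else None."""
    h = halfword & 0xFFFF
    if (h & 0b11) != 0b00 or (h >> 13) != 0b100:
        return None            # not the reserved pattern
    return (h >> 2) & 0x7FF    # bits [12:2]
```

A decoder (or the state machine described above) could call this on each 16-bit parcel and, when it returns a value, hold that value as prefix state for the instruction that follows.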

>  32-bit ops have the 2 LSB bits that we can reassign.

 ... which are set to 0b11.  it's better than nothing :)

l.