[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Thu Jan 10 07:40:32 GMT 2019

On Thu, Jan 10, 2019 at 1:04 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> I think that adding 16-bit instruction prefixes will be useful to encode
> the high bits of the register numbers and extra bits for stuff like
> selecting vectorization settings since those will change rapidly enough
> that constantly writing to the rename table csrs may use more instruction
> bandwidth.

 darn it, i was hoping that wouldn't happen.

 an alternative is that RVV has a way to set multiple settings at once,
 using a pattern.  however SV is a bit more complicated.

 another alternative is to have not just one set of CSR settings but
 multiple of them, and allow bank-switching.

> The encoding I was envisioning will change depending on the underlying
> instruction.
>
> One of the important parts is that a prefixed 16-bit instruction fits in
> the 32-bit custom space, a prefixed 32-bit instruction fits in the reserved
> 48-bit space, and a prefixed 48-bit instruction fits in the 64-bit space.
> This allows them to not conflict with other standard/custom instructions
> allowing any instruction to be prefixed.

 yes, this concept was discussed (i think) some time last year.
 also, it means that Compressed (16-bit) instructions *also* get extended
 to only 32-bit, whilst still keeping the prefixes.

 however for extending the 16-bit C opcodes, they will need 4 extra
 bits (per register) to extend to the full 128 regs.  we may end up using the
 entire 48-bit opcode space, although C opcodes have fewer operands.

 with 32-bit instructions, only 2 extra bits per register as a prefix are
 needed (as you outline below)

 oo, one idea is: on C, still use only 2 bits, and let it be the top 2
bits.  so it's xx xxx 00 where xx is the 2-bit bank, xxx is the 3-bit
reg num from the C instruction.

 the same trick could hypothetically be applied to 32-bit, with say a
single 0 in the bottom of the reg num.  the justification: if using
this for vectorisation, the group of elements may be aligned on an
even boundary (LSB=0) and for C on a "modulo 4 = 0" boundary
(LSBs='b00)

the only issue there is, how do you access the upper registers as scalars?
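
a minimal sketch of the "xx xxx 00" idea for C registers, to show which
of the 128 registers become reachable.  the function name is made up for
illustration, and the 3-bit C field is taken as a raw value here (any
remapping such as the standard x8-x15 offset is ignored, as an assumption).

```python
def c_reg_banked(bank, creg):
    """Build a 7-bit register number laid out as 'xx xxx 00'.

    bank: 2-bit prefix (top bits), creg: 3-bit register field from the
    C instruction. The two LSBs are forced to zero (hypothetical layout).
    """
    assert 0 <= bank < 4 and 0 <= creg < 8
    return (bank << 5) | (creg << 2)


# Every combination of bank and C register field.
reachable = sorted(c_reg_banked(b, r) for b in range(4) for r in range(8))

# Only registers on a "modulo 4 = 0" boundary can be named this way.
print(reachable)
```

only 32 of the 128 register numbers can be named directly this way, all
multiples of 4, which is exactly the scalar-access problem raised above.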

> For 32-bit underlying instructions, we can use the two lsb bits in the
> underlying instruction that specify that the instruction is 32 bits as
> extra bits:
>
> 0x00b5_0533 add x10, x10, x11
> becomes
> 0x00b5_0530_001f add x10, x10, x11 with 12 available bits (some of which we
> will need to leave constant for other uses of the 48-bit space).
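
the example above can be reproduced with a short sketch.  encode_add
follows the standard R-type layout; prefix_32bit is a hypothetical helper
name, and it leaves all the freed bits at zero because their layout has
not been decided.

```python
def encode_add(rd, rs1, rs2):
    """Encode the standard 'add rd, rs1, rs2' (R-type, funct7=0, funct3=0)."""
    opcode = 0b0110011  # OP major opcode
    return (rs2 << 20) | (rs1 << 15) | (rd << 7) | opcode


def prefix_32bit(insn):
    """Hypothetical: place a 32-bit instruction into the 48-bit space.

    The two LSBs of the 32-bit word (always 0b11) are cleared so they can
    be reused, and a 16-bit parcel whose low 6 bits are 0b011111 (the
    48-bit length marker) is placed below it. Free bits are left as zero.
    """
    assert insn & 0b11 == 0b11, "expected a 32-bit encoding"
    stripped = insn & ~0b11
    return (stripped << 16) | 0x001F


add = encode_add(10, 10, 11)
print(hex(add))                # 0xb50533
print(hex(prefix_32bit(add)))  # 0xb50530001f
```

the free bits are the 10 upper bits of the low parcel plus the 2 cleared
LSBs of the original instruction, giving the 12 mentioned.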

 there's a way round that, called the "isa-mux" scheme.  it's similar
to the proposed prefix scheme except it's "hidden" ISA
opcode-extending-bits that apply persistently rather than temporarily.

 the isa-mux scheme may be used to enable / disable the 48/64 prefix
extension scheme, which would allow us to use the entire encoding
space.  when this bank-prefixing scheme is disabled, the underlying
48/64-bit opcode space becomes "standard" again.

> I think we should use them this way:
> 2 for each of rs1, rs2, and rd to allow addressing 128 registers
> 2 for specifying a vl multiplier of 1x, 2x, 3x, or 4x
> 1 for selecting predicated/non-predicated with a fixed predicate register
> of x9/s1 (in the range of rvc registers and not reserved for something else)
> 2 for:
>     for 4 arg instructions like fma, 2 high bits of rs3
>     for integer, selecting packed modes from 8-bit, 16-bit, 32-bit, and
> 64-bit
>     we can pick something for other instruction types
> 1 as constant to allow other 48-bit instructions
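
as a sanity check on the budget, the proposed allocation can be written
down and summed.  the field names are labels invented for this sketch, and
no bit positions are implied.

```python
# Proposed use of the 12 free bits (names are illustrative only).
PREFIX_FIELDS = [
    ("rd_hi", 2),           # high bits of rd
    ("rs1_hi", 2),          # high bits of rs1
    ("rs2_hi", 2),          # high bits of rs2
    ("vl_mult", 2),         # vl multiplier: 1x, 2x, 3x or 4x
    ("predicated", 1),      # predicated / non-predicated (fixed x9/s1)
    ("type_dependent", 2),  # rs3 high bits, or packed mode, etc.
    ("constant", 1),        # left constant for other 48-bit instructions
]

TOTAL_BITS = sum(width for _, width in PREFIX_FIELDS)


def full_reg(hi2, lo5):
    """Combine 2 prefix bits with the 5-bit field into a 7-bit register number."""
    assert 0 <= hi2 < 4 and 0 <= lo5 < 32
    return (hi2 << 5) | lo5


print(TOTAL_BITS)        # 12
print(full_reg(3, 31))   # 127, the last of 128 registers
```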

couple of comments:

* setting VL and keeping it set across a range of instructions, it's
clear and explicit.  VL is a persistent global setting, basically.
usually if VL is set, it's definitely going to be used for a loop.
however... i *can* see the value of a "one-off" VL override (not in
loops, for example).

* by removing VL it actually becomes possible to consider proposing
this as a general-purpose RISC-V extension.

* the 2 bits for packed-mode being dependent on the (future) opcode:
this is a red flag, for me (makes me nervous).  it complicates the
decoder phase.  everything else proposed may be extracted using a few
gates, and stored in latches that the *next* part of the instruction
decoder may use.  i'd only be happy with this if it was a last resort.

* elwidth setting for FP is quite important.  it's the only way to get
FP16 for example, and it's the only way to have the top 32-bits of a
64-bit FP register not be wasted (i.e. pack in 2 FP32 values).

i wonder if one of the bits is best used to set the "type" of
extension.  by that i mean, if a bit is set, it indicates that
predication is to be set.  this would allow one prefix to specify a
predicate (in full, rather than only to use one hard-coded register).
however, the encoding space is so extremely small (see below) that it
may be better to use the 64-bit opcode space for specifying
predication.

also... given the extremely limited space, i wonder if it's a good
idea to have a 2-bit prefix for rd and a 2-bit prefix for *all*
rs1/2/3 registers?  that would allow a kind-of... bank-swapping.  a
2-bit prefix for *all* rd and rs1/2/3 would result in complete
isolation of registers into any given "bank", whereas 2-bit for src
and 2-bit for dest would allow a sequence of ops to access multiple
"banks".
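
a rough sketch of the "one bank for rd, one bank for all sources" variant.
expand_regs and its parameter names are hypothetical, and a 3-operand
instruction is assumed.

```python
def expand_regs(rd, rs1, rs2, rd_bank, rs_bank):
    """Apply a 2-bit destination bank and a shared 2-bit source bank.

    Each 5-bit register field is extended to 7 bits by placing the bank
    number in the top two bits. All source registers share rs_bank.
    """
    for r in (rd, rs1, rs2):
        assert 0 <= r < 32
    assert 0 <= rd_bank < 4 and 0 <= rs_bank < 4
    return (
        (rd_bank << 5) | rd,
        (rs_bank << 5) | rs1,
        (rs_bank << 5) | rs2,
    )


# add x10, x10, x11 with sources read from bank 1 and the result
# written to bank 2: data moves between banks in one instruction.
print(expand_regs(10, 10, 11, rd_bank=2, rs_bank=1))  # (74, 42, 43)
```

with a single bank field covering rd as well, the call would always pass
rd_bank == rs_bank, and results could never leave the bank they came from.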

oh: also... dang there's a lot here... :)

00 means "use the standard 5-bit regs".  that's wasteful of precious
encoding space.  i'm reeeeasonably confident that we can think of a
use for that.

> We can come up with something similar for 16 and 48-bit underlying
> instructions.
>
> Note that we won't end up with the problems with SIMD always needing to add
> more instructions

 [thank goodness... :) ]

> since the list of element types isn't going to expand and
> all of the instructions are vectorized with predication and variable vl.
>
> The prefixed instructions would bypass the SV rename table since the prefix
> specifies the high register bits and the predication.

 i'd advocate still _allowing_ the SV rename table to apply, in
instances where it's being used, except that for entries which have
been prefixed, the prefix takes precedence.  i haven't thought it
through, though.

 the reason i like the SV CSR table setup (which is now a "stack") is,
it applies to multiple registers.  there will be circumstances where
that's more efficient.  just as there will be circumstances where this
prefixing idea is more efficient.

> Multiple prefixes in a single instruction are reassigned to operations like
> reduction, packed type conversions, indexed/strided ld/st and others as
> needed.

it occurs to me that multiple prefixes may be problematic for the
instruction decode phase.  it's starting to get into CISC territory.
how many prefixings would be needed (or permitted)?

an optimisation of this approach is to use a 64-bit encoding to hold a
32-bit instruction.

 or, even a 48-bit encoding to hold, at the end, a 16-bit C.

 ok so looking at figure 1.1 of the RISC-V Spec, it says that the
48-bit encoding prefix is 'b011111.  that's 6 bits.  that only leaves
TEN bits total for use in this scheme, some of which need to be used
to say whether the opcode is 16-bit, 32-bit, or if the space is to be
left reserved.

 so:
 xxxxxxxx 11 'b011111 = reserved, for standard 48-bit (or future use,
or something)
 xxxxxxxx 00 'b011111 = encoding for 16-bit C to follow
 xxxxxxxx 01 'b011111 = encoding for 32-bit op to follow
 0xXX xxxxxxxx 10 'b011111 = encoding for 16-bit op to follow however
there are 8+16 bits of prefix to play with

and for 64-bit, the prefix is 'b0111111 and would probably be best
used to go straight to a 7+16 bits of prefix plus a "reserved".

so in the 48-bit space that's *only* 8 bits for extension-prefixes!
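
the 48-bit table above could be decoded roughly like this.  the bit
positions are an assumption read off the way the table is written: length
marker in bits 5:0, 2-bit selector in bits 7:6, 8 prefix bits in bits 15:8.
the names are made up for the sketch.

```python
SELECTOR_MEANING = {
    0b00: "16-bit C op follows",
    0b01: "32-bit op follows",
    0b10: "16-bit op follows, with 8+16 prefix bits",
    0b11: "reserved (standard 48-bit or future use)",
}


def classify_first_parcel(parcel):
    """Classify the first 16-bit parcel of an instruction.

    Returns (meaning, prefix_bits) if the parcel carries the 48-bit length
    marker 0b011111 in its low 6 bits, otherwise None.
    """
    if parcel & 0b111111 != 0b011111:
        return None  # not in the 48-bit encoding space
    selector = (parcel >> 6) & 0b11
    prefix_bits = (parcel >> 8) & 0xFF
    return SELECTOR_MEANING[selector], prefix_bits


print(classify_first_parcel(0xA55F))  # ('32-bit op follows', 165)
print(classify_first_parcel(0x0533))  # None: an ordinary 32-bit opcode
```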

example for 48-extending-32: 2 for rd, 2 for rs1/2/3, 2 for elwidth, 2
for... VL-override?

oh!  hang on.... something else just occurred to me: by having the
above alternative prefix encodings, it's possible to strip off (and
use) the bits from the standard 16-bit and 32-bit encoding.  that
means an extra 2 bits for a 16-bit op, and a full 5 bits for a 32-bit
op.  in the 32-bit case that's actually enough to be able to specify a
predicate (0 meaning "no predicate").

comprehensive! :)

l.