[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Thu Jan 17 10:19:49 GMT 2019

On Thu, Jan 17, 2019, 01:39 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Thu, Jan 17, 2019 at 6:12 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> >
> > On Thu, Jan 10, 2019 at 12:34 AM Luke Kenneth Casson Leighton
> > <lkcl at lkcl.net> wrote:
> >
> > > On Thu, Jan 10, 2019 at 1:04 AM Jacob Lifshay
> > > <programmerjake at gmail.com> wrote:
> > >
> > > > I think that adding 16-bit instruction prefixes will be useful to
> > > > encode the high bits of the register numbers and extra bits for stuff
> > > > like selecting vectorization settings since those will change rapidly
> > > > enough that constantly writing to the rename table csrs may use more
> > > > instruction bandwidth.
> > >
> > >  darn it, i was hoping that wouldn't happen.
> > >
> > >  an alternative is that RVV has a way to set multiple settings at once,
> > >  using a pattern.  however SV is a bit more complicated.
> > >
> > >  another alternative is to have not just one set of CSR settings but
> > >  multiple of them, and allow bank-switching.
> > >
> > >
> > > > The encoding I was envisioning will change depending on the
> > > > underlying instruction.
> > > >
> > > > One of the important parts is that a prefixed 16-bit instruction fits
> > > > in the 32-bit custom space, a prefixed 32-bit instruction fits in the
> > > > reserved 48-bit space, and a prefixed 48-bit instruction fits in the
> > > > 64-bit space. This allows them to not conflict with other
> > > > standard/custom instructions allowing any instruction to be prefixed.
> > >
> > >  yes, this concept was discussed (i think) some time last year.
> > >  also, it means that Compressed (16-bit) instructions *also* get
> > >  extended to only 32-bit, whilst still keeping the prefixes.
> > >
> > >  however for extending the 16-bit C opcodes, they will need 4 extra
> > >  bits (per register) to extend to the full 128 regs.  we may end up
> > >  using the entire 48-bit opcode space, although C opcodes have less
> > >  operands.
> > >
> > Since C opcodes are only used for compressing commonly used instructions
> > and we can use the full opcodes to access everything, we could have the
> > prefixed C opcodes only specify some of the registers. Since the prefixed
> > versions are going to always be vectorized, I think multiplying the
> > register number by 4 or 8 is a good idea, allowing us to use the extra
> > instruction bits to specify VL multipliers and other misc things.
>
>  let me think through if that's always true (or likely to be always
> true)... predicated scalar-ops: probably not common.  elwidth-adjusted
> scalar ops: *maybe* used a lot...
>

For the non-C instructions: if we are going to dedicate more than 1 bit to
unpredicated/predicated, we could use one of the combinations to represent a
scalar instruction:
00: scalar
01: vector unpredicated
10: vector predicated with pr0
11: vector predicated with pr1
We can decide later which registers pr0 and pr1 refer to.
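
A rough sketch of that 2-bit table, in Python purely for illustration. The
names `Mode` and `decode_mode` are made up here, and "pr0"/"pr1" are left as
opaque labels because the registers they refer to are not decided yet:

```python
from typing import NamedTuple, Optional


class Mode(NamedTuple):
    """Hypothetical decoded form of the 2-bit scalar/vector/predicate field."""
    vector: bool               # False means a plain scalar instruction
    predicate: Optional[str]   # "pr0", "pr1", or None when unpredicated


# Direct transcription of the table: 00, 01, 10, 11.
_MODES = {
    0b00: Mode(vector=False, predicate=None),
    0b01: Mode(vector=True, predicate=None),
    0b10: Mode(vector=True, predicate="pr0"),
    0b11: Mode(vector=True, predicate="pr1"),
}


def decode_mode(bits: int) -> Mode:
    """Map the 2-bit field to (vector?, predicate). Rejects wider values."""
    if bits not in _MODES:
        raise ValueError(f"mode field must be 2 bits, got {bits!r}")
    return _MODES[bits]
```

With this layout a scalar op costs no extra encoding: it is just the 00
combination of the same field.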

>
> > >  oo, one idea is: on C, still use only 2 bits, and let it be the top 2
> > > bits.  so it's xx xxx 00 where xx is the 2-bit bank, xxx is the 3-bit
> > > reg num from the C instruction.
> > >
> > >  the same trick could hypothetically be applied to 32-bit, with say a
> > > single 0 in the bottom of the reg num.  the justification: if using
> > > this for vectorisation, the group of elements may be aligned on an
> > > even boundary (LSB=0) and for C on a "modulo 4 = 0" boundary
> > > (LSBs='b00)
> > >
> > > the only issue there is, how do you access the upper registers as
> > > scalars?
> > >
> > I think that accessing the upper registers as scalars will be uncommon
> > enough that we can just set VL to 1 and use a vector instruction.
>
>  VL=1 is equivalent to scalar.  i'm leaning towards multiplying by 4 or 8.
>
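
A minimal sketch of the "bank, reg, zero LSBs" layout quoted above, in Python
for illustration only. `expand_reg` and its parameters are hypothetical. It
uses the raw register field from the instruction (ignoring the x8-x15 mapping
that standard C instructions apply to 3-bit fields), and the 1-bit bank for
the 32-bit case is my assumption to make 1 + 5 + 1 = 7 bits:

```python
def expand_reg(bank: int, reg: int, reg_bits: int, zero_bits: int) -> int:
    """Build a wide register number laid out as: bank | reg | zero LSBs.

    zero_bits=2 multiplies the register number by 4, zero_bits=3 by 8.
    """
    if not 0 <= reg < (1 << reg_bits):
        raise ValueError("reg does not fit in reg_bits")
    if bank < 0:
        raise ValueError("bank must be non-negative")
    return ((bank << reg_bits) | reg) << zero_bits


# C form "xx xxx 00": 2-bit bank, 3-bit reg, two zero LSBs (multiple of 4).
c_reg = expand_reg(0b10, 0b101, reg_bits=3, zero_bits=2)   # 0b1010100

# 32-bit form (assumed "x xxxxx 0"): 1-bit bank, 5-bit reg, one zero LSB.
r_reg = expand_reg(0b1, 0b00011, reg_bits=5, zero_bits=1)  # 0b1000110
```

Both forms produce 7-bit numbers covering the 128 regs, but only every 4th
(or 2nd) register is reachable, which is where the VL=1 workaround for
scalars comes in.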
> > >  the isa-mux scheme may be used to enable / disable the 48/64 prefix
> > > extension scheme, which would allow us to use the entire encoding
> > > space.  when this bank-prefixing scheme is disabled, the underlying
> > > 48/64-bit opcode space becomes "standard" again.
> > >
> > We need to ensure that we won't need to use the 48/64-bit "standard"
> > instructions with SV for that to work. I think it will work better to
> > have the same encoding represent the same instruction every time,
> > allowing us to not need a pipeline flush each time we need the other
> > instructions.
>
>  it won't be needed at all.  it's as if 48-bit "standard" instructions
> became 49 bit (or 50 bit, or 51 bit).  you don't flush the pipeline
> just because of that: you just insert the (hidden) bits into the
> decode phase, just as if they had been loaded from the I-Cache.
>
>  there is absolutely no need - at all - to do a pipeline flush.  at all.
>
There is: when those extra instruction bits change, all following
partially-decoded instructions need to be re-decoded.

>
> > This will also make the compiler/debugger much simpler.
>
>  true... well, given that there aren't any 48 or 64 bit instructions,
> i'm not sure if we really need to care.
>
> > > * elwidth setting for FP is quite important.  it's the only way to get
> > > FP16 for example, and it's the only way to have the top 32-bits of a
> > > 64-bit FP register not be wasted (i.e. pack in 2 FP32 values).
> > >
> > You forgot that the standard FP instructions already have a 16/32/64/128
> > bit selector field that we can use.
>
>  oink??  section 13.2 V20181221-Public-Review-draft "a new supported
> format is added to the format field of *MOST* instructions".
>
>  that's new.
>
>  the caveat is: *most* instructions.  not all.  going through the F D
> and Q sections, "width" is specified in all... it's just that it's
> only S D H and Q.  we can.... probably get away with that.
>
We can probably add the missing instructions as part of SV since they are
probably the bitcasting moves from f16 to i16 and similar.

>
>  does leave integer without an elwidth.... is that such a great loss?
> mm.... i'm not so sure it is.
>
Integer instructions don't have 4-argument FMA-style forms, so the bits that
would otherwise extend the 4th argument's register field can be used as an
elwidth override instead. We may want to add scalar/vector compare-branches
as well.

>
> > > 00 means "use the standard 5-bit regs".  that's wasteful of precious
> > > encoding space.  i'm reeeeasonably confident that we can think of a
> > > use for that.
> > >
> > On the other hand, it's really useful to be able to encode everything
> > else the prefix can do and use it with the standard 32 regs, allowing the
> > compiler to treat all the regs the same for vector operations. I would
> > hate to have to move data out of the lower 32 regs before we can use
> > vector ops.
>
>  yeah this occurred to me afterwards, as well.  scratch that idea :)
>
>
> > > it occurs to me that multiple prefixes may be problematic for the
> > > instruction decode phase.  it's starting to get into CISC territory.
> > > how many prefixings would be needed (or permitted)?
> > >
> > I'm proposing that we only allow a single prefix and for the encoding
> > space that would be multiple prefixes in a row, we reassign it to other
> > operations we will need.
>
>  you lost me :)  can you illustrate with an example?
>
if 0x1234 is a valid prefix and 0xABCD is a C instruction, then 0x1234
0xABCD means the prefixed version of 0xABCD, however, 0x1234 0x1234 0xABCD
means something entirely independent of the previous 2, such as
strided-ld/st.
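
The same example as a small Python sketch, for illustration only. Everything
here is hypothetical: `classify`, the `demo_is_prefix` predicate (which just
hard-codes the made-up 0x1234 value), the kind labels, and the assumption
that the reassigned form consumes exactly three 16-bit parcels. It also
assumes any non-prefix parcel is a 16-bit C instruction:

```python
from typing import Callable, Sequence, Tuple


def classify(parcels: Sequence[int],
             is_prefix: Callable[[int], bool]) -> Tuple[str, int]:
    """Return (kind, number of 16-bit parcels consumed) for the next op."""
    if not parcels:
        raise ValueError("no parcels to decode")
    if not is_prefix(parcels[0]):
        return ("plain", 1)
    if len(parcels) < 2:
        raise ValueError("truncated after prefix")
    if not is_prefix(parcels[1]):
        # prefix + instruction: the prefixed version of that instruction
        return ("prefixed", 2)
    if len(parcels) < 3:
        raise ValueError("truncated after second prefix parcel")
    # prefix + prefix + x: NOT a doubly-prefixed op, but separate encoding
    # space reassigned to something else (e.g. strided ld/st)
    return ("reassigned", 3)


def demo_is_prefix(parcel: int) -> bool:
    """Stand-in predicate: treats only the example value 0x1234 as a prefix."""
    return parcel == 0x1234
```

So `classify([0x1234, 0xABCD], demo_is_prefix)` gives a prefixed op of two
parcels, while `classify([0x1234, 0x1234, 0xABCD], demo_is_prefix)` gives one
reassigned op of three parcels. The decoder never has to stack prefixes.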

>
> > > oh!  hang on.... something else just occurred to me: by having the
> > > above alternative prefix encodings, it's possible to strip off (and
> > > use) the bits from the standard 16-bit and 32-bit encoding.  that
> > > means an extra 2 bits for a 16-bit op, and a full 5 bits for a 32-bit
> > > op.  in the 32-bit case that's actually enough to be able to specify a
> > > predicate (0 meaning "no predicate").
> > >
> > Actually, 16-bit ops use all their bits, there are not any constant bits
> > that we can reassign.
>
>  yeah i realised this once i'd looked more closely at the instruction
> listing tables: those 2 bits are used to specify which quadrant is to
> be used.
>
>  it occurred to me yesterday that perhaps using the reserved opcode
> {15-13=0b100} {12 free bits} {1-0=0b00} may actually be a better
> all-round option here, on the basis that we get 2 more bits than if
> using a 48-bit instruction encoding, and if it's a "C" encoding we can
> do a type of "macro-op fusion".
>
>  or, just have a state machine which reads C opcodes, sets up some
> "state" that is cleared after the next-instruction-but-one.
>
That might work, though that state will have to be preserved across
interrupts and context switches. Since there are 2 instructions in the
prefixed sequence, we need to be able to trap in the middle for things like
ld/st page faults, which makes it supervisor-visible that we've executed the
prefix but not the prefixee. That's part of why I prefer using 48/64-bit
instructions instead.
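
For reference, a sketch of the standard RISC-V instruction-length encoding,
which is what lets a prefixed instruction be fetched and trapped on as one
48-bit or 64-bit unit instead of as two instructions with hidden state in
between. The bit patterns are from the base ISA spec; the function name
`insn_length_bits` is just for this example, and how the prefix bits are laid
out inside the longer instruction is not modelled:

```python
def insn_length_bits(first_parcel: int) -> int:
    """Instruction length in bits, from the lowest 16-bit parcel alone."""
    if first_parcel & 0b11 != 0b11:
        return 16                      # bits [1:0] != 11: compressed
    if first_parcel & 0b11100 != 0b11100:
        return 32                      # bits [4:2] != 111: standard 32-bit
    if first_parcel & 0b111111 == 0b011111:
        return 48                      # bits [5:0] == 011111
    if first_parcel & 0b1111111 == 0b0111111:
        return 64                      # bits [6:0] == 0111111
    raise ValueError("80-bit or longer encoding, not handled in this sketch")
```

Because the length is known from the first parcel, the whole prefixed
instruction either completes or traps as a unit, so there is no
"prefix executed, prefixee not" state to save and restore.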

>
> >  32-bit ops have the 2 LSB bits that we can reassign.
>
>  ... which are set to 0b11.  it's better than nothing :)
>
> l.
>