[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Thu Jan 17 21:31:19 GMT 2019

On Thu, Jan 17, 2019 at 10:20 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> On Thu, Jan 17, 2019, 01:39 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > On Thu, Jan 17, 2019 at 6:12 AM Jacob Lifshay <programmerjake at gmail.com>

> If we are going to dedicate more than 1 bit to unpredicated/predicated, we
> could use one of the combinations to represent a scalar instruction:
> 00: scalar
> 01: vector unpredicated
> 10: vector predicated with pr0
> 11: vector predicated with pr1
> we can decide which registers pr0 and pr1 mean later

 indicating that the predicate is inverted saves an instruction (and a
register) and allows parallel predicated-SIMD "if then else"
constructs, through using the same predicate register for both the
then and the else.
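
 a minimal sketch (plain Python, not SV syntax) of why an invert bit
helps: one predicate mask drives both the "then" and the "else" halves.
the helper name predicated_op and the list-based "registers" are made
up purely for illustration.

```python
def predicated_op(dest, src, op, mask, invert=False):
    """Apply op to src[i] where the (optionally inverted) mask bit i is set."""
    out = list(dest)
    for i in range(len(src)):
        bit = (mask >> i) & 1
        if invert:
            bit ^= 1          # inverted predicate: act where the bit is clear
        if bit:
            out[i] = op(src[i])
    return out

x = [1, -2, 3, -4]

# build the predicate once: bit i set when element i is negative
mask = 0
for i, v in enumerate(x):
    if v < 0:
        mask |= 1 << i

# "then": negate the negative elements
y = predicated_op(x, x, lambda v: -v, mask)
# "else": double the others, reusing the same mask with invert set
y = predicated_op(y, x, lambda v: v * 2, mask, invert=True)
```

 without the invert bit, the "else" half would need an extra
instruction (and register) to compute the complemented mask first.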

> > > We need to ensure that we won't need to use the 48/64-bit "standard"
> > > instructions with SV for that to work. I think it will work better to have
> > > the same encoding represent the same instruction every time, allowing us to
> > > not need a pipeline flush each time we need the other instructions.
> >
> >  it won't be needed at all.  it's as if 48-bit "standard" instructions
> > became 49 bit (or 50 bit, or 51 bit).  you don't flush the pipeline
> > just because of that: you just insert the (hidden) bits into the
> > decode phase, just as if they had been loaded from the I-Cache.
> >
> >  there is absolutely no need - at all - to do a pipeline flush.  at all.
> >
> there is: when those extra instruction bits change, all following
> partially-decoded instructions need to be redecoded.

 ok i see where you're coming from: there's a dependency from the
(hidden) op-extender bits which are needed for the following
instruction.

 well... that's not that much different from (a) macro-op fusion or (b)
what we're trying to do here (48/64-bit instructions), the only
difference being that the state is carried over to the next instruction.

 if you mean, as it's a CSR, it would be necessary to write to the CSR
memory area then read it back: that's solved by having a copy of the
state in the decode phase, and the CSR just happens to get updated in
a pipeline phase a bit later.

 honestly it's no more complex than doing a variable-length
instruction decode (16/32/48/64).  multi-issue will be a bundle of
fun, but that's ok.

> > > You forgot that the standard FP instructions already have a 16/32/64/128
> > > bit selector field that we can use.
> >
> >  oink??  section 13.2 V20181221-Public-Review-draft "a new supported
> > format is added to the format field of *MOST* instructions".
> >
> >  that's new.
> >
> >  the caveat is: *most* instructions.  not all.  going through the F D
> > and Q sections, "width" is specified in all... it's just that it's
> > only S D H and Q.  we can.... probably get away with that.
> >
> We can probably add the missing instructions as part of SV since they are
> probably the bitcasting moves from f16 to i16 and similar.

 and i16 arithmetic ops.  funct3 is what specifies the width (in
effect), and there's no space for extra stuff.  they use an OP-IMM-64
and OP-64 in RV128, i guess we could do the same, except define them
to mean OP-IMM-16 and OP-16 instead... it means "goodbye 2 custom
opcodes" but that's ok, they're not safe to reuse anyway (there are
supposed to be 4 custom major opcodes available, but nobody uses all 4
because 2 are semi-reserved for RV128).

 have to think about that.

> >  does leave integer without an elwidth.... is that such a great loss?
> > mm.... i'm not so sure it is.
> >
> for integer instructions, they don't have 4 arg fma instructions, so we can
> use the bits for extending the 4th arg's register field as elwidth
> override.

 i think i'm with you.  see below (about alternative meanings).

> we may want to add scalar/vector compare-branches as well.

 hmm hmm... yeah.  what did i do there... it's complicated... you have
to set up a predicate register as a target, and you also need a
predicate for (possibly) masking out the compares.  i handled this by
associating one predicate with src1 and another with src2.

 i'm getting the general impression that the range of options here
(different meanings for C, different for 32-bit, different for branch)
means that, really, i think we need to just "store" the prefix bits
and have them be decoded by the *following* instruction decode.

 in this way, it would be possible for the prefix bits to be
interpreted *differently*... depending on the instruction.

> > > I'm proposing that we only allow a single prefix and for the encoding space
> > > that would be multiple prefixes in a row, we reassign it to other
> > > operations we will need.
> >
> >  you lost me :)  can you illustrate with an example?
> >
> if 0x1234 is a valid prefix and 0xABCD is a C instruction, then 0x1234
> 0xABCD means the prefixed version of 0xABCD, however, 0x1234 0x1234 0xABCD
> means something entirely independent of the previous 2, such as
> strided-ld/st.

 ok i'm with you, now.  interesting idea.
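
 as a sketch of that decode rule (0x1234 is just the placeholder value
from the example above, not a real encoding, and classify is a made-up
name):

```python
PREFIX = 0x1234  # placeholder prefix value from the example, not a real encoding

def classify(parcels):
    """Classify the start of a stream of 16-bit parcels.

    Returns (kind, parcel) where parcel is the instruction the kind applies to.
    """
    if parcels[0] != PREFIX:
        return ("plain", parcels[0])        # ordinary C instruction
    if parcels[1] != PREFIX:
        return ("prefixed", parcels[1])     # single prefix: prefixed version
    # two prefixes in a row: that encoding space is reassigned to an
    # independent operation (strided-ld/st in the example)
    return ("reassigned", parcels[2])
```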

> >  or, just have a state machine which reads C opcodes, sets up some
> > "state" that is cleared after the next-instruction-but-one.
> >
> that might work, though that state will have to be preserved across
> interrupts and context switches: since there are 2 instructions in the
> prefixed sequence, we need to be able to trap in the middle for things like
> ld/st page faults, and it's supervisor-visible that we've executed the
> prefix but not the prefixee.

 yyeahh... allowing that might not be a good idea: treating it "as if"
it was a 48/64-bit instruction (not allowing the trap to even occur)
might be sensible...

 although... honestly, the state would have to be preserved anyway.
except it now becomes... oh yep, got it, sorry took me a while to
catch on about context-switches :)

> That's part of why I prefer using 48/64-bit
> instructions instead.

 yep, that's much more explicit.  atomic decode required (and
guaranteed).   i concur.

 also, chances are high that that reserved C instruction will get
allocated to e.g. xBitManip.

 ok, so let's go back to the possibility of just storing the bits when
the 48-prefix 'b011111 occurs, that way we get 10 bits if there's one
prefix, 20 if 2 48-bit prefixes are used, 9 if a 64-bit prefix is
specified, 19 if a 64-bit prefix followed by a 48-bit prefix is used.

 then, leave 16 and 32-bit alone (as-is) and have the 9/10/19/20-bits
decoded by the part of the instruction decode engine that deals with
16/32 bit, as if the bits were attached to the 16/32-bit instruction.
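
 a rough sketch of the bit-counting above. it assumes the standard
RISC-V length markers in the low bits of a 16-bit parcel ('b011111 for
48-bit, 'b0111111 for 64-bit) and treats the remaining upper bits of
that parcel as the stored payload. the function names and the order in
which payloads are concatenated are assumptions, nothing is decided.

```python
def prefix_payload(parcel):
    """Return (nbits, payload) for a 16-bit prefix parcel, or None if not a prefix."""
    if parcel & 0x7F == 0b0111111:   # 64-bit length marker uses 7 bits
        return 9, parcel >> 7
    if parcel & 0x3F == 0b011111:    # 48-bit length marker uses 6 bits
        return 10, parcel >> 6
    return None

def collect_prefix_bits(parcels):
    """Accumulate payload bits from leading prefix parcels (assumed MSB-first)."""
    total, value = 0, 0
    for p in parcels:
        decoded = prefix_payload(p)
        if decoded is None:
            break                    # reached the 16/32-bit instruction itself
        nbits, payload = decoded
        value = (value << nbits) | payload
        total += nbits
    return total, value

P48 = (0x3FF << 6) | 0b011111        # 48-bit marker with all payload bits set
P64 = (0x1FF << 7) | 0b0111111       # 64-bit marker with all payload bits set
```

 collect_prefix_bits([P48]) gives 10 bits, [P48, P48] gives 20, [P64]
gives 9 and [P64, P48] gives 19, matching the 9/10/19/20 cases.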

 now we can do something similar to R-Type, S-Type (etc. etc.) in fact
we may even be able to just *use* the decode phase R/I/S/U-type to
interpret the extra 9/10/19/20-bits.

 let's drop elwidth for now.  overloading the RV128 opcodes (OP-64 /
OP-IMM-64, as above) as "OP-16" is a good way to go.  FP has the width
field 16/32/64/128.

 10-bit case (and 32-bit ops)
 ----------------------------

 2 for vector width: default / 2 / 3 / 4 (default means "use the CSR
VL which is usually set to 1 indicating scalar")

 R/I/S/B-type: 2 for rd, 2 for rs1/rs2.  that's 4.

 that leaves.... 6 bits for the predicate.  1 bit for invert, 5 bits
for x0-x31.  don't want a predicate? set it to "invert x0" which means
"predicate mask equals all 1s (~0)" which means "don't have
a predicate".  (i'm not hugely keen on the idea of restricting the
predicate register to 1 or 2 dedicated regs... nobody's explained why
it's a good idea to me yet!)
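
 a sketch of how that 6-bit field could decode. the bit layout (invert
in the top bit, register number in the low 5) and XLEN=64 are
assumptions for illustration only.

```python
XLEN = 64  # assumed register width for this sketch

def predicate_mask(field, regs):
    """Decode a 6-bit predicate field into an XLEN-bit mask.

    Assumed layout: bit 5 = invert, bits 4..0 = integer register number.
    """
    invert = (field >> 5) & 1
    regnum = field & 0x1F
    value = 0 if regnum == 0 else regs[regnum]   # x0 always reads as zero
    if invert:
        value = ~value
    return value & ((1 << XLEN) - 1)

regs = [0] * 32
regs[5] = 0b1010
ALL_ONES = (1 << XLEN) - 1
```

 here predicate_mask(0b100000, regs) is "invert x0", i.e. ALL_ONES
(every element enabled), while predicate_mask(5, regs) is just the
contents of x5.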

 however for branches, the 6 bits _could_ be split in half (or so) to
be able to specify the src predicate and dest predicate.

10-bit case (and 16-bit ops)
----------------------------

2 for vector width again: default / 2 / 3 / 4.

2 for rd, 2 for rs1/rs2, however depending on vector width, push it up
to the MSBs.  so VL=1, it would be 00NNnnn, for VL=2 0NNnnn0, VL=3/4,
NNnnn00.
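
a sketch of that "push up to the MSBs" rule, taking NN as the 2 prefix
bits and nnn as the raw 3-bit register field from the 16-bit
instruction. how this would interact with the usual RVC mapping of
3-bit fields onto x8-x15 isn't covered here, and the function name is
made up.

```python
def extended_regnum(nn, nnn, vl):
    """Form the 7-bit register number from 2 prefix bits and a 3-bit field.

    VL=1   -> 00NNnnn
    VL=2   -> 0NNnnn0
    VL=3/4 -> NNnnn00
    """
    base = ((nn & 0b11) << 3) | (nnn & 0b111)
    shift = {1: 0, 2: 1, 3: 2, 4: 2}[vl]   # larger VL: coarser, wider-reaching
    return base << shift
```

so with NN=0b11 and nnn=0b111 the register number is 31 for VL=1, 62
for VL=2 and 124 for VL=3 or 4.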

predicate... again... 6 bits.

those are basically... the same! ish.  question, do we want 3 bits for
VL? default / 2 / 3 / 4 / 5 / 6 / 7 / 8 ?  we have to drop one bit
from the predicate to do so.  is having a dedicated predicate register
a good idea?  i'm not keen on it, partly because nobody's explained to
me why it's done, but mostly because it knocks one register out of the
"32 standard assembly conventions".

l.