[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Jacob Lifshay programmerjake at gmail.com
Fri Jan 18 00:05:12 GMT 2019


On Thu, Jan 17, 2019, 13:32 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Thu, Jan 17, 2019 at 10:20 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > On Thu, Jan 17, 2019, 01:39 Luke Kenneth Casson Leighton <lkcl at lkcl.net
> > wrote:
> >
> > > On Thu, Jan 17, 2019 at 6:12 AM Jacob Lifshay <programmerjake at gmail.com>
>
> > If we are going to dedicate more than 1 bit to unpredicated/predicated,
> > we could use one of the combinations to represent a scalar instruction:
> > 00: scalar
> > 01: vector unpredicated
> > 10: vector predicated with pr0
> > 11: vector predicated with pr1
> > we can decide which registers pr0 and pr1 mean later
>
>  indicating that the predicate is inverted saves an instruction (and a
> register) and allows parallel predicated-SIMD "if then else"
> constructs, through using the same predicate register for both the
> then and the else.
>
Yeah, I accidentally missed the inverted predicate case.
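
To spell out why the invert bit matters, here is a rough Python sketch of a
predicated "if then else" that uses one predicate register for both halves.
The function name, the lane layout, and the two lambdas are made up for
illustration and are not part of any proposed encoding:

```python
def predicated_apply(lanes, pred, invert, op):
    """Apply op to each lane whose (optionally inverted) predicate bit is set."""
    mask = ~pred if invert else pred
    return [op(x) if (mask >> i) & 1 else x for i, x in enumerate(lanes)]


pred = 0b0101                      # lanes 0 and 2 take the "then" path
data = [10, 20, 30, 40]

# "then" half: normal predicate
after_then = predicated_apply(data, pred, False, lambda x: x + 1)
# "else" half: same predicate register, invert bit set
result = predicated_apply(after_then, pred, True, lambda x: x * 2)
print(result)
```

No second register and no extra instruction to compute the complement is
needed, which is the saving Luke describes.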

>
> > > > We need to ensure that we won't need to use the 48/64-bit "standard"
> > > > instructions with SV for that to work. I think it will work better to
> > > > have the same encoding represent the same instruction every time,
> > > > allowing us to not need a pipeline flush each time we need the other
> > > > instructions.
> > >
> > >  it won't be needed at all.  it's as if 48-bit "standard" instructions
> > > became 49 bit (or 50 bit, or 51 bit).  you don't flush the pipeline
> > > just because of that: you just insert the (hidden) bits into the
> > > decode phase, just as if they had been loaded from the I-Cache.
> > >
> > >  there is absolutely no need - at all - to do a pipeline flush.  at
> > > all.
> > >
> > there is, when those extra instruction bits change, all following
> > partially-decoded instructions need to be redecoded.
>
>  ok i see where you're coming from: there's a dependency from the
> (hidden) op-extender bits which are needed for the following
> instruction.
>
>  well... that's not that much different from (a) macro-op fusion (b)
> what we're trying to do here (48/64-bit instructions), the only thing
> being that the state is carried over to the next instruction.
>
>  if you mean, as it's a CSR, it would be necessary to write to the CSR
> memory area then read it back: that's solved by having a copy of the
> state in the decode phase, and the CSR just happens to get updated on
> a pipeline phase a bit later.
>
>  honestly it's no more complex than doing a variable-length
> instruction decode (16/32/48/64).  multi-issue will be a bundle of
> fun, but that's ok.
>
> > > > You forgot that the standard FP instructions already have a
> > > > 16/32/64/128 bit selector field that we can use.
> > >
> > >  oink??  section 13.2 V20181221-Public-Review-draft "a new supported
> > > format is added to the format field of *MOST* instructions".
> > >
> > >  that's new.
> > >
> > >  the caveat is: *most* instructions.  not all.  going through the F D
> > > and Q sections, "width" is specified in all... it's just that it's
> > > only S D H and Q.  we can.... probably get away with that.
> > >
> > We can probably add the missing instructions as part of SV since they are
> > probably the bitcasting moves from f16 to i16 and similar.

 and i16 arithmetic ops.


We were going to use the 2-bit rs4 specifier's bits to specify
i8/i16/default (32 or 64 depending on if it's a W instruction) since funct3
specifies which op (add/sub/xor/etc.).

I think we should instead specify 1 bit as sign-extended/zero-extended and
the other as switching W/non-W from 32/64 to 8/16.
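
As a rough sketch of that split (in Python, with made-up names), assuming
RV64 and assuming the narrow bit maps W to 8 and non-W to 16, which is only
one possible reading of the sentence above:

```python
def element_width(is_w_instruction, narrow):
    """Width in bits selected by the W/non-W opcode plus the 'narrow' bit."""
    if narrow:
        return 8 if is_w_instruction else 16   # assumed mapping
    return 32 if is_w_instruction else 64      # RV64 default


def extend(value, width, signed):
    """Truncate value to width bits, then sign- or zero-extend it."""
    value &= (1 << width) - 1
    if signed and value >> (width - 1):
        value -= 1 << width
    return value


print(element_width(True, True), extend(0xFF, 8, True), extend(0xFF, 8, False))
```

The point is just that funct3 stays free to pick the operation, while the two
prefix bits together pick the element width and the extension.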

> funct3 is what specifies the width (in
> effect), and there's no space for extra stuff.  they use an OP-IMM-64
> and OP-64 in RV128, i guess we could do the same, except define them
> to mean OP-IMM-16 and OP-16 instead... it means "goodbye 2 custom
> opcodes" but that's ok, they're not safe to reuse anyway (there's
> supposed to be 4 major opcodes available, but nobody uses all 4
> because 2 are semi-reserved for RV128).
>
>  have to think about that.
>
>
> > >  does leave integer without an elwidth.... is that such a great loss?
> > > mm.... i'm not so sure it is.
> > >
> > for integer instructions, they don't have 4 arg fma instructions, so we
> > can use the bits for extending the 4th arg's register field as elwidth
> > override.
>
>  i think i'm with you.  see below (about alternative meanings).
>
>
> > we may want to add scalar/vector compare-branches as well.
>
>  hmm hmm... yeah.  what did i do there... it's complicated... you have
> to set up a predicate register as a target, and you also need a
> predicate for (possibly) masking out the compares.  i handled this by
> associating one predicate with src1 and another with src2.
>
I was thinking it might be better to have the prefix specify the dest
predicate using 1 bit from rd and the other rd bit be whether to branch on
all/any unmasked lanes being true.
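
A small Python sketch of the all/any decision, with hypothetical names. What
happens when no lanes are unmasked is not decided anywhere in this thread;
the sketch simply inherits Python's all()/any() behaviour for an empty list:

```python
def vector_branch_taken(compare_results, mask, require_all):
    """Decide a vector compare-branch from the unmasked lanes only."""
    active = [r for i, r in enumerate(compare_results) if (mask >> i) & 1]
    return all(active) if require_all else any(active)


results = [True, False, True]
print(vector_branch_taken(results, 0b101, True))    # lane 1 is masked out
print(vector_branch_taken(results, 0b111, True))
print(vector_branch_taken(results, 0b111, False))
```

The other rd bit would only select which predicate register receives the
per-lane compare results, so it is not modelled here.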

>
>  i'm getting the general impression that the range of options here
> (different meanings for C, different for 32-bit, different for branch)
> means that, really, i think we need to just "store" the prefix bits
> and have them be decoded by the *following* instruction decode.
>
If the prefix is a separate instruction, storing the prefix bits is
probably best.

>
>  in this way, it would be possible for the prefix bits to be
> interpreted *differently*... depending on the instruction.
>
> > > > I'm proposing that we only allow a single prefix and for the encoding
> > > > space that would be multiple prefixes in a row, we reassign it to other
> > > > operations we will need.
> > >
> > >  you lost me :)  can you illustrate with an example?
> > >
> > if 0x1234 is a valid prefix and 0xABCD is a C instruction, then 0x1234
> > 0xABCD means the prefixed version of 0xABCD, however, 0x1234 0x1234
> > 0xABCD means something entirely independent of the previous 2, such as
> > strided-ld/st.
>
>  ok i'm with you, now.  interesting idea.
>
> > >  or, just have a state machine which reads C opcodes, sets up some
> > > "state" that is cleared after the next-instruction-but-one.
> > >
> > that might work, though that state will have to be preserved across
> > interrupts and context switches, since there are 2 instructions in the
> > prefixed sequence, we need to be able to trap in the middle for things
> > like ld/st page faults and it's supervisor-visible that we've executed
> > the prefix but not the prefixee.
>
>  yyeahh... allowing that might not be a good idea: treating it "as if"
> it was a 48/64-bit instruction (not allowing the trap to even occur)
> might be sensible...
>
>  although... honestly, the state would have to be preserved anyway.
> except it now becomes... oh yep, got it, sorry took me a while to
> catch on about context-switches :)
>
> > That's part of why I prefer using 48/64-bit
> > instructions instead.
>
>  yep, that's much more explicit.  atomic decode required (and
> guaranteed).   i concur.
>
>  also, chances are high that that reserved C instruction will get
> allocated to e.g. xBitManip.
>
>
>  ok, so let's go back to the possibility of just storing the bits when
> the 48-prefix 'b011111 occurs, that way we get 10 bits if there's one
> prefix, 20 if 2 48-bit prefixes are used, 9 if a 64-bit prefix is
> specified, 19 if a 64-bit prefix followed by a 48-bit prefix is used.

I think it's a good idea to specify that we always use the N-bit prefix for
instructions that are N bits long since that way it follows the RISC-V
instruction length spec.
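
For reference, a minimal Python sketch of the standard length decode as I
understand it from the base ISA spec (the function name is made up; the
'b011111 pattern is the 48-bit marker Luke mentions above):

```python
def insn_length_bits(first_halfword):
    """Return the instruction length in bits from its lowest 16 bits.

    Returns None for the longer/reserved encodings not discussed here.
    """
    if (first_halfword & 0b11) != 0b11:
        return 16
    if (first_halfword & 0b11100) != 0b11100:
        return 32
    if (first_halfword & 0b111111) == 0b011111:
        return 48
    if (first_halfword & 0b1111111) == 0b0111111:
        return 64
    return None


for hw in (0x0001, 0x0013, 0x001F, 0x003F):
    print(hex(hw), insn_length_bits(hw))
```

If the prefixed instruction as a whole is N bits long and starts with the
N-bit marker, a decoder that only knows this table still steps over it
correctly.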

So, you could, if you chose, encode the 48-bit prefix around a 32-bit
prefix around a 16-bit instruction, but you couldn't encode more than one
of each length of prefix since then the instruction length portion of the
instruction would be incorrect.

However, I still think that we should only allow 1 prefix per instruction
to allow reallocating the multi-prefix encodings to other useful
instructions.

>
>  then, leave 16 and 32-bit alone (as-is) and have the 9/10/19/20-bits
> decoded by the part of the instruction decode engine that deals with
> 16/32 bit, as if the bits were attached to the 16/32-bit instruction.
>
>  now we can do something similar to R-Type, S-Type (etc. etc.) in fact
> we may even be able to just *use* the decode phase R/I/S/U-type to
> interpret the extra 9/10/19/20-bits.
>
>  let's drop elwidth for now.  overload OP-128 as "OP-16" is a good
> way.  FP has the width field 16/32/64/128.
>
We will still probably need 8-bit int vectors for video decode. They will
also be useful for accelerating strlen and memcpy and friends.

>
>  10 bit case (and 32-bit ops)
>  -----------
>
>  2 for vector width: default / 2 / 3 / 4 (default means "use the CSR
> VL which is usually set to 1 indicating scalar")
>
>  R/I/S/B-type: 2 for rd, 2 for rs1/rs2.  that's 4.
>
>  that leaves.... 6 bits for the predicate.  1 bit for invert, 5 bits
> for x0-x31.  don't want a predicate? set it to "invert x0" which means
> "predicate mask equals 0xffffffffffffffffffff" which means "don't have
> a predicate".  (i'm not hugely keen on the idea of restricting the
> predicate register to 1 or 2 dedicated regs... nobody's explained why
> it's a good idea to me yet!)
>
>  however for branches, the 6 bits _could_ be split in half (or so) to
> be able to specify the src predicate and dest predicate.
>
> 10 bit case (and 16-bit ops)
> -------------
>
> 2 for vector width again: default / 2 / 3 / 4.
>
> 2 for rd, 2 for rs1/rs2, however depending on vector width, push it up
> to the MSBs.  so VL=1, it would be 00NNnnn, for VL=2 0NNnnn0, VL=3/4,
> NNnnn00.
>
> predicate... again... 6 bits.
>
>
> those are basically... the same! ish.  question, do we want 3 bits for
> VL? default / 2 / 3 / 4 / 5 / 6 / 7 / 8 ?

I think we should have VL multipliers rather than VL and a bunch of fixed
lengths since then we can express float(vec1)/vec2/vec3/vec4 operations
from Vulkan without needing to change the VL register, since vec2/vec3/vec4
combined are probably more likely to occur than float.

Where we have VL scaled, I think we should replicate the predication bits
by the same amount:
VL=3
s1=0x5
f40-f46={a,b,c,d,e,f,g,h,i,j,k,l} where all vars are f32

fadd f40, f40, f40, pred=~s1, len=VL*4
does:
if(~s1 & 1)
{
    a += a;
    b += b;
    c += c;
    d += d;
}
if(~s1 & 2)
{
    e += e;
    f += f;
    g += g;
    h += h;
}
if(~s1 & 4)
{
    i += i;
    j += j;
    k += k;
    l += l;
}

fadd f40, f40, f40, pred=~s1, len=4
does:
if(~s1 & 1)
{
    a += a;
}
if(~s1 & 2)
{
    b += b;
}
if(~s1 & 4)
{
    c += c;
}
if(~s1 & 8)
{
    d += d;
}
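
The same two cases as a runnable Python sketch. The function and parameter
names are made up, and the fixed len=4 case is modelled as 4 predicate bits
with no replication, which is my reading of the second example:

```python
def predicated_self_add(values, pred, vl, multiplier=1):
    """Double each element whose predicate bit is set.

    Each of the vl predicate bits covers `multiplier` consecutive elements,
    i.e. the predicate bits are replicated by the length multiplier.
    """
    out = list(values)
    for i in range(vl * multiplier):
        if (pred >> (i // multiplier)) & 1:
            out[i] = values[i] + values[i]
    return out


s1 = 0x5
vals = [float(n) for n in range(1, 13)]   # stand-ins for a..l

# len=VL*4 with VL=3: only bit 1 of ~s1 is set, so only e..h are doubled
print(predicated_self_add(vals, ~s1, vl=3, multiplier=4))
# len=4: bits 1 and 3 of ~s1 are set, so only b and d are doubled
print(predicated_self_add(vals[:4], ~s1, vl=4))
```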

If we have 4 options, then we could have:
00: len=VL
01: len=VL*2
10: len=VL*3
11: len=VL*4

If we have 8 options, we could have:
000: len=1 (unspecified if we get scalar ops some other way, such as the
predicate field)
001: len=VL
010: len=VL*2
011: len=VL*3
100: len=VL*4
101: len=VL*8
110: len=VL*16
111: len=4 (for SIMD from legacy code, like WebAssembly)

For 16 options, we could have:
0x0: len=1 (unspecified if we get scalar ops some other way)
0x1: len=VL
0x2: len=VL*2
0x3: len=VL*3
0x4: len=VL*4
0x5: len=VL*5 (used in SHA1 and maybe SHA3)
0x6: len=VL*6 (2x3 and 3x2 matrix)
0x7: len=VL*25 (used in SHA3)
0x8: len=VL*8 (2x4 and 4x2 matrix and OpenCL vec8)
0x9: len=VL*9 (3x3 matrix)
0xA: unspecified
0xB: unspecified
0xC: len=VL*12 (3x4 and 4x3 matrix)
0xD: len=4 (legacy code)
0xE: len=8 (legacy code)
0xF: len=VL*16 (4x4 matrix and OpenCL vec16)
where we can assign the unspecified options later.
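
As a sketch of how the 16-option table could decode (Python, hypothetical
names; the two unspecified codes return None here, which is just a
placeholder and not a proposal for what they should do):

```python
# (kind, n): "scaled" means len = VL * n, "fixed" means len = n
LEN_TABLE = {
    0x0: ("fixed", 1),
    0x1: ("scaled", 1),
    0x2: ("scaled", 2),
    0x3: ("scaled", 3),
    0x4: ("scaled", 4),
    0x5: ("scaled", 5),
    0x6: ("scaled", 6),
    0x7: ("scaled", 25),
    0x8: ("scaled", 8),
    0x9: ("scaled", 9),
    0xC: ("scaled", 12),
    0xD: ("fixed", 4),
    0xE: ("fixed", 8),
    0xF: ("scaled", 16),
}


def decode_len(field, vl):
    """Map the 4-bit length field and the current VL to an element count."""
    entry = LEN_TABLE.get(field & 0xF)
    if entry is None:          # 0xA and 0xB are not assigned yet
        return None
    kind, n = entry
    return n if kind == "fixed" else vl * n


print(decode_len(0x3, 4), decode_len(0xD, 4), decode_len(0xA, 4))
```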

> we have to drop one bit
> from the predicate to do so.  is having a dedicated predicate register
> a good idea?  i'm not keen on it, mostly because nobody's explained to
> me why it's done.  i'm mostly not keen on it because it knocks one
> register out of the "32 standard assembly conventions".

Having one or a few registers be available as predicate sources saves
encode space. Both the V extension and AVX512 do that. The predicate
doesn't change often enough that we will need more than a few predicate
source registers. Using one particular register as the predicate source
doesn't mean it can't be used for other things when you don't need
predication.

For the Vulkan implementation, we are going to mostly have a single
predicate per basic block.

If you pick which registers to use properly, you can manipulate the
predicate using C instructions. I suggest picking x0, s1, a4, and a5, or
just x0 and s1. I am picking by limiting myself to the 8 registers
addressable by C instructions, and avoiding the lower-numbered argument
registers since they are likely to be needed to pass arguments. I forget
why I'm avoiding s0; I want to say there is an ABI conflict of some sort. I
think using something from both the caller-saved and callee-saved registers
is a good idea, hence s1/a4/a5. I picked s1 instead of a4/a5 since that way
we don't have to save/restore it when calling external code (like sinf,
memcpy, etc.).

>
> l.
>
