[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Thu Jan 31 20:44:34 GMT 2019

On Thu, Jan 31, 2019, 07:53 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Thu, Jan 31, 2019 at 8:09 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> > we could also use some method to encode sign/zero extension for scalar
> > vl-mul=1 results (have 2 vl-mul=1 encodings in vlp?).
>
>  unlike elwidth, zero/sign extending bit comes straight from the
> opcode, in all cases.  surprisingly, if ".W" it's sign-extend (i had
> to do a full audit for SV-orig).
>
I don't think that will work, for the same reason that we can't use OP/OP-32
as a bit of elwidth. If you want a sign-extending scalar xor.b:
xor.b.sss rd, rs1, rs2, len=1
rd = (int64_t)(int8_t)((uint8_t)rs1 ^ (uint8_t)rs2)
you would have to use the reserved encoding in OP-32 under your scheme.
So either we decide that using the reserved encodings in OP-32 is ok, in
which case we could equally use OP/OP-32 as one bit of elwidth, or we decide
that using the reserved encodings is not ok, in which case we should reserve
space in vlp for sign/zero extension of scalars with vl-mul=1 (recommended,
since we can save space by not needing separate encodings when vl-mul > 1).
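
A rough Python model of the xor.b.sss case above, just to make the two extension choices concrete. The helper names (sext, zext, xor_b_scalar) and the `signed` flag are made up for illustration and are not part of any SV or RISC-V encoding; the flag stands in for whatever bit ends up selecting the extension.

```python
MASK64 = (1 << 64) - 1  # registers are modelled as 64-bit unsigned patterns


def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a 64-bit pattern."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value & MASK64


def zext(value, bits):
    """Zero-extend the low `bits` bits of value."""
    return value & ((1 << bits) - 1)


def xor_b_scalar(rs1, rs2, signed):
    """Scalar 8-bit xor; `signed` is a hypothetical sign/zero-extend selector."""
    result = (rs1 ^ rs2) & 0xFF
    return sext(result, 8) if signed else zext(result, 8)


print(hex(xor_b_scalar(0x80, 0x01, signed=True)))
print(hex(xor_b_scalar(0x80, 0x01, signed=False)))
```

With signed=True this matches the (int64_t)(int8_t) cast in the expression above; with signed=False only the low byte survives.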

>
> > for scalar with vl-mul > 1:
> > packed like vector with same vl-mul and VL=1, but the padding from the end
> > of the vector to the end of the register should be filled with
> > zeros/sign-extended (not recommended)/nan-boxed:
> >
> > so:
> > li x12, 0x0123_4567_89AB_CDEF
> > li x16, 0x0182_0304
> > li x20, 0x1122_3344
> > add.b.sss x12, x16, x20, vl-mul=3
> > sets x12 to:
> > 0xA4_3648 for zero extension
> > 0xFFFF_FFFF_FFA4_3648 for sign extension (not recommended)
> > adding the 3 lsb bytes and sign or zero extending the result
> > note that x13 is not modified
> >
> > I think sign extension of vectors is too expensive (need to extend from
> > every byte)
>
> it really should not be that bad.
>
Ok. I still can't think of any reason to sign-extend, since we would almost
always be accessing it as a vector of vl-mul elements, so whether the padding
is sign-extended, zero-extended, or nan-boxed shouldn't matter. Not
implementing sign extension from every byte still saves both some gates and
some instruction encoding space (combining with vlp).

We may still need a lot of the gates for sign-extending conversions from
vectors of i8 to i16/i32/i64 vectors though.
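
For reference, a small Python sketch that reproduces the add.b.sss numbers from the quoted vl-mul=3 example. The function name and the sign_extend flag are illustrative only, and it assumes sign extension (if done at all) is taken from the top bit of the packed vl-mul-byte group.

```python
MASK64 = (1 << 64) - 1  # 64-bit register pattern


def add_b_packed(rs1, rs2, vl_mul, sign_extend=False):
    """Add the vl_mul least-significant bytes lane by lane, then fill the padding.

    Carries do not propagate between byte lanes. The padding above the
    packed group is zero-filled, or sign-filled if sign_extend is set.
    """
    packed = 0
    for lane in range(vl_mul):
        a = (rs1 >> (8 * lane)) & 0xFF
        b = (rs2 >> (8 * lane)) & 0xFF
        packed |= ((a + b) & 0xFF) << (8 * lane)
    width = 8 * vl_mul
    if sign_extend and packed & (1 << (width - 1)):
        packed |= MASK64 & ~((1 << width) - 1)
    return packed


x16 = 0x0182_0304
x20 = 0x1122_3344
print(hex(add_b_packed(x16, x20, vl_mul=3)))
print(hex(add_b_packed(x16, x20, vl_mul=3, sign_extend=True)))
```

The zero-extended case is just a mask on the packed result, while the sign-extended case needs to pick the sign bit from a position that depends on vl-mul, which is where the extra gates would go.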

>
> > so I recommend requiring zero extension/nan-boxing for vl-mul > 1.
>
> > basically it's as if vl-mul > 1 makes vectors-with-length-VL/scalars of
> > vectors with length vl-mul:
> > Pseudo LLVM IR:
> > <VL x <vl-mul x float>>
> >
> >
> > > Or, if that extra elwidth bit (or two) is needed.
> > >
> > Since we can't use OP/OP-32 as 1 bit of elwidth (because of possible future
> > standard extensions that use OP/OP-32 to change more than operation
> > bit-width), I think we will need 2 elwidth bits.
> >
> > >
> > > What I would like to advocate is that scalar regs not be altered, whether
> > > src or dest, from standard RV behaviour.
> > >
> > > And that it is Vector regs (when the reg prefix bit is 1) that have the
> > > altered width behaviour.
> > >
> > > So, a FP16 FADD of a prefixed-scalar to a vector would, if stored in a
> > > scalar x1-x31, result in NaN boxing to the full 64 bits, however if the
> > > dest was a Vector it would NOT be boxed, only the actual FP16 would go into
> > > the regfile, NOT setting an additional 48 bits to all 1s.
> > >
> > For vector rd, I agree that elements past VL should be left unchanged. It
> > will make it more difficult for register-renaming/tomasulo implementations
> > though since they will need to read from rd for unpredicated cases (they
> > need to read from rd for predicated cases anyway).
>
>  it was complicated as hell, however i managed to create a workable
> register scheme that did not involve overwriting (or even reading) of
> the end of a register file entry.
>
Ok.

>
>  it requires byte-level write-enable lines, dividing the register file
> into 32-bit banks, having pairs of those banks shared across 2
> 32-bit-wide Function Units, and a separate set of Function Units with
> 8/16-bit ALUs.
>
I think we should probably just use SIMD-like micro-ops for elwidth < 32,
since we can still pass a separate byte-write-mask. That way we can save on
FU count, saving both area and power. The extra energy required by the ALUs
should be negligible compared to the extra area, complexity, and power used
by having dedicated 8/16-bit FUs. The 32-bit FUs will just pass the elwidth
as another input to the ALU. If we have each 32-bit ALU write to a separate
register bank, as originally planned, we won't be able to repack 8/16-bit
operations in almost all cases anyway, since the active lanes would be the
lsb lanes. We would have to match an op with the lsb lanes predicated off
(uncommon) to be able to repack.
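
To illustrate what I mean by a SIMD-like micro-op, here is a behavioural Python sketch of a 32-bit add that takes elwidth as an extra input, plus a register-bank write gated by a 4-bit byte-write-mask. The names simd_add32 and write_bank are hypothetical and this says nothing about the actual gate-level design.

```python
def simd_add32(a, b, elwidth):
    """32-bit add split into lanes of `elwidth` bits (8, 16 or 32).

    Carries are cut at lane boundaries, so one ALU serves all widths.
    """
    lane_mask = (1 << elwidth) - 1
    out = 0
    for shift in range(0, 32, elwidth):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        out |= (lane_sum & lane_mask) << shift
    return out


def write_bank(old, new, byte_we):
    """Merge `new` into a 32-bit regfile word; bit i of byte_we enables byte i."""
    merged = old
    for i in range(4):
        if byte_we & (1 << i):
            byte_mask = 0xFF << (8 * i)
            merged = (merged & ~byte_mask) | (new & byte_mask)
    return merged


result = simd_add32(0x0182_0304, 0x1122_3344, elwidth=8)
# Only the three lsb bytes are active, so the top byte of the bank is kept.
print(hex(write_bank(0xDEAD_BEEF, result, byte_we=0b0111)))
```

Note that the active bytes here are the lsb bytes of the word, which is the case where repacking 8/16-bit operations doesn't help.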

We will still need 64-bit FUs unless you think implementing matched pairs
of 32-bit FUs is reasonable.

Jacob