[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Thu Jan 31 15:44:02 GMT 2019

On Thu, Jan 31, 2019 at 8:09 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Wed, Jan 30, 2019, 23:18 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > Ok so moving on to scalar-vector, in SV original, a bit in the CSRs
> > specifies whether the register is scalar or vector, and 2 more bits specify
> > the elwidth override.
> >
> > When elwidth is overridden, even for scalar ONLY the parts of the physical
> > regfile up to the elwidth are read (or written).
> >
> > So if elwidth is 8, only the LSByte of the regfile record for that register
> > is read/written.
> >
> > If however elwidth is default, and LB is used, you get the standard
> > behaviour: 1 byte read but it is sign-extended to 64 bits (LBU is the
> > zero-extending form).
> >
> > We need some rules for SVprefix, in the extremely limited available bits.
> >
> > We so far agree that 1 bit be used as a prefix to regnums. 0 means scalar,
> > can't recall if that means x0-x31. 1 means vector, with bottom 2 bits being
> > 0 and next 5 bits being the rs/rd 5 bits.
> >
> I had had the scalar/vector bit inverted, but that doesn't matter. Scalar
> does mean x0-x31 or f0-f31. I think we should treat x0 (but not f0)
> specially to mean all zeros, so even if rs1 is a vector, rs1 being x0 would
> mean that the input was all zero.
>
> >
> > Elwidth to be taken from standard RV OP, no problem there.
> >
> > However we need to define whether the scalar elements should be zero/sign
> > extended or if they should be compressed together, and likewise for vector.
>
> I suggest compressed when vl-mul isn't 1, otherwise
> zero-extended/sign-extended/nan-boxed.
>
> so for scalar with vl-mul=1:
> i8: zero extended (u8 is more common than s8)
> i16: sign extended
> i32: sign extended
> i64: sign extended
> f16/f32/f64: nan-boxed
>
> we could also use some method to encode sign/zero extension for scalar
> vl-mul=1 results (have 2 vl-mul=1 encodings in vlp?).

 unlike elwidth, the zero/sign-extension choice comes straight from the
opcode, in all cases.  surprisingly, ".W" ops sign-extend (i had
to do a full audit for SV-orig).
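
for reference, a minimal Python sketch of the scalar vl-mul=1 table
quoted above (i8 zero-extended, i16/i32/i64 sign-extended, f16/f32/f64
nan-boxed). the helper names and the EXTEND table are made up for
illustration, XLEN=64 is assumed, and if the extension mode is taken
from the opcode instead then only the table lookup would change.

```python
XLEN = 64                      # assumption: RV64 register width
MASK = (1 << XLEN) - 1

def zero_extend(value, width):
    """Keep the low `width` bits, upper bits become 0."""
    return value & ((1 << width) - 1)

def sign_extend(value, width):
    """Replicate bit (width-1) into all upper bits of the register."""
    value &= (1 << width) - 1
    sign_bit = 1 << (width - 1)
    return ((value ^ sign_bit) - sign_bit) & MASK

def nan_box(value, width):
    """Keep the low `width` bits, upper bits become all 1s."""
    low_mask = (1 << width) - 1
    return (MASK & ~low_mask) | (value & low_mask)

# hypothetical lookup mirroring the quoted proposal, not a spec
EXTEND = {
    "i8":  (zero_extend, 8),
    "i16": (sign_extend, 16),
    "i32": (sign_extend, 32),
    "i64": (sign_extend, 64),
    "f16": (nan_box, 16),
    "f32": (nan_box, 32),
    "f64": (nan_box, 64),
}

def widen_scalar(value, kind):
    """Widen a narrow scalar result to a full XLEN-bit register value."""
    fn, width = EXTEND[kind]
    return fn(value, width)

print(hex(widen_scalar(0x80, "i8")))     # 0x80
print(hex(widen_scalar(0x8000, "i16")))  # 0xffffffffffff8000
print(hex(widen_scalar(0x3C00, "f16")))  # 0xffffffffffff3c00
```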

> for scalar with vl-mul > 1:
> packed like vector with same vl-mul and VL=1, but the padding from the end
> of the vector to the end of the register should be filled with
> zeros/sign-extended (not recommended)/nan-boxed:
>
> so:
> li x12, 0x0123_4567_89AB_CDEF
> li x16, 0x0182_0304
> li x20, 0x1122_3344
> add.b.sss x12, x16, x20, vl-mul=3
> sets x12 to:
> 0xA4_3648 for zero extension
> 0xFFFF_FFFF_FFA4_3648 for sign extension (not recommended)
> adding the 3 lsb bytes and sign or zero extending the result
> note that x13 is not modified
>
> I think sign extension of vectors is too expensive (need to extend from
> every byte)

it really should not be that bad.
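
the add.b.sss example quoted above can be checked with a short Python
sketch. assumptions: byte lanes are independent (no carry between
bytes), the padding up to 64 bits is either zeros or a sign-extension
of the whole packed result, and the function name is hypothetical.

```python
def add_bytes_packed(rs1, rs2, vl_mul, sign_extend=False, xlen=64):
    """Add the low `vl_mul` bytes of rs1 and rs2 lane by lane, then pad
    the packed result out to `xlen` bits."""
    result = 0
    for i in range(vl_mul):
        a = (rs1 >> (8 * i)) & 0xFF
        b = (rs2 >> (8 * i)) & 0xFF
        # each lane wraps at 8 bits; carries do not cross lanes
        result |= ((a + b) & 0xFF) << (8 * i)
    width = 8 * vl_mul
    if sign_extend and (result >> (width - 1)) & 1:
        # fill everything above the packed elements with 1s
        result |= ((1 << xlen) - 1) & ~((1 << width) - 1)
    return result

x16 = 0x0182_0304
x20 = 0x1122_3344
print(hex(add_bytes_packed(x16, x20, 3)))                    # 0xa43648
print(hex(add_bytes_packed(x16, x20, 3, sign_extend=True)))  # 0xffffffffffa43648
```

the old contents of x12 are replaced entirely in both cases, and
nothing here touches x13, matching the quoted note.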

> so I recommend requiring zero extension/nan-boxing for vl-mul >
> 1.

> basically it's as if vl-mul > 1 makes vectors-with-length-VL/scalars of
> vectors with length vl-mul:
> Pseudo LLVM IR:
> <VL x <vl-mul x float>>
>
>
> > Or, if that extra elwidth bit (or two) is needed.
> >
> Since we can't use OP/OP-32 as 1 bit of elwidth (because of possible future
> standard extensions that use OP/OP-32 to change more than operation
> bit-width), I think we will need 2 elwidth bits.
>
> >
> > What I would like to advocate is that scalar regs not be altered, whether
> > src or dest, from standard RV behaviour.
> >
> > And that it is Vector regs (when the reg prefix bit is 1) that have the
> > altered width behaviour.
> >
> > So, a FP16 FADD of a prefixed-scalar to a vector would, if stored in a
> > scalar x1-x31, result in NaN boxing to the full 64 bits, however if the
> > dest was a Vector it would NOT be boxed, only the actual FP16 would go into
> > the regfile, NOT setting an additional 48 bits to all 1s.
> >
> For vector rd, I agree that elements past VL should be left unchanged. It
> will make it more difficult for register-renaming/tomasulo implementations
> though since they will need to read from rd for unpredicated cases (they
> need to read from rd for predicated cases anyway).

 it was complicated as hell, however i managed to create a workable
register scheme that did not involve overwriting (or even reading) of
the end of a register file entry.

 it requires byte-level write-enable lines, dividing the register file
into 32-bit banks, having pairs of those banks shared across 2
32-bit-wide Function Units, and a separate set of Function Units with
8/16-bit ALUs.
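
 a rough Python model of what byte-level write-enables buy you, as a
sketch only: a narrow element goes into a 64-bit regfile entry without
the rest of the entry being read or overwritten by the Function Unit.
the function names and the flat 8-lane layout are assumptions for
illustration; banking and the shared Function Units are not modelled.

```python
def write_with_byte_enables(old_entry, new_data, byte_enable, nbytes=8):
    """Model one regfile write port: only byte lanes whose enable bit is
    set take the new data, every other lane keeps its old value."""
    result = old_entry
    for lane in range(nbytes):
        if (byte_enable >> lane) & 1:
            lane_mask = 0xFF << (8 * lane)
            result = (result & ~lane_mask) | (new_data & lane_mask)
    return result & ((1 << (8 * nbytes)) - 1)

def element_byte_enable(elwidth_bits, index):
    """Enable mask covering element `index` of width `elwidth_bits`."""
    nbytes = elwidth_bits // 8
    return ((1 << nbytes) - 1) << (nbytes * index)

entry = 0x0123_4567_89AB_CDEF
fp16_one = 0x3C00                      # FP16 encoding of 1.0

# vector destination: write element 1 only, no boxing of the other 48 bits
be = element_byte_enable(16, 1)        # 0b00001100
vec = write_with_byte_enables(entry, fp16_one << 16, be)
print(hex(vec))                        # 0x12345673c00cdef

# scalar destination: all lanes enabled, value NaN-boxed to 64 bits
boxed = 0xFFFF_FFFF_FFFF_0000 | fp16_one
sca = write_with_byte_enables(entry, boxed, 0xFF)
print(hex(sca))                        # 0xffffffffffff3c00
```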

l.