[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Jacob Lifshay programmerjake at gmail.com
Fri Feb 1 08:35:56 GMT 2019


so, I'm going to update the instruction prefixes doc. Should I make a copy
of the old proposal in git or is git history sufficient?

On Thu, Jan 31, 2019, 12:44 Jacob Lifshay <programmerjake at gmail.com> wrote:

> On Thu, Jan 31, 2019, 07:53 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
>> On Thu, Jan 31, 2019 at 8:09 AM Jacob Lifshay <programmerjake at gmail.com>
>> wrote:
>> > we could also use some method to encode sign/zero extension for scalar
>> > vl-mul=1 results (have 2 vl-mul=1 encodings in vlp?).
>>
>>  unlike elwidth, zero/sign extending bit comes straight from the
>> opcode, in all cases.  surprisingly, if ".W" it's sign-extend (i had
>> to do a full audit for SV-orig).
>>
> I don't think that will work for the same reason that we can't use
> OP/OP-32 as a bit of elwidth:
> if you want a sign-extending scalar xor.b:
> xor.b.sss rd, rs1, rs2, len=1;
> rd = (int64_t)(int8_t)((uint8_t)rs1 ^ (uint8_t)rs2)
> you would have to use the reserved encoding in OP-32 if using your scheme.
> So, either we decide that using the reserved encodings in OP-32 is ok, in
> which case we could equally use OP/OP-32 as one bit of elwidth, or using
> the reserved encodings is not ok, in which case we should reserve space in
> vlp for sign/zero extension of scalars with vl-mul=1 (recommended since we
> can save space by not needing separate encodings when vl-mul > 1).
>
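A minimal sketch of the sign-extending scalar xor.b semantics quoted above, written in Python. The helper names `sign_extend` and `xor_b_sss` are hypothetical and only model the cast expression `(int64_t)(int8_t)((uint8_t)rs1 ^ (uint8_t)rs2)`; they say nothing about which encoding ends up being used.

```python
MASK64 = (1 << 64) - 1

def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of value into a 64-bit register image."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value & MASK64

def xor_b_sss(rs1, rs2):
    """Hypothetical model: xor the low bytes, then sign-extend to 64 bits."""
    return sign_extend((rs1 & 0xFF) ^ (rs2 & 0xFF), 8)

# Bit 7 of the byte result is set, so the upper 56 bits become ones.
print(hex(xor_b_sss(0x80, 0x00)))
```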
>>
>> > for scalar with vl-mul > 1:
>> > packed like vector with same vl-mul and VL=1, but the padding from the
>> > end of the vector to the end of the register should be filled with
>> > zeros/sign-extended (not recommended)/nan-boxed:
>> >
>> > so:
>> > li x12, 0x0123_4567_89AB_CDEF
>> > li x16, 0x0182_0304
>> > li x20, 0x1122_3344
>> > add.b.sss x12, x16, x20, vl-mul=3
>> > sets x12 to:
>> > 0xA4_3648 for zero extension
>> > 0xFFFF_FFFF_FFA4_3648 for sign extension (not recommended)
>> > adding the 3 lsb bytes and sign or zero extending the result
>> > note that x13 is not modified
>> >
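The worked example above can be checked with a small Python model. This is only a sketch under the assumptions stated in the quoted text (byte lanes, vl-mul=3, padding filled by zero or sign extension); the function name `add_b_packed` and its parameters are hypothetical.

```python
MASK64 = (1 << 64) - 1

def add_b_packed(rs1, rs2, vl_mul, sign_extend=False):
    """Add the vl_mul least-significant byte lanes, then fill the padding."""
    result = 0
    for i in range(vl_mul):
        a = (rs1 >> (8 * i)) & 0xFF
        b = (rs2 >> (8 * i)) & 0xFF
        result |= ((a + b) & 0xFF) << (8 * i)  # each lane wraps on its own
    used_bits = 8 * vl_mul
    if sign_extend and (result >> (used_bits - 1)) & 1:
        # Fill everything above the packed lanes with ones.
        result |= MASK64 & ~((1 << used_bits) - 1)
    return result

x16 = 0x0182_0304
x20 = 0x1122_3344
print(hex(add_b_packed(x16, x20, vl_mul=3)))                    # zero extension
print(hex(add_b_packed(x16, x20, vl_mul=3, sign_extend=True)))  # sign extension
```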
>> > I think sign extension of vectors is too expensive (need to extend from
>> > every byte)
>>
>> it really should not be that bad.
>>
> Ok. I still can't think of any reason to sign-extend since we would be
> almost always accessing it as a vector of vl-mul elements so
> sign/zero/nan-boxing shouldn't matter. Not implementing sign extension from
> every byte still saves both some gates and instruction encoding space
> (combining with vlp).
>
> We may still need a lot of the gates for sign-extending conversions from
> vectors of i8 to i16/i32/i64 vectors though.
>
>>
>> > so I recommend requiring zero extension/nan-boxing for vl-mul > 1.
>>
>> > basically it's as if vl-mul > 1 makes vectors-with-length-VL/scalars of
>> > vectors with length vl-mul:
>> > Pseudo LLVM IR:
>> > <VL x <vl-mul x float>>
>> >
>> >
>> > > Or, if that extra elwidth bit (or two) is needed.
>> > >
>> > Since we can't use OP/OP-32 as 1 bit of elwidth (because of possible
>> > future standard extensions that use OP/OP-32 to change more than operation
>> > bit-width), I think we will need 2 elwidth bits.
>> >
>> > >
>> > > What I would like to advocate is that scalar regs not be altered,
>> > > whether src or dest, from standard RV behaviour.
>> > >
>> > > And that it is Vector regs (when the reg prefix bit is 1) that have
>> > > the altered width behaviour.
>> > >
>> > > So, a FP16 FADD of a prefixed-scalar to a vector would, if stored in a
>> > > scalar x1-x31, result in NaN boxing to the full 64 bits, however if
>> > > the dest was a Vector it would NOT be boxed, only the actual FP16 would
>> > > go into the regfile, NOT setting an additional 48 bits to all 1s.
>> > >
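A rough Python sketch of the two destination behaviours described above for an FP16 result in a 64-bit register. The function `write_fp16_result` and its `elem_index` parameter are hypothetical; the sketch assumes FP16 elements are packed 16 bits apart starting from the lsb.

```python
MASK64 = (1 << 64) - 1

def write_fp16_result(old_reg, fp16_bits, dest_is_vector, elem_index=0):
    """Return the new 64-bit register image after writing one FP16 result."""
    fp16_bits &= 0xFFFF
    if not dest_is_vector:
        # Scalar destination: NaN-box, i.e. set the upper 48 bits to all 1s.
        return (MASK64 & ~0xFFFF) | fp16_bits
    # Vector destination: replace only the 16 bits of the target element.
    shift = 16 * elem_index
    return (old_reg & ~(0xFFFF << shift) & MASK64) | (fp16_bits << shift)

old = 0x1111_2222_3333_4444
print(hex(write_fp16_result(old, 0x3C00, dest_is_vector=False)))
print(hex(write_fp16_result(old, 0x3C00, dest_is_vector=True)))
```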
>> > For vector rd, I agree that elements past VL should be left unchanged.
>> > It will make it more difficult for register-renaming/tomasulo
>> > implementations though since they will need to read from rd for
>> > unpredicated cases (they need to read from rd for predicated cases anyway).
>>
>>  it was complicated as hell, however i managed to create a workable
>> register scheme that did not involve overwriting (or even reading) of
>> the end of a register file entry.
>>
> Ok.
>
>>
>>  it requires byte-level write-enable lines, dividing the register file
>> into 32-bit banks, having pairs of those banks shared across 2
>> 32-bit-wide Function Units, and a separate set of Function Units with
>> 8/16-bit ALUs.
>>
> I think we should probably just use SIMD-like micro-ops for elwidth < 32
> since we can still pass a separate byte-write-mask. That way we can save on
> FU count, saving both area and power. The extra energy required by the ALUs
> should be negligible compared to the extra area, complexity, and power used
> by having dedicated 8/16-bit FUs. The 32-bit FUs will just pass the elwidth
> as another input to the ALU. If we have each 32-bit ALU write to a separate
> register bank, like originally planned, we won't be able to repack 8/16-bit
> operations in almost all cases anyway since the active lanes would be the
> lsb lanes. We would have to match an op with the lsb lanes predicated off
> (uncommon) to be able to repack.
>
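To illustrate the SIMD-like micro-op idea above, here is a behavioural Python sketch of a 32-bit add that takes elwidth as an extra input and applies a separate byte-write-mask at writeback. The name `simd_add32` and its argument layout are hypothetical; this models behaviour only and is not a hardware description.

```python
def simd_add32(a, b, elwidth, old_rd, byte_mask):
    """32-bit ALU add split into lanes of `elwidth` bits (8, 16 or 32).

    byte_mask has one bit per byte of the 32-bit result; bytes whose bit
    is clear keep the value from old_rd (byte-level write enable).
    """
    lane_mask = (1 << elwidth) - 1
    total = 0
    for shift in range(0, 32, elwidth):
        # Masking after the add keeps carries from crossing lane boundaries.
        lane = ((a >> shift) + (b >> shift)) & lane_mask
        total |= lane << shift
    result = 0
    for byte in range(4):
        src = total if (byte_mask >> byte) & 1 else old_rd
        result |= src & (0xFF << (8 * byte))
    return result

# Four byte lanes, only bytes 0 and 2 written back.
print(hex(simd_add32(0x01FF_0304, 0x0101_0101, 8, 0xAAAA_AAAA, 0b0101)))
```

The same function covers elwidth=16 and elwidth=32 by changing one argument, which is the point of sharing a single 32-bit FU rather than adding dedicated 8/16-bit ones.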
> We will still need 64-bit FUs unless you think implementing matched pairs
> of 32-bit FUs is reasonable.
>
> Jacob
>

