[libre-riscv-dev] [isa-dev] Re: SV / RVV, marking a register as VL.

Jacob Lifshay programmerjake at gmail.com
Mon Sep 2 13:04:17 BST 2019


On Mon, Sep 2, 2019 at 1:52 AM lkcl <luke.leighton at gmail.com> wrote:
>
> On Monday, September 2, 2019 at 3:32:01 AM UTC+1, Jacob Lifshay wrote:
>
>>
>> What about taking the vector/scalar propagation idea and just apply it
>> to normal instructions (outside of VBLOCK):
>
>
> i like it.  so let's do a comparison of the resource utilisation of each, and how it would work.
>
> the premise of the VBLOCK-Prefix idea: up to 2 SVPrefixes apply to the first (and second) instruction.  the prefixes do not specify the register *numbers* (unlike SVOrig), they specify "src1 is a scalar/vector" and the actual register number is picked up from the [first] instruction.
>
> reminder: the cascade basically says that any register number which is "marked" as vectorised from the first [and second] instruction, those register numbers *remain* marked as vectorised and, in subsequent instructions, if the same register number is further used (as a src) it *remains marked as a vector*.  furthermore, the result register from the subsequent instructions *also get marked as vectorised* and so on.  hence a "cascade".
>
> The Mill Architecture also implements this concept, so it is not a new idea.  The Mill however starts from LDs.  only "LD" instructions actually specify the "type" and bit-width of the operand!  all arithmetic operations are subsequently polymorphic (there is no ADD.W, there is only ADD.  there is no FADD, there is only ADD), and consequently the instruction encoding is extremely compact.  it's very cool.
>
>
> as branches [except back to the start] and function calls are prohibited within the VBLOCK, there are guarantees that the order in which the compiler can calculate which registers will cascade through is completely static.
>
> if the same idea was applied *outside* of a VBLOCK, there would be no such guarantee.  hmm, so that would need to be thought through.

I am proposing that the vector/scalar flags would cascade throughout
the whole function/multiple functions, which can be handled in the
compiler with a rather simple data-flow analysis. You could think of
it as tagging the registers with a vector/scalar flag.
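
A minimal sketch of that tagging, for straight-line code only. The
struct insn record, its field names and propagate_tags() are
hypothetical, and the rule "rd becomes a vector if any source is a
vector" is my reading of the cascade described above, not a settled
definition. A real pass would also have to merge tags where control
flow joins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_REGS 32

/* Hypothetical instruction record, for illustration only. */
struct insn {
    int rd, rs1, rs2;    /* register numbers; -1 means "operand unused" */
    bool has_prefix;     /* true if an SVPrefix explicitly sets rd's mode */
    bool prefix_vector;  /* the mode that prefix gives rd */
};

/* Walk a straight-line sequence and update the vector/scalar tag of
 * each destination register.  ASSUMPTION: without a prefix, rd is a
 * vector if any source register is currently tagged as a vector. */
void propagate_tags(const struct insn *code, size_t n, bool is_vector[NUM_REGS])
{
    for (size_t i = 0; i < n; i++) {
        const struct insn *in = &code[i];
        assert(in->rd < NUM_REGS && in->rs1 < NUM_REGS && in->rs2 < NUM_REGS);

        bool result = false;
        if (in->has_prefix) {
            result = in->prefix_vector;
        } else {
            if (in->rs1 >= 0) result = result || is_vector[in->rs1];
            if (in->rs2 >= 0) result = result || is_vector[in->rs2];
        }
        /* x0 is hard-wired to zero, so it is never tagged. */
        if (in->rd > 0) is_vector[in->rd] = result;
    }
}
```

Running this over two prefixed loads followed by an unprefixed add
leaves the add's destination tagged as a vector, while an unrelated
addi on a scalar register stays scalar.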
>
>
>> there would be a 128-bit CSR (64-bit on RVE) with a 4-bit field for
>> each register field encoding [1], tentatively named SVMODES. It is
>> allocated to normal CSRs in the same manner as the time CSR -- by
>> splitting into XLEN-sized chunks.
>
>
> oh hang on - 128/4 is 32, are you suggesting a table-entry *per register*? the way that VBLOCK-SVPrefix works is that the register numbers are taken from the following instructions, and cascade onwards.  which saves a huge amount of CSR space.
>
> i would be extremely surprised if more than 4 possibly 8 sets of prefixing was needed, given the cascading.  marking the *entire* set of registers in advance, there is no need for the cascade (at all).
>
> reading further below i am guessing you're probably not.  this would reduce the required size of the CSR significantly.
>
> let's assume a max of 4 regs: rd, rs1, rs2, rs3.  that would be 4x4=16 bits.  let's assume say... 4 sets of those (up to 4 instructions in sequence may be "marked").  that's 64 bits.  if you wanted to cover up to 8 instructions, yes that would be 128 bits.
>
> so the cost (number of instructions) needed to set up the CSRs would be:
>
> * li x3, immediate # 64-bit immediate: a LD?
> * CSRRW CSR1, x3
> * li x3, immediate # another 64-bit immediate LD
> * CSRRW CSR2, x3
>
> if LD is used for the load-immediate, that's still a whopping *128* bits worth of setup, and other aspects of SVPrefix (extending the register numbers beyond 32, being able to use different predicate registers on different instructions) still need to be included.
>
> by contrast, the VBLOCK prefix is only 16 bits, and whilst only up to 2 instructions may be "marked", only between 16 to 64 bits is used to do so, *and* it covers setting of VL/MVL, and targets multiple predicate registers, and so on.
>
>
>> [1]: not just register number, other extensions may have a register
>> field encoding to register number translation table, such as SVOrig.
>>
>> each field would be encoded as follows:
>> bits 0-1:
>>     00: SUBVL=1
>>     01: SUBVL=2
>>     10: SUBVL=3
>>     11: SUBVL=4
>
>
> mmm... SUBVL is supposed to be "global", however... what would be implied here is that SUBVL would be applied *per register*.  exactly how that would work would need to be thought through.
>
> [oh, ok, you cover it, below].
>
>>
>>
>> bits 2-3:
>>     00: scalar
>>     01: vector unpredicated
>>     1x: vector (predication -- TBD)
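
A rough sketch of packing and unpacking those 4-bit fields, in case it
helps to see the layout concretely. The names (struct svmodes,
svmodes_get, make_field and so on) are made up for illustration, and
the placement of field r at bits [4r, 4r+3] with the lowest chunk
first is an assumption; the proposal only fixes the layout inside each
field. Since predication is TBD, both "1x" encodings are treated as
one value here.

```c
#include <assert.h>
#include <stdint.h>

/* Vector mode held in bits 2-3 of a field.  The two "1x" encodings
 * are collapsed into one value because predication is still TBD. */
enum sv_mode { SV_SCALAR = 0, SV_VECTOR_UNPRED = 1, SV_VECTOR_PRED = 2 };

/* 128-bit SVMODES modelled as two 64-bit chunks (the RV64 split).
 * ASSUMPTION: field r lives at bits [4r, 4r+3], lowest chunk first. */
struct svmodes { uint64_t chunk[2]; };

/* Read the 4-bit field for register field encoding r. */
unsigned svmodes_get(const struct svmodes *s, unsigned r)
{
    assert(r < 32);
    return (unsigned)((s->chunk[r / 16] >> ((r % 16) * 4)) & 0xF);
}

/* Overwrite the 4-bit field for register field encoding r. */
void svmodes_set(struct svmodes *s, unsigned r, unsigned field)
{
    assert(r < 32);
    unsigned shift = (r % 16) * 4;
    uint64_t mask = (uint64_t)0xF << shift;
    s->chunk[r / 16] = (s->chunk[r / 16] & ~mask)
                     | ((uint64_t)(field & 0xF) << shift);
}

/* Build a field: bits 0-1 hold SUBVL-1, bits 2-3 hold the mode. */
unsigned make_field(unsigned subvl, enum sv_mode mode)
{
    assert(subvl >= 1 && subvl <= 4);
    unsigned mode_bits = (mode == SV_VECTOR_PRED) ? 2u : (unsigned)mode;
    return (mode_bits << 2) | (subvl - 1);
}

/* Decode SUBVL (1..4) from a field. */
unsigned field_subvl(unsigned field) { return (field & 0x3) + 1; }

/* Decode the vector mode from a field. */
enum sv_mode field_mode(unsigned field)
{
    unsigned m = (field >> 2) & 0x3;
    return m >= 2 ? SV_VECTOR_PRED : (enum sv_mode)m;
}
```

With this layout, "SUBVL=3, vector, unpredicated" for register field
encoding 8 is the nibble 0b0110 at bits 32-35 of the low chunk.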
>
>
> element-width overrides are also necessary (2 bits to specify default/8/16/32).  that would i feel be better than changing the semantics of SUBVL (from a global to a per-register).
>
>>
>> the 4-bit field corresponding to rd would be written with the vector
>> mode of the result of each instruction, which is calculated from the
>> instruction and the vector modes as follows:
>> 1. If the instruction specifies scalar/vector mode in the encoding
>> (like via SVPrefix), then the mode specified by the instruction is
>> used.
>> 2. Otherwise,
>>     a. the scalar/vector mode is calculated for each of the source
>> operands by reading from the SVMODES field corresponding to the
>> register field encoding for that source operand.
>
>
> ... and cascades from there.
>
>>
>>     b. If the SUBVL modes from the source operands don't match and the
>> instruction doesn't specifically handle differing SUBVL modes, then
>> trigger an illegal instruction trap (swizzle can handle different
>> SUBVL modes, most other instructions can't)
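
To make rules 1, 2a and 2b concrete, here is a hedged sketch of the
rd-field calculation. result_field() and its parameters are
hypothetical. The caller is assumed to have already read each source
field from SVMODES (rule 2a). Three things below are assumptions of
mine rather than part of the proposal: the result is a vector if any
source is a vector, an instruction with no register sources yields
scalar with SUBVL=1, and an instruction that handles mixed SUBVL
simply takes the first source's SUBVL as a placeholder. Predication is
left out because it is TBD.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIELD_SUBVL_MASK 0x3u  /* bits 0-1: SUBVL - 1 */
#define FIELD_MODE_MASK  0xCu  /* bits 2-3: scalar/vector mode */
#define FIELD_VECTOR     0x4u  /* mode 01: vector, unpredicated */

/* Compute the 4-bit field to write for rd.
 * Returns false where an illegal instruction trap would be raised. */
bool result_field(bool has_explicit, unsigned explicit_field,
                  const unsigned *src, size_t nsrc,
                  bool handles_mixed_subvl, unsigned *rd_field)
{
    assert(rd_field != NULL);

    if (has_explicit) {              /* rule 1: the encoding decides */
        *rd_field = explicit_field & 0xFu;
        return true;
    }

    unsigned subvl_bits = 0;         /* ASSUMPTION: SUBVL=1 with no sources */
    bool any_vector = false;
    for (size_t i = 0; i < nsrc; i++) {
        unsigned f = src[i] & 0xFu;  /* rule 2a: field of each source */
        if (i == 0)
            subvl_bits = f & FIELD_SUBVL_MASK;
        else if ((f & FIELD_SUBVL_MASK) != subvl_bits && !handles_mixed_subvl)
            return false;            /* rule 2b: SUBVL mismatch traps */
        if (f & FIELD_MODE_MASK)
            any_vector = true;
    }
    /* ASSUMPTION: vector if any source is a vector; predication omitted. */
    *rd_field = (any_vector ? FIELD_VECTOR : 0u) | subvl_bits;
    return true;
}
```

Two SUBVL=3 vector sources give a SUBVL=3 vector result, whereas a
SUBVL=3 source combined with a SUBVL=1 source traps unless the
instruction (such as swizzle) declares that it handles the mismatch.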
>
>
> this is where it gets wasteful.  the number of permutations that raise illegal instruction traps is so high that it suggests that the encoding is not a good one.  i feel it would be better to use the 2 bits for elwidth.

I think that SUBVL will actually be more useful there than elwidth,
though those both will be very useful. The majority of graphics
shaders use only i32/u32 and f32, whereas most of them use a range of
SUBVL values.

>
>>         2. The result predication is TBD (probably selected similarly
>> to SUBVL selection).
>
>
> twin-predication also needs to be thought through.  i think it would work though.
>
>>
>> There would be a separate SVMODES csr used for context switching, on a
>> trap, the old value would be saved (by switching which SVMODES csr is
>> used or by copying to a different csr), then the SVMODES csr would be
>> cleared to all zero (all registers scalar with SUBVL=1).
>
>
> 128 bits worth of context-saving...

not too bad since, if we design it right, saving/restoring SVMODES can
be skipped for most system calls, it would only need to be
saved/restored for context switches between processes.

>
>>
>>
>> on return from an exception, SVMODES would be restored (by copying
>> back from the saved csr or by switching which SVMODES csr is used).
>>
>> There would be a separate instruction that clears SVMODES to all zero,
>> to allow calling scalar code quickly.
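
As a sketch of the trap behaviour described above, using the "copy to
a different csr" option rather than switching CSRs. struct hart and
the three function names are invented for illustration, and nested
traps are not handled (one saved copy only).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 128-bit SVMODES modelled as two 64-bit chunks. */
struct svmodes { uint64_t chunk[2]; };

/* Hypothetical per-hart state: the live SVMODES plus one saved copy. */
struct hart {
    struct svmodes svmodes;        /* used by normal execution */
    struct svmodes saved_svmodes;  /* holds the old value during a trap */
};

/* On a trap: save the old value, then make every register scalar
 * with SUBVL=1 (all-zero fields). */
void trap_enter(struct hart *h)
{
    assert(h != NULL);
    h->saved_svmodes = h->svmodes;
    memset(&h->svmodes, 0, sizeof h->svmodes);
}

/* On return from the exception: restore the saved value. */
void trap_return(struct hart *h)
{
    assert(h != NULL);
    h->svmodes = h->saved_svmodes;
}

/* The separate "clear SVMODES" instruction, for calling scalar code. */
void svmodes_clear(struct hart *h)
{
    assert(h != NULL);
    memset(&h->svmodes, 0, sizeof h->svmodes);
}
```

The handler itself always starts from an all-scalar state, so scalar
trap code never needs to know SVMODES exists.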
>
>
> ok so here's where the VBLOCK concept has a clear advantage: that extra instruction is not needed.  once the VBLOCK context is exited, the tear-down is automatic.

Actually, I think VBLOCK not being able to work on more than a single
basic-block at a time is a disadvantage compared to SVMODES. The
SVMODES-clear instruction would mostly only be used when calling code
that is not SVMODES-aware.

>
>
>>
>> Since using SVMODES instead of all SVPrefix instructions just makes
>> the code smaller and faster, but doesn't allow using any more
>> instructions than before, it could be used like a more complicated way
>> to compress instructions and it would be possible to limit all SVMODES
>> handling to after the register allocator in the compiler, similar to
>> how RVC instructions can be substituted entirely in the assembler,
>> without the compiler needing to know (though knowing allows selecting
>> more optimal code at the expense of a more complex compiler).
>>
>
> this is the logic behind VBLOCK-SVPrefix.  a compiler could (conceivably) simply output SVPrefix instructions, and a second-phase optimiser simply goes through them, spots which registers share the same prefixes, works out how to cascade them and *replaces* multiple SVPrefix instructions with a cascading VBLOCK-SVPrefix instead.
>
> so comparing the two:
>
> * SVMODES uses CSRs, which is an inherent code-size penalty compared to VBLOCK-SVPrefix.

You're missing that SVMODES is changed by each instruction (basically,
SVMODES values follow values in registers), so it would basically only
need to be explicitly written on a context-switch. The rest of the
time, SVPrefix instructions would be used to enter vector mode.

Example:

add_loop:
setvl x0, a3 # only CSR-type instruction, the rest is done with SVPrefix
svp.lw x32(vector), (a0), SUBVL=3 # encoded using 8 in the register
# field; sets SVMODES[x8] to SUBVL=3, vector, unpredicated
loop:
beqz a2, loop_end
svp.lw x64(vector), (a1), SUBVL=3 # encoded using 16 in the register
# field; sets SVMODES[x16] to SUBVL=3, vector, unpredicated
add x32(vector), x32(vector), x64(vector), SUBVL=3 # encoded as add
# x8, x8, x16 (could use RVC instruction if register numbers fit); sets
# SVMODES[x8] to SUBVL=3, vector, unpredicated
addi a2, a2, -1
bnez a2, loop
loop_end:
svp.sw x32(vector), (a0), SUBVL=3 # encoded as sw x8, (a0) (could also
# use RVC instruction if register numbers fit)
ret # no tear-down instructions; following code just has to use an
# SVPrefix instruction if it uses an uninitialized register. Most of the
# time a register will be written to the first time it's used, so
# SVPrefix is not required for those instructions

equivalent C code:
void add_loop(int *a0, int *a1, int a2, int a3)
{
    int VL = a3;
    int x32[VL * 3]; // excuse my dynamic arrays
    int x64[VL * 3];
    for(int i = 0; i < VL * 3; i++)
        x32[i] = a0[i];
    while(a2 != 0)
    {
        for(int i = 0; i < VL * 3; i++)
            x64[i] = a1[i];
        for(int i = 0; i < VL * 3; i++)
            x32[i] += x64[i];
        a2--; // matches "addi a2, a2, -1" in the assembly
    }
    for(int i = 0; i < VL * 3; i++)
        a0[i] = x32[i];
}

> * SVMODES requires tear-down instructions (another code-size penalty)
>
> these are, for me, the "killers" if the focus is to be on reducing code size (and thus I-Cache usage and thus power consumption).  the overhead of the CSR setup sequence was why i came up with VBLOCK in the first place: VBLOCK-SVPrefix simply continues to take advantage of that opportunity.
>
> l.
>