[libre-riscv-dev] SV Prefix questions

Wed Jun 26 06:27:35 BST 2019

Since there has been lots of confusion about how vsetvl should work in SVPrefix
(mostly caused by multiple typos in the original SVPrefix spec), I'll try to
thoroughly explain exactly how I think vsetvl should be implemented by deriving
the pseudo-code from the V extension spec.

I'm not replying inline so that the premises come before the conclusion; an
inline reply would have put the conclusion first, which would make less sense. :)

Starting from the V extension spec:
https://github.com/riscv/riscv-v-spec/blob/e014590220e7b95b1dfa3c0665277ae1550828c9/v-spec.adoc#vsetvlivsetvl-instructions

> vsetvli rd, rs1, vtypei # rd = new vl, rs1 = AVL, vtypei = new vtype setting
>                         # if rs1 = x0, then use maximum vector length
> vsetvl  rd, rs1, rs2    # rd = new vl, rs1 = AVL, rs2 = new vtype value
>                         # if rs1 = x0, then use maximum vector length

so AVL refers to rs1 in the following description.

> The vsetvli instruction sets the vtype and vl CSRs based on its arguments,
> and writes the new value of vl into rd.
>
> The new vtype setting is encoded in the immediate fields of vsetvli and in
> the rs2 register for vsetvl.

SVPrefix doesn't need vtype because that information is provided in instruction
fields, so we remove rs2 and vtypei from vsetvl{i} (the immediate field will
later be reused).

> The vsetvl{i} instructions first set VLMAX according to the vtype argument,
> then set vl obeying the following constraints:
>
> 1. vl = AVL if AVL ≤ VLMAX
>
> 2. vl ≥ ceil(AVL / 2) if AVL < (2 * VLMAX)
>
> 3. vl = VLMAX if AVL ≥ (2 * VLMAX)
>
> 4. Deterministic on any given implementation for same input AVL and VLMAX values
>
> 5. These specific properties follow from the prior rules:
>
>    i. vl = 0 if AVL = 0
>
>    ii. vl > 0 if AVL > 0
>
>    iii. vl ≤ VLMAX
>
>    iv. vl ≤ AVL
>
>    v. a value read from vl when used as the AVL argument to vsetvl{i} results in the same value in vl, provided the resultant VLMAX equals the value of VLMAX at the time that vl was read
>
> Note:
> The vl setting rules are designed to be sufficiently strict to preserve vl
> behavior across register spills and context swaps for AVL ≤ VLMAX, yet
> flexible enough to enable implementations to improve vector lane utilization
> for AVL > VLMAX.
>
> For example, this permits an implementation to set vl = ceil(AVL / 2) for
> VLMAX < AVL < 2*VLMAX in order to evenly distribute work over the last two
> iterations of a stripmine loop. Requirement 2 ensures that the first
> stripmine iteration of reduction loops uses the largest vector length of all
> iterations, even in the case of AVL < 2*VLMAX. This allows software to avoid
> needing to explicitly calculate a running maximum of vector lengths observed
> during a stripmined loop.

Consistent with those constraints, the following algorithm is chosen:

// AVL is considered to be infinity for AVL = x0

if AVL <= VLMAX {
    // rule 1
    vl = AVL
} else if AVL < (2 * VLMAX) {
    // lower bound by rule 2; allows evenly distributing work
    // over last two iterations as mentioned in note; ceil is selected to make
    // vl be decreasing to simplify reduction as mentioned in note.
    vl = ceil(AVL / 2)
} else {
    // rule 3
    vl = VLMAX
}
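
As a sanity check, the selection above can be written as a small Rust function
and tested against rules 1-3 and the derived properties. This is only a sketch:
choose_vl and satisfies_rules are names made up for illustration, and None is
used here as a stand-in for AVL = x0 (infinity).

```rust
/// Pick vl for a requested AVL; `None` models rs1 = x0 (AVL treated as infinite).
fn choose_vl(avl: Option<u64>, vlmax: u64) -> u64 {
    match avl {
        None => vlmax,
        // rule 1
        Some(avl) if avl <= vlmax => avl,
        // VLMAX < AVL < 2 * VLMAX, written so it cannot overflow: ceil(AVL / 2)
        Some(avl) if avl - vlmax < vlmax => avl / 2 + avl % 2,
        // rule 3
        Some(_) => vlmax,
    }
}

/// Check a candidate vl against rules 1-3 and derived properties i-iv.
fn satisfies_rules(avl: u64, vlmax: u64, vl: u64) -> bool {
    let ceil_half = avl / 2 + avl % 2;
    let rule1 = avl > vlmax || vl == avl;
    let rule2 = avl >= 2 * vlmax || vl >= ceil_half;
    let rule3 = avl < 2 * vlmax || vl == vlmax;
    let derived =
        (avl != 0 || vl == 0) && (avl == 0 || vl > 0) && vl <= vlmax && vl <= avl;
    rule1 && rule2 && rule3 && derived
}

fn main() {
    // Exhaustively check a small range of VLMAX and AVL values.
    for vlmax in 1..=64u64 {
        for avl in 0..=200u64 {
            let vl = choose_vl(Some(avl), vlmax);
            assert!(satisfies_rules(avl, vlmax, vl));
            // property v: feeding vl back in as AVL gives the same vl
            assert_eq!(choose_vl(Some(vl), vlmax), vl);
        }
    }
    println!("vl for AVL = 50, VLMAX = 48: {}", choose_vl(Some(50), 48));
}
```

Note that the rules leave some freedom for VLMAX < AVL < 2 * VLMAX (vl = VLMAX
would also pass the checker there); the function just pins down the one choice
described above.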

In SVPrefix, the compiler allocates registers that hold the backing storage for
all vectors used, hence the compiler knows the value of VLMAX at compile time
for all loops.

To avoid needing a separate instruction to set VLMAX for every loop, the unused
immediate field of vsetvli is used to encode VLMAX.
The final algorithm is as follows:

let mut regs = [0u64; 128];
let mut vl = 0;

// instruction fields:
let rd = get_rd_field();
let rs1 = get_rs1_field();
let vlmax = get_immed_field();

// handle illegal instruction decoding
if vlmax > XLEN {
    trap()
}

// calculate AVL
let avl;
if rs1 == 0 {
    // rs1 is x0, so set avl to be infinity
    avl = 10000 // or some other integer much larger than vlmax
} else {
    avl = regs[rs1]
}

// calculate VL
if avl <= vlmax {
    vl = avl
} else if avl < 2 * vlmax {
    // ceil(avl / 2), since integer div rounds down
    vl = (avl + 1) / 2
} else {
    // rule 3: clamp to the maximum vector length
    vl = vlmax
}

// write rd
if rd != 0 {
    // rd is not x0
    regs[rd] = vl
}
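
For anyone who wants to experiment, here is a rough runnable model of the
instruction in Rust. It is a sketch under assumptions, not a reference
implementation: Hart, IllegalInstruction and stripmine_lengths are hypothetical
names, XLEN is assumed to be 64, and "infinity" is modelled as u64::MAX. The
last branch follows rule 3 (vl = VLMAX).

```rust
const XLEN: u64 = 64; // assumption: RV64

#[derive(Debug, PartialEq)]
struct IllegalInstruction;

/// Hypothetical architectural state: 128 integer registers plus vl.
struct Hart {
    regs: [u64; 128],
    vl: u64,
}

impl Hart {
    fn new() -> Self {
        Hart { regs: [0; 128], vl: 0 }
    }

    /// Model of `svp.setvl rd, rs1, vlmax`.
    fn svp_setvl(&mut self, rd: usize, rs1: usize, vlmax: u64) -> Result<(), IllegalInstruction> {
        if vlmax > XLEN {
            return Err(IllegalInstruction);
        }
        // rs1 = x0 requests the maximum vector length (AVL is "infinite").
        let avl = if rs1 == 0 { u64::MAX } else { self.regs[rs1] };
        self.vl = if avl <= vlmax {
            avl
        } else if avl - vlmax < vlmax {
            // avl < 2 * vlmax without overflow; ceil(avl / 2)
            avl / 2 + avl % 2
        } else {
            vlmax
        };
        // x0 is never written
        if rd != 0 {
            self.regs[rd] = self.vl;
        }
        Ok(())
    }
}

/// Run the setvl part of a stripmine loop and record vl for each iteration.
fn stripmine_lengths(count: u64, vlmax: u64) -> Vec<u64> {
    let mut hart = Hart::new();
    let (a2, a3) = (12, 13); // a2 = x12, a3 = x13 in the standard ABI
    hart.regs[a2] = count;
    let mut lengths = Vec::new();
    loop {
        hart.svp_setvl(a3, a2, vlmax).expect("vlmax must not exceed XLEN");
        if hart.regs[a3] == 0 {
            break;
        }
        lengths.push(hart.regs[a3]);
        hart.regs[a2] -= hart.regs[a3];
    }
    lengths
}

fn main() {
    // 100 elements with VLMAX = 48: prints [48, 26, 26]
    println!("{:?}", stripmine_lengths(100, 48));
}
```

The 48, 26, 26 sequence shows the point of the ceil(AVL / 2) branch: the
remaining 52 elements are split evenly instead of as 48 + 4, and the first
iteration still has the largest vl.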

To avoid confusion with the V extension's instruction, the mnemonic
svp.setvl is chosen.

svp.setvl is an I-type instruction.

I think that svp.* should be the prefix for all the new instructions added by
SVPrefix (similar to how the C extension adds things like c.addi or c.mv).

Now for some example code:

DAXPY:

C:
void daxpy(double *x, double *y, double a, size_t count)
{
    while(count > 0)
    {
        *y += a * *x;
        x++;
        y++;
        count--;
    }
}

assembly:
// this is not optimal code, but it works
daxpy:
    // x is a0, y is a1, a is fa0, count is a2
.loop:
    svp.setvl a3, a2, 48 // VLMAX is 48, since we have space for 48 registers
    beqz a3, .exit
    svp.fld.vs f32, 0(a0), ELTYPE=f64 // f32-79 is a vector of *x
    svp.fld.vs f80, 0(a1), ELTYPE=f64 // f80-127 is a vector of *y
    svp.fmadd.vvsv f32, f32, fa0, f80, ELTYPE=f64 // f32.. = f32.. * fa0 + f80..
    svp.fst.vsd.vs f32, 0(a1), ELTYPE=f64 // store back to *y
    sub a2, a2, a3 // reduce count
    slli a3, a3, 3 // a3 is now size in bytes since sizeof(f64) == 1 << 3
    add a0, a0, a3 // update x
    add a1, a1, a3 // update y
    j .loop
.exit:
    ret
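
To convince yourself that the loop structure above covers every element exactly
once, the same control flow can be simulated in Rust and compared with the
scalar C loop. This is only an illustration: choose_vl, daxpy_scalar and
daxpy_stripmined are hypothetical helpers, the inner for loop stands in for the
vector instructions, and the multiply-add is done as a separate multiply and add
rather than a fused operation, so results match the scalar loop exactly.

```rust
/// vl selection as described for svp.setvl (AVL taken from a register).
fn choose_vl(avl: u64, vlmax: u64) -> u64 {
    if avl <= vlmax {
        avl
    } else if avl - vlmax < vlmax {
        avl / 2 + avl % 2 // ceil(avl / 2)
    } else {
        vlmax
    }
}

/// Direct translation of the C loop.
fn daxpy_scalar(x: &[f64], y: &mut [f64], a: f64) {
    for (xi, yi) in x.iter().zip(y.iter_mut()) {
        *yi += a * *xi;
    }
}

/// Same computation, structured like the assembly stripmine loop.
fn daxpy_stripmined(x: &[f64], y: &mut [f64], a: f64, vlmax: usize) {
    let mut count = x.len().min(y.len()); // a2
    let mut offset = 0; // plays the role of the advancing a0/a1 pointers
    loop {
        let vl = choose_vl(count as u64, vlmax as u64) as usize; // a3
        if vl == 0 {
            break; // beqz a3, .exit
        }
        // stand-in for the vector load / multiply-add / store
        for i in offset..offset + vl {
            y[i] = x[i] * a + y[i];
        }
        count -= vl; // sub a2, a2, a3
        offset += vl; // pointer updates
    }
}

fn main() {
    let x: Vec<f64> = (0..100).map(|i| i as f64).collect();
    let mut y_scalar = vec![1.0; 100];
    let mut y_vector = y_scalar.clone();
    daxpy_scalar(&x, &mut y_scalar, 2.5);
    daxpy_stripmined(&x, &mut y_vector, 2.5, 48);
    assert_eq!(y_scalar, y_vector);
    println!("stripmined result matches scalar result");
}
```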

Jacob