[libre-riscv-dev] SV Prefix questions

Wed Jun 26 06:32:49 BST 2019

Note that the algorithm below is NOT the same one as in the original
SVPrefix spec (which is flawed due to multiple typos).

On Tue, Jun 25, 2019 at 10:27 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> Since there has been lots of confusion about how vsetvl should work in SVPrefix
> (mostly caused by multiple typos in the original SVPrefix spec), I'll try to
> thoroughly explain exactly how I think vsetvl should be implemented by deriving
> the pseudo-code from the V extension spec.
>
> I'm not replying inline so that the order makes more sense; otherwise I'd
> have the conclusion before the premise. :)
>
> Starting from the V extension spec:
> https://github.com/riscv/riscv-v-spec/blob/e014590220e7b95b1dfa3c0665277ae1550828c9/v-spec.adoc#vsetvlivsetvl-instructions
>
> > vsetvli rd, rs1, vtypei # rd = new vl, rs1 = AVL, vtypei = new vtype setting
> >                         # if rs1 = x0, then use maximum vector length
> > vsetvl  rd, rs1, rs2    # rd = new vl, rs1 = AVL, rs2 = new vtype value
> >                         # if rs1 = x0, then use maximum vector length
>
> so AVL refers to rs1 in the following description.
>
> > The vsetvli instruction sets the vtype and vl CSRs based on its arguments,
> > and writes the new value of vl into rd.
> >
> > The new vtype setting is encoded in the immediate fields of vsetvli and in
> > the rs2 register for vsetvl.
>
> SVPrefix doesn't need vtype, because that information is provided in
> instruction fields, so we remove rs2 and vtypei from vsetvl{i} (the freed
> immediate field will later be reused).
>
> > The vsetvl{i} instructions first set VLMAX according to the vtype argument,
> > then set vl obeying the following constraints:
> >
> > 1. vl = AVL if AVL ≤ VLMAX
> >
> > 2. vl ≥ ceil(AVL / 2) if AVL < (2 * VLMAX)
> >
> > 3. vl = VLMAX if AVL ≥ (2 * VLMAX)
> >
> > 4. Deterministic on any given implementation for same input AVL and VLMAX values
> >
> > 5. These specific properties follow from the prior rules:
> >
> >    i. vl = 0 if AVL = 0
> >
> >    ii. vl > 0 if AVL > 0
> >
> >    iii. vl ≤ VLMAX
> >
> >    iv. vl ≤ AVL
> >
> >    v. a value read from vl when used as the AVL argument to vsetvl{i} results in the same value in vl, provided the resultant VLMAX equals the value of VLMAX at the time that vl was read
> >
> > Note:
> > The vl setting rules are designed to be sufficiently strict to preserve vl
> > behavior across register spills and context swaps for AVL ≤ VLMAX, yet
> > flexible enough to enable implementations to improve vector lane utilization
> > for AVL > VLMAX.
> >
> > For example, this permits an implementation to set vl = ceil(AVL / 2) for
> > VLMAX < AVL < 2*VLMAX in order to evenly distribute work over the last two
> > iterations of a stripmine loop. Requirement 2 ensures that the first
> > stripmine iteration of reduction loops uses the largest vector length of all
> > iterations, even in the case of AVL < 2*VLMAX. This allows software to avoid
> > needing to explicitly calculate a running maximum of vector lengths observed
> > during a stripmined loop.
>
> Consistent with those constraints, the following algorithm is chosen:
>
> // AVL is considered to be infinity for AVL = x0
>
> if AVL <= VLMAX {
>     // rule 1
>     vl = AVL
> } else if AVL < (2 * VLMAX) {
>     // lower bound by rule 2; allows evenly distributing work
>     // over last two iterations as mentioned in note; ceil is selected to make
>     // vl be decreasing to simplify reduction as mentioned in note.
>     vl = ceil(AVL / 2)
> } else {
>     // rule 3
>     vl = VLMAX
> }
>
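
As a sanity check on the algorithm quoted above, here is a minimal Rust sketch
that brute-forces the V-spec constraints over small AVL and VLMAX values. The
names choose_vl and check_constraints are hypothetical (they appear in neither
spec), None is only a stand-in for "AVL is infinity" (rs1 = x0), and VLMAX is
assumed to be at least 1.

```rust
/// Chosen vl for a given AVL and VLMAX; `None` models AVL = infinity (rs1 = x0).
/// Hypothetical helper for illustration only.
fn choose_vl(avl: Option<u64>, vlmax: u64) -> u64 {
    match avl {
        None => vlmax,                           // infinite AVL falls under rule 3
        Some(a) if a <= vlmax => a,              // rule 1
        Some(a) if a < 2 * vlmax => (a + 1) / 2, // ceil(a / 2), satisfies rule 2
        Some(_) => vlmax,                        // rule 3
    }
}

/// Returns true if rules 1-3 and derived properties i-v hold for every
/// AVL in 0..=max_avl. Assumes vlmax >= 1.
fn check_constraints(vlmax: u64, max_avl: u64) -> bool {
    (0..=max_avl).all(|avl| {
        let vl = choose_vl(Some(avl), vlmax);
        let rule1 = avl > vlmax || vl == avl;
        let rule2 = avl >= 2 * vlmax || vl >= (avl + 1) / 2;
        let rule3 = avl < 2 * vlmax || vl == vlmax;
        // properties i-iv: vl is zero exactly when AVL is zero, and is bounded
        let derived = (avl == 0) == (vl == 0) && vl <= vlmax && vl <= avl;
        // property v: feeding vl back in as AVL leaves vl unchanged
        let stable = choose_vl(Some(vl), vlmax) == vl;
        rule1 && rule2 && rule3 && derived && stable
    })
}
```

Rule 4 (determinism) holds trivially, since choose_vl is a pure function of
its two arguments.
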
> In SVPrefix, the compiler allocates registers that hold the backing storage for
> all vectors used, hence the compiler knows the value of VLMAX at compile time
> for all loops.
>
> To avoid needing a separate instruction to set VLMAX for every loop, the unused
> immediate field of vsetvli is used to encode VLMAX.
> The final algorithm is as follows:
>
> let mut regs = [0u64; 128];
> let mut vl = 0;
>
> // instruction fields:
> let rd = get_rd_field();
> let rs1 = get_rs1_field();
> let vlmax = get_immed_field();
>
> // handle illegal instruction decoding
> if vlmax > XLEN {
>     trap()
> }
>
> // calculate AVL
> let avl;
> if rs1 == 0 {
>     // rs1 is x0, so set avl to be infinity
>     avl = 10000 // or some other integer much larger than vlmax
> } else {
>     avl = regs[rs1]
> }
>
> // calculate VL
> if avl <= vlmax {
>     vl = avl
> } else if avl < 2 * vlmax {
>     // ceil(avl / 2), since integer div rounds down
>     vl = (avl + 1) / 2
> } else {
>     vl = vlmax
> }
>
> // write rd
> if rd != 0 {
>     // rd is not x0
>     regs[rd] = vl
> }
>
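
For anyone who wants to experiment, the quoted pseudo-code can be turned into
a compilable function roughly as follows. This is only a sketch: the
svp_setvl signature, the IllegalInstruction type, XLEN = 64, and the use of
u64::MAX for "infinity" are my assumptions, not part of the proposal. The
final branch uses vlmax, as rule 3 requires.

```rust
/// Hypothetical trap cause used by this sketch.
#[derive(Debug, PartialEq)]
struct IllegalInstruction;

/// Assumption: RV64.
const XLEN: u64 = 64;

/// Models one svp.setvl: returns the new vl and writes it to regs[rd]
/// unless rd is x0. rd and rs1 are register numbers, vlmax is the immediate.
fn svp_setvl(
    regs: &mut [u64; 128],
    rd: usize,
    rs1: usize,
    vlmax: u64,
) -> Result<u64, IllegalInstruction> {
    // handle illegal instruction decoding
    if vlmax > XLEN {
        return Err(IllegalInstruction);
    }
    // rs1 = x0 means "AVL is infinity"; u64::MAX stands in for that here
    let avl = if rs1 == 0 { u64::MAX } else { regs[rs1] };
    let vl = if avl <= vlmax {
        avl
    } else if avl < 2 * vlmax {
        (avl + 1) / 2 // ceil(avl / 2); cannot overflow since avl < 2 * vlmax
    } else {
        vlmax
    };
    // x0 is never written
    if rd != 0 {
        regs[rd] = vl;
    }
    Ok(vl)
}
```

With regs[12] = 52 (a2 is x12 in the standard ABI), svp_setvl(&mut regs, 13,
12, 48) returns Ok(26) and leaves 26 in regs[13] (a3).
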
> To avoid confusion with the V extension's instruction, the mnemonic
> svp.setvl is chosen.
>
> svp.setvl is an I-type instruction.
>
> I think that svp.* should be the prefix for all the new instructions added by
> SVPrefix (similar to how the C extension adds things like c.addi or c.mv).
>
> Now for some example code:
>
> DAXPY:
>
> C:
> #include <stddef.h> // for size_t
>
> void daxpy(double *x, double *y, double a, size_t count)
> {
>     while(count > 0)
>     {
>         *y += a * *x;
>         x++;
>         y++;
>         count--;
>     }
> }
>
> assembly:
> // this is not optimal code, but it works
> daxpy:
>     // x is a0, y is a1, a is fa0, count is a2
> .loop:
>     svp.setvl a3, a2, 48 // VLMAX is 48, since we have space for 48 registers
>     beqz a3, .exit
>     svp.fld.vs f32, 0(a0), ELTYPE=f64 // f32-79 is a vector of *x
>     svp.fld.vs f80, 0(a1), ELTYPE=f64 // f80-127 is a vector of *y
>     svp.fmadd.vvsv f32, f32, fa0, f80, ELTYPE=f64 // f32.. = f32.. * fa0 + f80..
>     svp.fsd.vs f32, 0(a1), ELTYPE=f64 // store back to *y
>     sub a2, a2, a3 // reduce count
>     slli a3, a3, 3 // a3 is now size in bytes since sizeof(f64) == 1 << 3
>     add a0, a0, a3 // update x
>     add a1, a1, a3 // update y
>     j .loop
> .exit:
>     ret
>
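
To see which vl values the stripmine loop above would go through, here is a
small Rust model of it. It is a sketch under assumptions: setvl and
daxpy_stripmined are hypothetical names, slices replace the a0/a1 pointers,
and the inner for loop stands in for the vector load, fmadd, and store.

```rust
/// Same vl selection as the quoted svp.setvl algorithm (finite AVL only).
fn setvl(avl: usize, vlmax: usize) -> usize {
    if avl <= vlmax {
        avl
    } else if avl < 2 * vlmax {
        (avl + 1) / 2 // ceil(avl / 2)
    } else {
        vlmax
    }
}

/// Stripmined y[i] += a * x[i]; returns the vl used on each iteration.
fn daxpy_stripmined(x: &[f64], y: &mut [f64], a: f64, vlmax: usize) -> Vec<usize> {
    let mut count = x.len().min(y.len());
    let mut base = 0; // element offset, plays the role of the a0/a1 updates
    let mut vls = Vec::new();
    loop {
        let vl = setvl(count, vlmax); // svp.setvl a3, a2, 48
        if vl == 0 {
            break; // beqz a3, .exit
        }
        for i in base..base + vl {
            y[i] += a * x[i]; // models the vector load, fmadd, and store
        }
        count -= vl; // sub a2, a2, a3
        base += vl; // add a0/a1 (byte scaling omitted in this model)
        vls.push(vl);
    }
    vls
}
```

For count = 100 and VLMAX = 48 the model runs three iterations with vl = 48,
26, 26: the first iteration uses the largest vl, and the remaining 52 elements
are split evenly over the last two, as the note in the V spec describes.
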
> Jacob