[Libre-soc-dev] memcpy optimization

Sat Dec 12 13:40:32 GMT 2020

let's go over the RVV ffirst strncpy assembler.  complete fn:

RVV version:

    strncpy:
        c.mv a3, a0               # Copy dst
    loop:
        setvli x0, a2, vint8    # Vectors of bytes.
        vlbff.v v1, (a1)        # Get src bytes
        vseq.vi v0, v1, 0       # Flag zero bytes
        vmfirst a4, v0          # Zero found?
        vmsif.v v0, v0          # Set mask up to and including zero byte.
        vsb.v v1, (a3), v0.t    # Write out bytes
        c.bgez a4, exit           # Done
        csrr t1, vl             # Get number of bytes fetched
        c.add a1, a1, t1          # Bump src pointer
        c.sub a2, a2, t1          # Decrement count.
        c.add a3, a3, t1          # Bump dst pointer
        c.bnez a2, loop           # Anymore?

    exit:
        c.ret

now let's go over specific lines.

        setvli x0, a2, vint8    # Vectors of bytes.

with a2 being e.g. 100,000, far more than fits in one vector, VL is set
to MAXVL.  this being RVV, MAXVL is hardcoded by the implementation.
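
a minimal python sketch of that clamping, assuming the simple "min" rule
and an illustrative MAXVL of 16 (both are assumptions for the sketch;
`setvl` is a made-up helper name, not an RVV API):

```python
MAXVL = 16  # assumption: real hardware fixes this value, 16 is illustrative


def setvl(avl, maxvl=MAXVL):
    """Return the vector length granted for a request of avl elements."""
    return min(avl, maxvl)


big = setvl(100_000)   # request far exceeds MAXVL, so VL is clamped
small = setvl(5)       # a short final iteration gets exactly what it asked for
print(big, small)
```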

        vlbff.v v1, (a1)        # Get src bytes

VL *may* be modified here to limit the LDs to only those that do not fault.

this is the *ONLY* time that VL will be so modified.
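
as a rough python model of the fail-first behaviour: a fault on element 0
traps as normal, a fault on any later element just truncates VL.  the
dict-as-memory and the name `load_ff` are assumptions of this sketch, not
anything from the spec:

```python
def load_ff(mem, addr, vl):
    """Fail-first load sketch: returns (data, new_vl).

    mem maps address -> byte.  A missing address stands in for a page
    fault (an assumption of this toy model).
    """
    if addr not in mem:
        # element 0 faulting is a genuine trap, not a truncation
        raise MemoryError("fault on element 0")
    data = []
    for i in range(vl):
        if addr + i not in mem:
            break  # fault on element i > 0: stop here, no trap
        data.append(mem[addr + i])
    return data, len(data)


# only 5 bytes are mapped, but 8 are requested
mem = {0x1000 + i: b for i, b in enumerate(b"hello")}
data, vl = load_ff(mem, 0x1000, 8)
print(vl, bytes(data))
```

the caller has to pick up the returned VL, which is the job the `csrr t1, vl`
does in the assembler.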

        vseq.vi v0, v1, 0       # Flag zero bytes

this is a vector test operation.  every byte is tested, "is it zero".
note, ONLY AND ALL ELEMENTS FROM 0 TO VL-1 are tested.  THIS DOES NOT
ALTER VL.

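        vmfirst a4, v0          # Zero found?

this is a "find first" operation: a4 gets the index of the first set bit
in v0, i.e. the position of the first zero byte, or -1 if there is none.
similar to a scalar count-leading-zeros, except over Vector mask elements
rather than the bits of a register.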

THIS DOES NOT ALTER VL.
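
a small python sketch of the vmfirst step, where the mask is a plain list
of booleans (an assumption of the sketch).  note that vl is only ever
read, never written:

```python
def vmfirst(mask, vl):
    """Index of the first set mask bit among elements 0..vl-1, else -1."""
    for i in range(vl):
        if mask[i]:
            return i
    return -1


data = list(b"abc\0xyz")
mask = [b == 0 for b in data]   # what the vseq.vi step produces
found = vmfirst(mask, len(data))
none = vmfirst([False] * 4, 4)
print(found, none)
```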

        vmsif.v v0, v0          # Set mask up to and including zero byte.

this is where the vector predicate mask is created.  it is all 1s
right up to (and including) where the first zero is detected.

THIS DOES NOT ALTER VL.
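
sketched in python (list-of-booleans mask again being an assumption), the
"set including first" step looks like this.  if no bit is set, every
element up to VL stays enabled:

```python
def vmsif(mask, vl):
    """Set-including-first: 1s up to and including the first set bit."""
    out = []
    seen = False
    for i in range(vl):
        out.append(not seen)   # still enabled until a set bit has gone by
        if mask[i]:
            seen = True
    return out


with_zero = vmsif([False, False, True, False], 4)
no_zero = vmsif([False, False, False], 3)
print(with_zero, no_zero)
```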

        vsb.v v1, (a3), v0.t    # Write out bytes

using the mask created above and using VL WHICH HAS NOT BEEN ALTERED
the ST operation writes out only those elements where the predicate
bit is 1.

VL is *STILL* not altered here.
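
a python sketch of the predicated ST, with a bytearray standing in for
memory (an assumption of the sketch).  the loop always runs over all VL
elements, it is the mask alone that suppresses the writes:

```python
def store_masked(mem, addr, data, mask, vl):
    """Write data[i] to mem[addr + i] only where mask[i] is set."""
    for i in range(vl):          # always walks 0..vl-1, vl is untouched
        if mask[i]:
            mem[addr + i] = data[i]


dst = bytearray(b"------")
data = list(b"ab\0zz")
mask = [True, True, True, False, False]   # up to and including the zero
store_masked(dst, 0, data, mask, 5)
print(bytes(dst))
```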

bizarrely it is the vmfirst result that decides whether the loop
continues: the bgez a4 exits as soon as a4 is anything other than -1.  the
basis being that if there was even one zero, you're on the last loop.

other oddities: VL needs to be re-read (CSR read) due to the
alteration by the ffirst LD.  there is no way round this, it is not
like VL can be made a standard INT reg (it could but the Dep Tracking
is hell).
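
putting the whole strncpy loop together as a python sketch.  MAXVL of 4,
the `readable` limit standing in for the first faulting address, and the
name `strncpy_sim` are all assumptions made for illustration.  the
returned trace records VL per iteration, and the only line that can
shrink it is the one modelling the ffirst LD:

```python
MAXVL = 4  # assumption: tiny so that the loop runs more than once


def strncpy_sim(dst, src, n, readable):
    """dst: bytearray, src: bytes, n: max count,
    readable: number of src bytes before the first faulting address."""
    s = d = 0
    vl_trace = []
    while n != 0:
        vl = min(n, MAXVL)                          # setvli
        if s >= readable:
            raise MemoryError("fault on element 0")
        vl = min(vl, readable - s)                  # vlbff.v: ONLY place VL shrinks
        v1 = list(src[s:s + vl])
        v0 = [b == 0 for b in v1]                   # vseq.vi
        a4 = v0.index(True) if True in v0 else -1   # vmfirst
        last = a4 if a4 >= 0 else vl
        v0 = [i <= last for i in range(vl)]         # vmsif
        for i in range(vl):                         # vsb.v with v0.t
            if v0[i]:
                dst[d + i] = v1[i]
        vl_trace.append(vl)
        if a4 >= 0:                                 # bgez a4, exit
            break
        s += vl                                     # csrr t1, vl then the adds
        n -= vl
        d += vl
    return vl_trace


src = b"hello\0junk"
dst = bytearray(b"-" * 12)
trace = strncpy_sim(dst, src, 12, len(src))
print(trace, bytes(dst))
```

in the second iteration VL is still 4, yet only two bytes get written:
that is the predicate mask at work, not VL.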

memcpy is therefore pretty much exactly the same with the predicate
mask detection and zero detection stripped out.

    memcpy:
        c.mv a3, a0               # Copy dst
    loop:
        setvli x0, a2, vint8    # Vectors of bytes.
        vlbff.v v1, (a1)        # Get src bytes
        vsb.v v1, (a3)          # Write out bytes
        csrr t1, vl             # Get number of bytes fetched
        c.add a1, a1, t1          # Bump src pointer
        c.sub a2, a2, t1          # Decrement count.
        c.add a3, a3, t1          # Bump dst pointer
        c.bnez a2, loop           # Anymore?

    exit:
        c.ret

the vseq, vmfirst and vmsif have gone, the ST has the predicate mask gone,
and the bgez a4 has gone with them: with no zero byte to look for, the
only exit is a2 reaching zero.  (a bgez on the value read from VL would
be no substitute: VL is never negative, so it would always exit after the
first pass.)

those are the *only* modifications.

yet, again, i repeat, again: *at no time* was VL altered by any of the
code so removed, in order to morph strncpy into memcpy.  it was
*predication* that truncated the ST in strncpy (*NOT A SECOND ALTERATION
OF VL*, which does not exist).

in both cases, strncpy and memcpy, the truncation of VL by the LD is
what makes the subsequent ST "safe" to perform.
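
a python sketch of the memcpy case, to show the ST simply following
whatever VL the LD left behind.  the toy page size, the on-demand
"page in" when element 0 faults, and the name `memcpy_sim` are all
assumptions of this model:

```python
MAXVL = 4
PAGE = 8  # assumption: toy page size


def memcpy_sim(dst, src, start, n, mapped):
    """Copy n bytes from src[start:] into dst.

    mapped is the set of src page numbers currently present.  A missing
    page is paged in by the (toy) trap handler when element 0 touches it.
    """
    s, d = start, 0
    trace = []
    while n != 0:
        vl = min(n, MAXVL)                      # setvli
        if s // PAGE not in mapped:
            mapped.add(s // PAGE)               # element 0 faulted: trap, map, retry
        ok = 0
        while ok < vl and (s + ok) // PAGE in mapped:
            ok += 1
        vl = ok                                 # vlbff.v truncates VL
        dst[d:d + vl] = src[s:s + vl]           # vsb.v: unpredicated, VL elements
        trace.append(vl)
        s += vl                                 # csrr t1, vl then the adds
        n -= vl
        d += vl
    return trace


src = bytes(range(16))
dst = bytearray(8)
trace = memcpy_sim(dst, src, 6, 8, {0})
print(trace, bytes(dst) == src[6:14])
```

the first pass is cut to 2 elements by the page boundary, and the ST
writes exactly those 2: no mask involved, only the VL from the LD.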

clear now?

l.