[Libre-soc-dev] memcpy optimization

Fri Dec 11 23:06:34 GMT 2020

On Fri, Dec 11, 2020, 14:25 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On 12/11/20, Jacob Lifshay <programmerjake at gmail.com> wrote:
> > On Fri, Dec 11, 2020, 11:20 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > wrote:
> >
> >> the general dynamic case [of memcpy], when either the count or alignment
> >> is *not* known, is however flat-out impossible to use 64 bit granularity:
> >> that's the seductive SIMD way.
> >
> >
> > agreed.
> >
> >> the only way to make dynamic general
> >> memcpy efficient is to use fail-on-first.
> >>
> >
> > no, fail on first is used when you are using a data-dependent loop count,
>
> again: this is incorrect.  it was a mistake for me to refer you to the
> strncpy example to illustrate that ffirst applies to memcpy.
>
> the end-of-string detection has *nothing to do with the LD*.
>
> i repeat, again: ffirst has NOTHING to do with the LD or with zero
> detection.
>
> look again at the assembly code.
>
> the zero detection *uses* the (new) VL.  the zero detection uses mask
> ops to find the zero point.
>
>
> > memcpy is data-independent (copies the same number of bytes no matter what
> > byte values it sees).
>
> you're misunderstanding how vectors work, and not listening.  VL is
> not a hard fixed quantity, *MVL* is the invariant quantity.
>

what I meant by data-invariance is that memcpy doesn't suddenly change size
because a byte is zero. VL doesn't change because a byte is zero --
data-invariant. VL getting changed by ffirst isn't because it hit a zero
byte, but because ffirst hit a page-fault.

>
> the fact that VL need not be exactly the requested amount (i.e. is
> modified by the LD) can be exploited to optimise subsequent LDs.
>

yes, but that only happens if there *is* an unmapped page. if there isn't
an unmapped page, you're still stuck with the bad alignment because the load
succeeded.
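
to make that concrete, here is a rough Python model (not SV assembly) of a
fail-on-first byte load. it assumes semantics like RISC-V's fault-only-first
loads: a fault on element 0 is a normal trap, a fault on any later element
just truncates VL. `PAGE`, `ffirst_load` and the set-of-mapped-pages
representation are made up for illustration.

```python
PAGE = 4096  # assumed page size, for illustration only


def ffirst_load(mapped_pages, addr, vl):
    """Return the VL left behind by a modelled fail-on-first byte load.

    mapped_pages is a set of mapped page numbers (hypothetical model).
    A fault on element 0 is a real trap, modelled as an exception; a
    fault on any later element truncates VL to the elements before it.
    """
    for i in range(vl):
        if (addr + i) // PAGE not in mapped_pages:
            if i == 0:
                raise MemoryError("fault on first element: normal trap")
            return i
    return vl


# Both pages mapped: the misaligned load succeeds in full and VL is unchanged.
print(ffirst_load({0, 1}, 4090, 64))  # 64

# Second page unmapped: VL is cut to the 6 bytes remaining in page 0,
# so the next iteration starts at the page-aligned address 4096.
print(ffirst_load({0}, 4090, 64))  # 6
```

only the second call ends up aligned afterwards; in the first one nothing
faulted, so nothing changed.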

> this is *hugely* beneficial to performance to have the loop
> miraculously self-align, because those non-aligned LDs are actually
> incredibly expensive at the hardware level.
>

yes, hence why I proposed the 3-argument setvl instruction in the previous
email -- we want good performance even if copying from/to already mapped
pages.

>
> in our implementation we need double the number of LDST Buffers to be
> able to cope with misalignment, and coping with those misalignments
> across page boundaries is going to get real hairy.
>

good thing SV already has a way to indicate that an instruction is
partially complete: vstart

>
> > if there's a page-fault (even if not using vector instructions at all)
> > either that's a sigsegv or invisible to user code, so memcpy doesn't use
> > fail-on-first.
>
> scalar code with byte quantities does single bytes which is hugely
> suboptimal and consequently yes no page fault hits in the middle of
> the LDs.
>
> > code (ignoring memcpy's return value):
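> > memcpy: # r3=dest, r4=src, r5=count
> >     setvl r6, r5, maxvl=64                # r6 = VL = min(count, 64)
> >     ld <vec>r64, (<scalar>r4), elwidth=1  # load VL bytes from src
> >     st <vec>r64, (<scalar>r3), elwidth=1  # store VL bytes to dest
> >     sub. r5, r5, r6                       # count -= VL, setting CR0
> >     add r3, r3, r6                        # dest += VL
> >     add r4, r4, r6                        # src += VL
> >     bne memcpy                            # loop while count != 0
> >     blr                                   # return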
>
> here however what is the max that VL can be... ah, up to 64.
>
> so there will be up to 8x 64 bit LDs in one hit.
>
> that means that the 8 LDs are very likely to fault.
>
> that in turn, because there are so many, results in an average of 4 64
> bit LDs being chucked out of the LDST Buffer (cancelled) due to a page
> fault and associated trap handling.
>
> that throwing page faults is SERIOUSLY suboptimal and if they are all
> misaligned the resource utilisation is absolutely dreadful.
>

yeah, but page faults are really slow anyways, just dropping ops instead of
using vstart which is designed for this will not cost very much more than a
page fault.

>
> so i repeat again: strncpy zero detection is *not* the driver behind
> the use of ffirst.  getting the parallel LDs to exclude misalignments
> (and other faulting) is the key driving factor behind why ffirst is
> used in strncpy.
>

> those exact same characteristics *also apply to memcpy and memset*.
>

not really, since we know the length ahead of time (VL, *not* maxvl) and
don't have to try to load 64 bytes only to find out we only should have
read 3 because that's where the null is. ffirst for strcpy makes it so we
can try to load all 64 bytes without causing a sigsegv, which then allows
us to check those bytes for zeros using a later instruction.
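
a rough Python model of that strcpy pattern (a sketch, not SV assembly): each
iteration speculatively asks for MAXVL bytes, the modelled ffirst load stops
short of the first unmapped page, and the zero check is a separate step on
whatever was loaded. `strcpy_model`, `PAGE` and the set-of-mapped-pages
representation are hypothetical names for illustration only.

```python
PAGE = 4096   # assumed page size, for illustration only
MAXVL = 64    # matches maxvl=64 in the assembly sketch


def strcpy_model(mem, mapped_pages, src):
    """Return the NUL-terminated string at mem[src], copied in chunks."""
    out = bytearray()
    while True:
        # Modelled ffirst load: request MAXVL bytes, but stop before the
        # first byte that lies in an unmapped page.
        vl = 0
        while vl < MAXVL and (src + vl) // PAGE in mapped_pages:
            vl += 1
        if vl == 0:
            raise MemoryError("first element faulted: genuine sigsegv")
        chunk = mem[src:src + vl]
        # Zero detection is a later, separate step that works on the new VL.
        nul = chunk.find(0)
        if nul >= 0:
            out += chunk[:nul + 1]
            return bytes(out)
        out += chunk
        src += vl


# "hi\0" ends exactly at the end of page 0 and page 1 is unmapped.
mem = bytearray(PAGE)
mem[4093:4096] = b"hi\x00"
print(strcpy_model(mem, {0}, 4093))  # b'hi\x00'
```

the load asked for 64 bytes but only 3 were mapped, and that is not an error
because the NUL turned out to be inside those 3.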

memcpy doesn't have that issue since we don't have to speculatively read to
avoid causing a sigsegv (different than page-fault) for memcpy to be
correct -- we just read up to the end and use setvl to stop reading at the
right spot. if it causes a sigsegv, then the scalar version would have also
sigsegv-ed so that's correct.
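
the same loop as the assembly, modelled in Python as a sketch (assuming setvl
simply gives VL = min(requested, maxvl); `setvl` and `memcpy_model` here are
illustrative helpers, not real APIs). the list of VLs it returns shows the
last iteration being trimmed by setvl rather than by any fault:

```python
MAXVL = 64  # matches maxvl=64 in the assembly sketch


def setvl(requested, maxvl=MAXVL):
    """Assumed setvl behaviour: VL = min(requested, maxvl)."""
    return min(requested, maxvl)


def memcpy_model(dest, src, count):
    """Copy count bytes from src to dest in VL-sized chunks.

    Returns the VL used on each iteration so the strip-mining is visible.
    """
    d = s = 0
    vls = []
    while count != 0:
        vl = setvl(count)                 # setvl r6, r5, maxvl=64
        dest[d:d + vl] = src[s:s + vl]    # ld + st of VL bytes
        count -= vl                       # sub. r5, r5, r6
        d += vl                           # add r3, r3, r6
        s += vl                           # add r4, r4, r6
        vls.append(vl)
    return vls


src = bytes(range(150))
dest = bytearray(150)
print(memcpy_model(dest, src, 150))  # [64, 64, 22]
print(bytes(dest) == src)            # True
```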

now, ffirst can help with alignment, but only if a page was swapped-out
(not that common for a lot of code), otherwise it loads the full VL and
doesn't change it.

Jacob