[libre-riscv-dev] uniform instruction format

Mon Jun 17 09:14:48 BST 2019

On Mon, Jun 17, 2019 at 8:38 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> >  no, that's completely different: it's not called gather/scatter:
> > that's MV.X, which we discussed already is likely needed as a separate
> > operation, which i had already noted about 15 months ago in the
> > original SV was a necessary additional opcode.

> having vectorized random-access loads/stores is called gather/scatter
> (or sometimes indexed load/store). what you were thinking of
> (load/store from successive addresses of the form a*n+b) is called a
> strided load/store.
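
A minimal sketch of the distinction drawn above, assuming a hypothetical flat, word-addressed memory (the function names and the memory model are illustrative only, not taken from any spec):

```python
# Hypothetical flat memory: a list of words, addressed by list index.
memory = list(range(100, 132))

def strided_load(mem, base, stride, vl):
    """Load vl elements from addresses base, base+stride, base+2*stride, ..."""
    return [mem[base + n * stride] for n in range(vl)]

def gather_load(mem, addresses):
    """Load one element per entry of an arbitrary address vector."""
    return [mem[a] for a in addresses]

# Strided: addresses 2, 5, 8, 11 follow the a*n+b pattern.
strided = strided_load(memory, base=2, stride=3, vl=4)

# Gather: addresses need no pattern, and may repeat.
gathered = gather_load(memory, [7, 0, 31, 7])
```

A strided load is the special case of a gather where the address vector happens to be an arithmetic progression.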

 erm.... ermermerm... ngggh so many different uses of the same words,
and i haven't looked at this for over 6 months.  i was thinking of
xBitManip's gather/scatter.

 in SV-original, going back over the spec, i wrote some pseudo-code:
 https://libre-riscv.org/simple_v_extension/specification/#load_store

 "vector" mode on the destination treats every destination-element as
a *separate* source base.

 "scalar" mode on the destination treats the (single) value in rd as a
srcbase, however it now goes into "strided" mode.

 although not a hugely efficient use of registers, when rd is in
"vector" mode, it can be used to provide all of the lsk modes, by
pre-computing a vector of base addresses that differ by a regular
amount (the stride).
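
 as a rough sketch of that pre-computation (the register file `ireg`, the memory model and both helper names are assumptions made for illustration, not part of the SV spec):

```python
memory = [10 * n for n in range(64)]   # hypothetical word-addressed memory
ireg = [0] * 32                        # hypothetical integer register file

def precompute_bases(ireg, first_reg, base, stride, vl):
    """Fill vl consecutive registers with base, base+stride, base+2*stride, ..."""
    for n in range(vl):
        ireg[first_reg + n] = base + n * stride

def load_vector_mode(mem, ireg, first_reg, vl):
    """'Vector' mode: each register element is its own separate source base."""
    return [mem[ireg[first_reg + n]] for n in range(vl)]

# Emulate a stride-5 load starting at address 4, using registers 8..11.
precompute_bases(ireg, first_reg=8, base=4, stride=5, vl=4)
loaded = load_vector_mode(memory, ireg, first_reg=8, vl=4)
```

 the cost is visible here: four registers are consumed just to hold addresses that a single base plus a stride could describe.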

 now that i'm looking at it, i'm quite nervous of the lsk "stride
specified by a register".  that means that there's now *five*
source/destination registers, i think:

 * rd (srcbase)
 * rs
 * pred-rs
 * pred-rd
 * stride-x8

 which is an awful lot of registers to jam into the Memory Dependency
Matrix, particularly when it comes to doing multi-issue (which creates
transitive dependencies on *all five* registers)

 it's seriously starting to get CISC, basically.

 plus, stride-x8 cannot be represented in SV-Original - there's no room for it.

> MV.X is a register gather: it's like a memory
> gather, but a memory gather is strictly more powerful in that MV.X can
> be emulated by a memory gather but not vice-versa. MV.X is just an
> optimization on doing a vector store then a memory gather.
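
A small sketch of that equivalence, assuming a hypothetical scratch area in a word-addressed memory (all names here are illustrative, not actual opcodes or spec functions):

```python
def mv_x(regs, indices):
    """Register gather: dest[n] = regs[indices[n]]."""
    return [regs[i] for i in indices]

def mv_x_via_memory(regs, indices, mem, scratch_base):
    """Same result, emulated as a vector store followed by a memory gather."""
    # Step 1: vector store of the source registers to a scratch area.
    for n, value in enumerate(regs):
        mem[scratch_base + n] = value
    # Step 2: memory gather, using the indices as offsets into the scratch area.
    return [mem[scratch_base + i] for i in indices]

source = [11, 22, 33, 44]
indices = [3, 3, 0, 2]          # arbitrary, unordered, repeats allowed
memory = [0] * 16

direct = mv_x(source, indices)
emulated = mv_x_via_memory(source, indices, memory, scratch_base=4)
```

The reverse emulation is not possible because a register gather can only reach values already held in registers, whereas a memory gather can reach any address.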

 yehyeh.  ngggh.  which we'll need (separately) if swizzle can't be
fitted in.  btw about that, i looked at the khronos swizzle link: i'd
designed a 3D SHAPE permutation system (XYZ), and, if i'm reading the
khronos doc correctly, up to a 4D one is needed.   rats. that's up to
24 permutations (4x3x2) which either requires 5 bits or requires some
complex modulo 24 encoding.  messy.  Vectorised MV.X (aka VSELECT)
would be better.
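
 to put numbers on that, here is a sketch of one possible "modulo 24" style encoding: ranking each permutation as an integer 0..23 via its Lehmer code. this is only one way it could be done, chosen for illustration; the function names are hypothetical and nothing here is a proposed SV encoding.

```python
from itertools import permutations
from math import factorial

def bits_needed(count):
    """Number of bits required to give each of `count` values a distinct code."""
    return (count - 1).bit_length()

def rank_permutation(perm):
    """Map a permutation of 0..n-1 to an integer in 0..n!-1 (Lehmer code)."""
    n = len(perm)
    rank = 0
    for i in range(n):
        smaller = sum(1 for later in perm[i + 1:] if later < perm[i])
        rank += smaller * factorial(n - 1 - i)
    return rank

def unrank_permutation(rank, n):
    """Inverse of rank_permutation."""
    items = list(range(n))
    perm = []
    for i in range(n, 0, -1):
        idx, rank = divmod(rank, factorial(i - 1))
        perm.append(items.pop(idx))
    return perm

perms_3d = list(permutations(range(3)))   # XYZ
perms_4d = list(permutations(range(4)))   # XYZW
ranks_4d = sorted(rank_permutation(p) for p in perms_4d)
```

 the decode side needs repeated divide-and-modulo steps, which is the "messy" part compared with a plain index vector.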

> See also:
> intel's AVX512 gather/scatter instructions:
> https://www.felixcloutier.com/x86/vgatherqps:vgatherqpd
> https://www.felixcloutier.com/x86/vpscatterdd:vpscatterdq:vpscatterqd:vpscatterqq

 ye gods that's hard to understand.

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j←0 TO KL-1
    i←j * 32
    k←j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i]←
            MEM[BASE_ADDR + (VINDEX[k+63:k]) * SCALE + DISP]
            k1[j] ← 0
        ELSE *DEST[i+31:i]←remains unchanged*
    FI;
ENDFOR
k1[MAX_KL-1:KL] ← 0
DEST[MAXVL-1:VL/2] ← 0

that's very interesting.  they actually write (clear) the predicate
bit.  oo, i like it.  that makes repeating the instruction after a
trap real easy.  except, it also unfortunately makes *having* a
predicate a mandatory requirement of running the vector instructions.
yyyeah i remember now: that's why i didn't add that feature to
SV-Orig, because you'd have to have a full "predication state vector"
instead of just an index (VStart).
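
a simplified model of why clearing the predicate bit makes restart easy (a sketch only: the `PageFault` exception, the `faulting` parameter and the word-addressed memory are assumptions for illustration, and element widths are ignored):

```python
class PageFault(Exception):
    """Hypothetical stand-in for a trap raised mid-instruction."""

def masked_gather(mem, base, vindex, scale, disp, mask, dest, faulting=()):
    """Gather with a writemask that is cleared element by element."""
    for j in range(len(vindex)):
        if mask[j]:
            addr = base + vindex[j] * scale + disp
            if addr in faulting:
                raise PageFault(addr)   # mask[j] is still set at this point
            dest[j] = mem[addr]
            mask[j] = False             # this element is now complete

memory = list(range(1000, 1064))
vindex = [3, 0, 5, 1]
dest = [None] * 4
mask = [True] * 4

# First attempt: the third element (address 19) traps.
try:
    masked_gather(memory, 8, vindex, 2, 1, mask, dest, faulting={19})
    trapped = False
except PageFault:
    trapped = True
mask_after_trap = list(mask)

# After the trap is handled, re-issuing the identical instruction
# only touches the elements whose mask bits are still set.
masked_gather(memory, 8, vindex, 2, 1, mask, dest)
```

the mask itself is the progress record, which is exactly why a mask then has to exist for every such instruction.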

ok, so yes: although it's "offsets" (VINDEX), it's the same concept.

  for (int i = 0, int j = 0; i < VL && j < VL;):
    if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
    if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
    if (int_csr[rd].isvec)
      # indirect mode (multi mode)
      srcbase = ireg[rsv+i];
    else
      # unit stride mode
      srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes

SV-Orig treats the vector-dest as absolute BASE_ADDRs, AVX512 treats
the VINDEX as relative *to* BASE_ADDR.

SV-Orig can emulate AVX512 (however the opposite cannot be done)
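
a sketch of the emulation in that direction, under the assumption of a word-addressed memory and ignoring predication (function names are hypothetical): the relative form is turned into a vector of absolute addresses up front, then loaded element by element.

```python
memory = [7 * n for n in range(128)]   # hypothetical word-addressed memory

def avx512_style_gather(mem, base_addr, vindex, scale, disp):
    """Relative form: address = BASE_ADDR + VINDEX[j] * SCALE + DISP."""
    return [mem[base_addr + idx * scale + disp] for idx in vindex]

def sv_orig_style_gather(mem, srcbases):
    """Absolute form: every element of the vector is its own base address."""
    return [mem[addr] for addr in srcbases]

def to_absolute(base_addr, vindex, scale, disp):
    """Pre-compute the absolute address vector from the relative form."""
    return [base_addr + idx * scale + disp for idx in vindex]

vindex = [5, 0, 9, 2]
relative = avx512_style_gather(memory, 16, vindex, 4, 1)
absolute = sv_orig_style_gather(memory, to_absolute(16, vindex, 4, 1))
```

the pre-computation costs extra vector arithmetic, so this is emulation rather than a like-for-like replacement.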

> > my understanding of the conventions from the design of Vector
> > Instruction Sets gather/scatter is a linear gather and a linear
> > scatter: the sequence to be gathered is expressed in a
> > sequentially-increasing set of indices: the sequence to be scattered
> > is expressed in a sequentially-increasing set of indices.
> from what I understand (and any decent GPU requires) is that
> gather/scatter operations do not require the vector of addresses to
> have any patterns (other than maybe requiring the addresses to be
> aligned to a multiple of the element size).

 yehyeh, got it now - had to look up the spec.

> >
> > MV.X, which really requires a completely new opcode, permits an
> > arbitrary unordered set of indices.
> >
> > damnit it's been such a long time i can't even recall properly the
> > months and months of work i did on the original SV.  i *think* i found
> > a way to mark one of the registers. i'll have to look again.
> >
> >
> > ah.  ok.  right, i think i understand.  LLVM i believe uses the
> > concept "gather/scatter" incorrectly, assigning it *arbitrary*
> > (unordered) meaning.   through the mistaken naming (by calling it
> > gather/scatter instead of MV.X), confusion has arisen.
> actually, LLVM uses gather/scatter correctly. if llvm had been
> incorrect, why does x86's instruction set naming agree?

 misunderstanding - too many different uses of the same naming
(xBitManip) and it's been too long, i'd gotten confused.

l.