[Libre-soc-bugs] [Bug 558] gcc SV intrinsics concept

Mon Dec 28 23:18:01 GMT 2020

https://bugs.libre-soc.org/show_bug.cgi?id=558

--- Comment #15 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Alexandre Oliva from comment #9)
> There's nothing "just" about getting GCC to turn scalar insns into
> vector ones, I'm afraid.

ok.  so to explain: that is not what i am proposing.

not in an explicit way.

bear in mind that SV is effectively a built-in macro for-loop.  thus one
potential approach (not one i am recommending) is to literally chuck out a
batch of add r3, add r4, add r5, add r6 instructions from gcc *as scalars* and
have a post-analysis phase spot the patterns and vectorise them.

i do not recommend doing this: the only reason i mention it is so as to grok
the concept of the hardware macro for-loop better.
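
to make the "macro for-loop" idea concrete, here is a minimal sketch in plain
C of roughly what the hardware does with one prefixed add.  it is a toy model
only, under assumptions not stated above: one 64-bit element per register, all
three operands marked as vectors, and a 128-entry register file.  the names
regfile_t, scalar_add, sv_add and NUM_REGS are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 128u  /* assumed size of the integer register file */

/* toy register file: one 64-bit value per register */
typedef struct {
    uint64_t r[NUM_REGS];
} regfile_t;

/* plain scalar add: rt = ra + rb */
void scalar_add(regfile_t *rf, unsigned rt, unsigned ra, unsigned rb)
{
    rf->r[rt] = rf->r[ra] + rf->r[rb];
}

/* prefixed add, modelled as a for-loop that issues vl scalar adds
 * on consecutive register numbers (hypothetical model, not the spec) */
void sv_add(regfile_t *rf, unsigned rt, unsigned ra, unsigned rb, unsigned vl)
{
    /* the whole vector must fit inside the register file */
    assert(rt + vl <= NUM_REGS && ra + vl <= NUM_REGS && rb + vl <= NUM_REGS);
    for (unsigned i = 0; i < vl; i++)
        scalar_add(rf, rt + i, ra + i, rb + i);
}
```

calling sv_add(&rf, 24, 8, 16, 4) in this model behaves exactly like four
back-to-back scalar adds on r24-r27, which is the batch of scalar adds
described above.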

alexandre can you take a look at the example given in the reddit page? let me
find it.

> #include <riscv_vector.h>
> #include <stdio.h>
>
> void vec_add_rvv(int *a, int *b, int *c, size_t n) {
>   size_t vl;
>   vint32m2_t va, vb, vc;
>   for (;vl = vsetvl_e32m2 (n);n -= vl) {
>     vb = vle32_v_i32m2 (b);
>     vc = vle32_v_i32m2 (c);
>     va = vadd_vv_i32m2 (vb, vc);
>     vse32_v_i32m2 (a, va);
>     a += vl;
>     b += vl;
>     c += vl;
>   }
> }

this is the level i'd like to see supported.  bruce describes it as "barely
above machine code" level.

the compiler *does not* do actual vectorisation.

the compiler *does not* know anything about VL.

vsetvl_xxx are intrinsics that literally get converted, verbatim, to a single
assembly instruction.  gcc has been taught to recognise that vl is a size_t
returned from this "function".

except, due to the nature of SV it is instead:

#include <sv_vector.h>
#include <stdio.h>

void vec_add_svv(int *a, int *b, int *c, size_t n) {
   size_t vl;
   __attribute__((sv_vector)) uint32_t *va, *vb, *vc;
   PUSH_SV_CONTEXT(MAXVL=8)
   for (;vl = vsetvl_e32m2 (n);n -= vl) {
     vb = b;
     vc = c;
     // this issues an svp64 prefixed add
     *va = *vb + *vc; // looks scalar: isn't
     // this issues an svp64 prefixed mv
     *a = *va; // again: looks scalar.
     // these really are scalar
     // because they are not vector intrinsics
     a += vl;
     b += vl;
     c += vl;
   }
   POP_SV_CONTEXT()
 }

> Register allocation is largely driven by insn definitions and their
> requirements, 

in the above example it is MAXVL which tells gcc, when the add and the mv are
performed, that va, vb and vc have each had four 64-bit registers allocated to them.

why 4? because MAXVL=8 and va-c have been declared as uint32_t.  that means
elwidth=32 and consequently 2 elements fit into 64 bit, so MAXVL=8 takes up 4
regs.
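
the arithmetic can be written down as a tiny helper.  this is only a sketch of
the calculation, assuming 64-bit registers and element widths of 8, 16, 32 or
64 bits; sv_regs_needed and REG_BITS are hypothetical names, not anything that
exists in gcc.

```c
#include <assert.h>

#define REG_BITS 64u  /* width of one scalar register */

/* number of 64-bit registers needed to hold maxvl elements of
 * elwidth_bits each, rounding up to a whole register */
unsigned sv_regs_needed(unsigned maxvl, unsigned elwidth_bits)
{
    assert(elwidth_bits == 8 || elwidth_bits == 16 ||
           elwidth_bits == 32 || elwidth_bits == 64);
    unsigned total_bits = maxvl * elwidth_bits;
    return (total_bits + REG_BITS - 1u) / REG_BITS;
}
```

with MAXVL=8 and elwidth=32 this gives 8 * 32 / 64 = 4 registers, matching the
figure above; the same MAXVL at elwidth=64 would need 8.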

> and the definitions carry machine modes requirements that
> establish data size and what kind of data it is, which establish whether
> the data fits at all in a register file, and how many registers of that
> file are needed to hold an object of that machine mode.
> 
> The prefix also alters insn length, that influences computations about
> branch distances and constant pool placement, and scheduling (units,
> latency) is very significantly affected by the fact that vector-prefixed
> insns are actually multiple insns issued in sequence.

ok this, the instruction length alteration and the register allocation, yes,
this i get needs doing.

> 
> Though GCC vectorizers can turn certain loops and sequences of
> instructions into vectors, they largely rely on the available vector
> modes.

right.  i do *not* recommend going down the autovectorisation route or
"teaching" gcc about SV vectorisation at this point.  this will be months of
work.

take a look at how the RVV gcc intrinsics work: i'm pretty certain that you'll
find that they're a type of cheat, based effectively on function calls and data
types (hence the #include) that have little actual "depth"

> It doesn't seem unreasonable to define all of the available combinations
> of vector lengths and component modes as vector machine modes, and to
> define the vector versions of scalar insns as template insns,
> parameterized with the lengths and modes over each operand.

later.  not now.  that's phase 2 (requiring many man-months)

what i would like to see happen here is a similar "cheat", where the register
allocation is multiplied up by MAXVL when taken from the current Vector Stack
Context, yet aside from that the actual "modification" to gcc is absolute bare
minimum.

> That's quite some work, though not particularly challenging.
> 
> What I'm not sure about how to model is the vector length: though maxvl
> is a compile time constant, vl is dynamic, and it may vary; it needs a
> register of its own, for us to be able to represent setting it up,

yes.  if you look at how it's done in the RVV gcc patches, we need pretty much
exactly the same thing, and i do mean exactly.

if the rvv gcc setvl code cannot be near-verbatim copied there is something
wrong.

> and
> modifying it as a side effect, but it has to somehow be constrained to
> the compiler's notion of what the maxvl is for the insn, since that is
> what guides register allocation.

no, it is definitively MAXVL that specifies the register *allocation*, whilst
it is VL that determines precisely how many of those registers actually got
modified.
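
a plain C model of that split, as a sketch only: storage is always sized by
MAXVL, and the loop only ever touches the first VL elements of it.  the
assumptions here are that setvl returns min(n, MAXVL) and that MAXVL is 8;
model_setvl and model_vec_add are made-up names standing in for the real
intrinsic and the compiled loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAXVL 8u  /* compile-time constant, as in PUSH_SV_CONTEXT(MAXVL=8) */

/* assumed setvl behaviour: VL = min(requested length, MAXVL) */
size_t model_setvl(size_t n)
{
    return n < MAXVL ? n : MAXVL;
}

/* strip-mined add: a[i] = b[i] + c[i] for n elements */
void model_vec_add(uint32_t *a, const uint32_t *b, const uint32_t *c, size_t n)
{
    /* "register allocation": always MAXVL elements, whatever VL turns out to be */
    uint32_t va[MAXVL], vb[MAXVL], vc[MAXVL];
    size_t vl;

    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    for (; (vl = model_setvl(n)) != 0; n -= vl) {
        /* only the first vl elements are read or written */
        for (size_t i = 0; i < vl; i++) {
            vb[i] = b[i];
            vc[i] = c[i];
        }
        for (size_t i = 0; i < vl; i++)
            va[i] = vb[i] + vc[i];
        for (size_t i = 0; i < vl; i++)
            a[i] = va[i];
        /* these really are scalar pointer updates */
        a += vl;
        b += vl;
        c += vl;
    }
}
```

for n=11 the loop runs twice, with VL=8 then VL=3.  on the second pass the
allocation is still 8 elements wide but only 3 of them get modified.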

that number is most emphatically *not* - at this point, at this phase - gcc's
"problem"

*phase 2* which will involve autovectorisation is where VL *will* be gcc's
problem, because VL (and MAXVL) will be entirely hidden from the program
writer.

this is *not* phase 2.

> 
> There is a possibility of introducing vector-prefixed variants
> mechanically, as modified versions of existing scalar insns, using
> machinery similar to the way conditional insns are introduced on ARM,
> but I'd have to look into that to see whether it's really viable.

a better lead would be that x86 "rex" tagging that jacob mentioned? how x86
turned 32 bit regs into 64 bit regs with a "tag".

> Intrinsics can give direct access to features that the vectorizer can't
> (yet?) introduce on its own;

to be absolutely clear: i am *not* proposing modification of gcc's vectorizer
or getting to that phase in *any* way right now.

this is *specifically* an absolute bare minimum modification to gcc that is
just above assembly level.

> I suppose masks and twin predicates might
> be missing in general, though I'd have to look at how conditionals are
> dealt with in the vectorizer to tell for sure.

masks would be via the PUSH_SV_CONTEXT system, and via __attribute__((mask)).

a variable would be marked as being a predicate: the PUSH_SV_CONTEXT would name
that variable as a src or dest mask.

register allocation would pick r3, r10 or r30 (for int predication) and would
be responsible for pushing old contents into other regs.

other than that it would be the programmer's responsibility to think through
the consequences of the SV_CONTEXT having had twin predication applied.
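
for the simple single-mask case, the effect on the element loop can be
sketched like this.  it is a simplified model only: it assumes bit i of an
integer predicate register enables element i, and that masked-out destination
elements keep their old value.  twin predication (separate src and dest masks)
is deliberately not modelled, and model_pred_add is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* predicated element-wise add: dst[i] = x[i] + y[i] only where bit i
 * of mask is set; other destination elements are left untouched
 * (assumed non-zeroing behaviour) */
void model_pred_add(uint32_t *dst, const uint32_t *x, const uint32_t *y,
                    size_t vl, uint64_t mask)
{
    assert(vl <= 64);  /* one mask bit per element in a 64-bit register */
    for (size_t i = 0; i < vl; i++) {
        if ((mask >> i) & 1u)
            dst[i] = x[i] + y[i];
    }
}
```

with vl=4 and mask=0x5 only elements 0 and 2 are written, so anything the
programmer left in elements 1 and 3 of the destination survives.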

> Introducing intrinsics that map more or less directly to the
> corresponding versions of vector insns templates is probably, including
> the extra prefix-only operands, is not unreasonable.  It is quite some
> work but, again, not particularly challenging, and quite possibly
> automatable.

and something to do for a stage 2 grant proposal.

> 
> My suggested approach is to start out with one scalar insn and a handful
> of vector lengths, say 32-bit integer add with wrap-around, and try to
> get them used on vectors, as intrinsics and through the vectorizer, and
> from that try to estimate the amount of effort to cover more
> possibilities.

one instruction always works for me.  the hilarious thing is, though,
alexandre, that in this case i think you'll find that, with the abstraction
approach i am advocating, once you have added even one instruction absolutely
every other instruction follows.

ok it may be the case that once one 2-operand instruction has been vectorised,
every 2-op follows.

please understand though that i am *very specifically NOT*, in any way shape or
form advocating alteration, augmentation or involvement of the gcc vectorizer
or of autovectorization for this grant proposal.

this is *very specifically* a leveraging of gcc barely above machine code level
to see how far that gets.

the only really significant modifications will be to the Register Allocation
Table to get it to understand MAXVL multipliers.
