[libre-riscv-dev] [PATCH] D57504: RFC: Prototype & Roadmap for vector predication in LLVM

Mon Feb 3 03:07:20 GMT 2020

On Sunday, February 2, 2020, Jacob Lifshay via Phabricator
<reviews at reviews.llvm.org> wrote:
>
> programmerjake added a comment.

>
> Yes, we do (setvl has a immediate for max VL, which needs to be calculated by the register allocator or similar), though it can be bypassed by writing directly to the VL register.
>
> So, in that case, we should be able to use option #2 or #3, as long as the compiler doesn't write to VL by any means other than setvl.
>

so, the way that the ISA works is as jacob describes: setvl has an
immediate mode, it has one extra argument compared to RVV setvl: we
include the MAXVL immediate *in* the actual instruction.

whatever MAXVL is set to, that defines precisely and exactly the
maximum amount of the *scalar* register file that is to be "typecast"
to an array of vector elements (in RVV, MAXVL is the number of
"lanes".  we do not actually have lanes, we only have a scalar regfile
that does double-duty).

we also set the requirement that if you ask for a specific vl, then as
long as it is less than MAXVL, you *get* vl elements worth of
operations.

RVV permits the microarchitecture to return a DIFFERENT number of
vector elements to be processed than there are actual "lanes".  we do
not [because there *are* no "lanes"].

this provides assembly writers with 100% guarantees that if they ask
for 9 elements to be executed, they'll damn well *get* 9 elements
executed, and they have no need to read or test vl or do loops in
order to do sequences of elements up to 64 long, it is just one
instruction.  one instruction vs 5 to 7 instructions, it is easy math.
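
a minimal sketch of that guarantee, in python.  the function names, the
min() model of setvl, and the hypothetical `hw_grant` parameter
(standing in for whatever number the microarchitecture chooses to
return) are all assumptions made for illustration, not the ISA
specification:

```python
def sv_setvl(requested, maxvl):
    """Assumed model: only the MAXVL immediate can cap the request."""
    return min(requested, maxvl)


def rvv_style_strip_mine(requested, hw_grant):
    """Assumed model of a strip-mined loop: the hardware may grant fewer
    elements than were asked for, so software must loop and re-check vl."""
    chunks = []
    remaining = requested
    while remaining > 0:
        vl = min(remaining, hw_grant)  # hardware decides, software tests
        chunks.append(vl)
        remaining -= vl
    return chunks
```

with these models, sv_setvl(9, 64) hands back 9 in one step, whereas
rvv_style_strip_mine(9, 4) needs three passes of 4, 4 and 1 elements.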

the backend will actually perform a hardware for-loop issuing
individual predicated *scalar* operations, as if the vector
instruction had *actually been* a sequence of scalar instructions
rather than the one "vector" instruction.
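
the hardware for-loop can be sketched roughly as follows.  this is a
software model only: the `vector_add` name, the use of a python list
as the scalar regfile, and the layout of elements in consecutive
registers are assumptions for illustration.

```python
def vector_add(regfile, rd, rs1, rs2, vl, predicate):
    """Expand one 'vector' add into vl predicated *scalar* adds.

    regfile   : list of ints modelling the scalar register file
    rd/rs1/rs2: starting register numbers (elements assumed consecutive)
    predicate : integer bitmask, bit i enables element i
    """
    for i in range(vl):
        if (predicate >> i) & 1:
            # each iteration is an ordinary scalar operation
            regfile[rd + i] = regfile[rs1 + i] + regfile[rs2 + i]
    return regfile
```

elements whose predicate bit is 0 leave their destination register
untouched, exactly as if that scalar instruction had never been issued.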

we then use a multi-issue OoO engine and a SIMDifier to group as many
of those scalar operations into actual pipelines, each with their
associated individual predicate bit, as we have room for in the ALUs.

those predicate bits are of course in a register (we have no way to
pass them as an immediate) and that predicate has to be read from the
regfile.

but to do that right there and then at the *instruction issue* phase
is hugely problematic in an OoO multi-issue engine: you have to
completely freeze ALL further decodes until the read of the predicate
bits returns, which of course would itself interfere with the OoO
dependency management.  this is just how it is with an OoO engine when
you have missing information.

the alternative which works far better is to speculatively execute
absolutely every element, cast a "shadow" which prevents "result
commit" to the regfile, and when the predicate reg *has* been read,
release the shadow for every element where its predbit==1 and
terminate those element executions in progress where predbit==0.
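
as a rough illustration of the shadow scheme (a behavioural model
only: the function names and the "shadowed" / "commit" / "cancelled"
state labels are made up here, and real hardware does this with
dependency matrices rather than dictionaries):

```python
def issue_with_shadow(vl):
    """Issue every element speculatively; none may commit yet."""
    return [{"element": i, "state": "shadowed"} for i in range(vl)]


def resolve_predicate(in_flight, predicate):
    """Once the predicate register has been read, release the shadow
    where predbit==1 and cancel the operation where predbit==0."""
    for op in in_flight:
        bit = (predicate >> op["element"]) & 1
        op["state"] = "commit" if bit else "cancelled"
    return in_flight
```

the point of the model is that issue never waits on the predicate:
only the final commit-or-cancel decision does.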

this i summarise briefly at the following page
https://libre-riscv.org/3d_gpu/architecture/

therefore (apologies for having to give all that background material)

if we had to use masks on the end elements in cases where %evl was
greater than W (actual vector amount) in order to "mask out" the
"extra" elements (so as to avoid irreparable corruption of the
*scalar* regfile) then, because that predicate mask can only be passed
in via a register (not as an immediate), it would potentially severely
degrade performance by jamming up the OoO issue phase with operations
that we (as assembly writers) *know* are going to be cancelled, but
the issue engine does not and simply cannot know.

so please don't force %evl to be restricted to powers of two, for
example (and please don't assume that it's ok to set the predicate
mask high bits to mask out the non-power-of-two elements: this will
severely degrade performance).  let us have full explicit control over
its value, and please, really, it would be greatly appreciated if
other scenarios where we have to deploy that trick (%evl > W) could be
avoided.
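
to put rough numbers on the cost, here is a small sketch.  the helper
names are hypothetical, and counting one wasted issue slot per
masked-out element is a simplifying assumption, not a measurement of
the actual engine:

```python
def next_power_of_two(n):
    """Smallest power of two >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def tail_mask(w):
    """Predicate with the low w bits set, masking out everything above."""
    return (1 << w) - 1


def wasted_issue_slots(w, evl):
    """Element operations that get issued speculatively and then
    cancelled because evl exceeds the w elements actually wanted."""
    return max(evl - w, 0)
```

for 9 wanted elements, rounding %evl up to next_power_of_two(9) == 16
issues 7 element operations that tail_mask(9) will cancel, versus
none at all when %evl is simply set to 9.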

it has taken over a year to understand and design the engine and i
would really like to avoid major architectural changes at this stage.

thank you all for patiently reading such a deep architectural
explanation of hardware on a compiler mailing list.

l.