[libre-riscv-dev] [isa-dev] Re: SV / RVV, marking a register as VL.
luke.leighton at gmail.com
Mon Sep 2 15:27:15 BST 2019
On Monday, September 2, 2019 at 1:04:31 PM UTC+1, Jacob Lifshay wrote:
> > if the same idea was applied *outside* of a VBLOCK, there would be no
> > such guarantee. hmm, so that would need to be thought through.
> I am proposing that the vector/scalar flags would cascade throughout
> the whole function/multiple functions,
single function, no problem. multiple functions: now it's an order of
magnitude more complex. the cascade now becomes part of the contract for
the function API. that's an entire academic research project, all on its
we *need* to keep this simple and straightforward, jacob.
> which can be handled in the
> compiler with a rather simple data-flow analysis.
it's nowhere neeeear simple when branch/control is involved.
> You could think of
> it as tagging the registers with a vector/scalar flag.
yes. same principle for the Mill (except they tag with bit-width and other
ok, so let's say that there's a function which, on entry, the first thing
it does is a data-dependent branch.
what should the engine do, when cascading the registers *from following the
it's going to follow the branch on one call of the function, which will
cause a completely different set of registers to become cascade-tagged as
"vectorised", when compared to a subsequent call which goes the other
direction, isn't it?
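
to make that concrete, here's a minimal python sketch of an execution-order cascade. everything in it is hypothetical and for illustration only: the (dest, sources) instruction model, the register names, and the rule that "a destination becomes vector if any source is tagged vector" are assumptions, not anything taken from the SVMODES proposal itself.

```python
# hypothetical model: each executed instruction is (dest, [sources]).
# assumed cascade rule: dest is tagged "vector" if any source is tagged,
# otherwise a scalar result overwrites (clears) the tag on dest.
def cascade(trace, initial_vector):
    tagged = set(initial_vector)
    for dest, sources in trace:
        if any(src in tagged for src in sources):
            tagged.add(dest)
        else:
            tagged.discard(dest)
    return tagged

# the function starts with a data-dependent branch, so two calls of the
# same function execute two different instruction sequences.
taken_path = [("x12", ["x10", "x5"]), ("x13", ["x12"])]
not_taken_path = [("x14", ["x10", "x6"]), ("x15", ["x7"])]

# same entry state (only x10 tagged as vector) for both calls
tags_taken = cascade(taken_path, {"x10"})
tags_not_taken = cascade(not_taken_path, {"x10"})

print(sorted(tags_taken))
print(sorted(tags_not_taken))
```

same function, same entry tags, and the two calls leave different registers tagged on exit. anything that runs afterwards inherits whichever set happened to be produced.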
now what happens when there are six nested loops and multiple
data-dependent paths, as there is in the MP4 decode algorithm?
do you imagine that a compiler would be able to cope with that?
if it was successfully patched, what do you rate the chances of successful
upstream acceptance of such a patch?
i'd rate the chances of the success of such an effort at close to zero.
so let's instead go with a sequential ordering based on the numbering
(address) of the instructions: the cascade then occurs in *instruction*
order, as opposed to "execution" order.
now we have a hardware-level nightmare: it's now necessary to keep track of
the program counter, perhaps read a dozen or perhaps even hundreds of
instructions, tracking them all in order to create and maintain the cascade
the complexity is off the scale for one scenario (in hardware), and off the
scale in a different direction (in compiler technology) for the other.
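
a rough python sketch of the instruction-order variant, again under assumed rules (same hypothetical "dest is vector if any source is vector" cascade, a made-up base address of 0x100 and 4-byte instructions). the point is only to show how much has to be replayed before the tags at a given address are known.

```python
# hypothetical: program maps address -> (dest, [sources]).
# to know the tags in force at target_pc, everything at a lower address
# has to be replayed in address order, whether or not it was executed.
def tags_at(program, target_pc, initial_vector):
    tagged = set(initial_vector)
    scanned = 0
    for pc in sorted(program):
        if pc >= target_pc:
            break
        dest, sources = program[pc]
        scanned += 1
        if any(src in tagged for src in sources):
            tagged.add(dest)
        else:
            tagged.discard(dest)
    return tagged, scanned

BASE = 0x100   # assumed start address of the function
COUNT = 200    # assumed number of instructions before the target
program = {BASE + 4 * i: ("x11", ["x10"]) for i in range(COUNT)}

# a jump that lands just past the 200th instruction
tags, scanned = tags_at(program, BASE + 4 * COUNT, {"x10"})
print(scanned, sorted(tags))
```

in software that's a cheap loop. in hardware it means either scanning those instructions on every jump or carrying the accumulated state around, which is the tracking problem described above.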
> > this is where it gets wasteful. the number of permutations that raise
> > illegal instruction traps is so high that it suggests that the encoding is
> > not a good one. i feel it would be better to use the 2 bits for elwidth.
> I think that SUBVL will actually be more useful there than elwidth,
> though those both will be very useful. The majority of graphics
> shaders use only i32/u32 and f32, whereas most of them use a range of
> SUBVL values.
we're not just covering 3D graphics: SV is for Video Processing, numerical
computation, and a lot more. (the meetup at WD on thursday gave some great
feedback, and someone specifically asked for Video Processing to be
with scalar RISC-V entirely missing compact 8-bit and 16-bit operations,
elwidth overrides are a sane way to get them (in bulk i.e. vectorised).
remember that it's perfectly possible to call SV.SETVL within a VBLOCK. or
also, that setting SUBVL in a VBLOCK will set it globally. in the example
you gave, i believe that the entirety of the individual "SET SUBVL" marks
may be replaced with one single global SET SUBVL, right at the top of the
can you think of an example where SUBVL would need to change hugely and
frequently, *and* where it would be sub-optimal to use VBLOCK-SVP
"grouping" of opcodes (when compared to it being mandatory to call
SVMODES-clear at the end of a function *or* before actually *calling* a
> > 128 bits worth of context-saving...
> not too bad since, if we design it right, saving/restoring SVMODES can
> be skipped for most system calls, it would only need to be
> saved/restored for context switches between processes.
to give some context (haha), the reason why i designed VBLOCK in the first
place was because i considered 128 bits worth of CSR setup in SV-Orig to be
far too much.
in addition, if the assumption is that system calls will use registers that
are not involved in the cascade, that's a *really* dangerous assumption
(and/or requires an entire recompilation of pretty much every single
distro's source code to make sure that it _is_ a correct assumption).
and/or requires setting certain boundaries (such as not utilising registers
x1-x31) for VBLOCKs, which then limits the entire purpose of the exercise.
keeping things "isolated" to a single function (or to isolated functions in
the same source code file that cannot be called externally) is the safest
and simplest thing to do, and that means *not* spilling the cascade-context
outside of the places where it's used.
think about it: if just one register happens to be left marked as
"vectorised" when running a function that is supposed to be scalar, it'll
destroy the entire function by using registers as vectors that were never
intended to be used as such.
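
a small python sketch of that failure, under stated assumptions: a 32-entry integer register file, a hypothetical sv_add helper, VL=4, and the simplification that a register tagged as vector is stepped through consecutive register numbers for VL iterations. none of these names come from the spec.

```python
# hypothetical model of "add rd, rs1, rs2" when some registers carry a
# stale vector tag. assumption: a tagged register number is incremented
# on each of the VL iterations, an untagged one stays fixed.
def sv_add(regs, vector_tags, vl, rd, rs1, rs2):
    loops = vl if any(r in vector_tags for r in (rd, rs1, rs2)) else 1
    for i in range(loops):
        d = rd + i if rd in vector_tags else rd
        a = rs1 + i if rs1 in vector_tags else rs1
        b = rs2 + i if rs2 in vector_tags else rs2
        regs[d] = regs[a] + regs[b]

def changed(before, after):
    return [i for i in range(len(before)) if before[i] != after[i]]

initial = list(range(32))  # register i holds the value i

# scalar function as its author intended: add x12, x12, x5
clean = list(initial)
sv_add(clean, set(), 4, 12, 12, 5)

# identical instruction, but x12 arrives still tagged as vector
leaked = list(initial)
sv_add(leaked, {12}, 4, 12, 12, 5)

clean_changed = changed(initial, clean)
leaked_changed = changed(initial, leaked)
print(clean_changed, leaked_changed)
```

the untagged run modifies only x12. the run with the leaked tag also overwrites x13, x14 and x15, registers the scalar function never mentions.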
it's just not safe, jacob, and expecting the entirety of the GNU/Linux
world to recompile 30,000 packages is not a reasonable expectation,
either. as a hybrid processor, it has to be "compatible" within the
confines of the UNIX Platform Spec.
> >> on return from an exception, SVMODES would be restored (by copying
> >> back from the saved csr or by switching which SVMODES csr is used).
> >> There would be a separate instruction that clears SVMODES to all zero,
> >> to allow calling scalar code quickly.
> > ok so here's where the VBLOCK concept has a clear advantage: that extra
> > instruction is not needed. once the VBLOCK context is exited, the
> > tear-down is automatic.
> Actually, I think VBLOCK not being able to work on more than a single
> basic-block at a time is a disadvantage compared to SVMODES.
i'd imagine it covering between three and maybe eight instructions... then
setting up a new VBLOCK context on another set of three to maybe eight
instructions. the VBLOCK setup overhead is small enough that it'd be ok,
and still compacting instructions down.
remember that the cascade rules go *well beyond* the initial one/two
instructions. the source and destination register(s) of *both* those first
instructions (up to a maximum of eight, which is a *lot*) can cause pretty
much all of the ongoing instructions to end up being vectorised as well.
my feeling is that the ripple effect will make VBLOCK-SVP be extremely
efficient, making the need for marking more than a couple of instructions
> SVMODES-clear instruction would mostly only be used when calling code
> that is not SVMODES-aware.
yes. understood. still don't like the overhead. focusing on it also
misses the additional point that makes the overhead of SVMODES-clear moot:
the actual *setup* is more costly than VBLOCK-SVP.
> > * SVMODEs uses CSRs, which is an inherent code-size penalty compared to
> You're missing that SVMODES is changed by each instruction (basically,
> SVMODES values follow values in registers),
i don't understand.
i'd expect that the SVMODEs would be set up, a few instructions called,
then SVMODEs-clear is called. anything else is dangerous, costly, complex,
and has far too many undesirable consequences.
exceptions to that would be static functions within the same source code
> so it would basically only
> need to be explicitly written on a context-switch.
[and to safely tear down the context so that the entirety of the GNU/Linux
software base does not require total recompilation to suit this scheme].
now, if we were google, there would be no problem: propose an entirely new
architecture, recompile the OS (Chromium, Android) to suit it, do what we
like. unfortunately, that's not the case, so stepping outside of certain
boundaries (certain knock-on consequences) isn't ok.
> svp.lw x32(vector), (a0), SUBVL=3 # encoded using 8 in the register
> field; sets SVMODES[x8] to SUBVL=3, vector, unpredicated
> ret # no tear-down instructions, following code just has to use a
> SVPrefix instruction if it uses an uninitialized register, most of the
> time a register will be written to the first time it's used, so
> SVPrefix is not required for those instructions
that is rreeeallly unsafe to do. it's fine if the example was a static
function, used exclusively by other functions in the same source code file,
where the compiler can safely determine that there will be no bleed-out of
the vectorisation cascade.