[libre-riscv-dev] [isa-dev] Re: SV / RVV, marking a register as VL.

Mon Sep 2 15:27:15 BST 2019

On Monday, September 2, 2019 at 1:04:31 PM UTC+1, Jacob Lifshay wrote:
>
> > if the same idea was applied *outside* of a VBLOCK, there would be no
> > such guarantee.  hmm, so that would need to be thought through.
>
> I am proposing that the vector/scalar flags would cascade throughout 
> the whole function/multiple functions,

single function, no problem.  multiple functions: now it's an order of 
magnitude more complex.  the cascade now becomes part of the contract for 
the function API.  that's an entire academic research project, all on its 
own.

we *need* to keep this simple and straightforward, jacob.

> which can be handled in the 
> compiler with a rather simple data-flow analysis. 

it's nowhere neeeear simple when branch/control is involved.

> You could think of 
> it as tagging the registers with a vector/scalar flag.

yes.  same principle for the Mill (except they tag with bit-width and other 
things).

ok, so let's say that there's a function which, on entry, the first thing 
it does is a data-dependent branch.

what should the engine do, when cascading the registers *from following the 
program counter*?

it's going to follow the branch on one call of the function, which will 
cause a completely different set of registers to become cascade-tagged as 
"vectorised", when compared to a subsequent call which goes the other 
direction, isn't it?
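
as a toy illustration of the point above (not part of any SV spec): the sketch below assumes a cascade rule where a destination register becomes "vectorised" if any of its sources is tagged, and loses its tag when overwritten by a purely scalar result.  the register names and the two instruction paths are hypothetical.  it only shows that following the program counter gives a different tag set per branch direction.

```python
# Toy model only. The cascade rule (destination inherits "vector" from any
# tagged source, and is untagged by a scalar overwrite) is an assumption made
# for illustration; it is not taken from the SV specification.

def cascade(trace, initial_vector_regs):
    """Propagate vector tags along a trace given in *execution* order.

    trace is a list of (dest, [sources]) tuples.
    """
    tagged = set(initial_vector_regs)
    for dest, srcs in trace:
        if any(s in tagged for s in srcs):
            tagged.add(dest)
        else:
            tagged.discard(dest)
    return tagged

# Hypothetical function body: the entry block ends in a data-dependent branch,
# so one call executes taken_path and another call executes not_taken_path.
taken_path = [("x10", ["x8", "x9"]), ("x11", ["x10"])]
not_taken_path = [("x12", ["x9"]), ("x13", ["x8", "x12"])]

# x8 is the only register tagged on entry in both calls.
tags_taken = cascade(taken_path, {"x8"})
tags_not_taken = cascade(not_taken_path, {"x8"})
```

with the same entry state, the two calls end up with different sets of tagged registers, which is exactly what a compiler (or a reader of the code) would have to reason about at every join point.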

now what happens when there are six nested loops and multiple 
data-dependent paths, as there are in the MP4 decode algorithm?

do you imagine that a compiler would be able to cope with that?

if it was successfully patched, what do you rate the chances of successful 
upstream acceptance of such a patch?

i'd rate the chances of the success of such an effort at close to zero.

so let's instead go with a sequential ordering based on the numbering 
(address) of the instructions: the cascade occurs in *instruction* order, 
as opposed to "execution" order.

now we have a hardware-level nightmare: it's now necessary to keep track of 
the program counter, perhaps read a dozen or perhaps even hundreds of 
instructions, tracking them all in order to create and maintain the cascade 
order.

the complexity is off the scale for one scenario (in hardware), and off the 
scale in a different direction (in compiler technology) for the other.

> > this is where it gets wasteful.  the number of permutations that raise
> > illegal instruction traps is so high that it suggests that the encoding is
> > not a good one.  i feel it would be better to use the 2 bits for elwidth.
>
> I think that SUBVL will actually be more useful there than elwidth, 
> though those both will be very useful. The majority of graphics 
> shaders use only i32/u32 and f32, whereas most of them use a range of 
> SUBVL values. 
>
>
we're not just covering 3D graphics: SV is for Video Processing, numerical 
computation, and a lot more.  (the meetup at WD on thursday gave some great 
feedback, and someone specifically asked for Video Processing to be 
included).

with scalar RISC-V entirely missing compact 8-bit and 16-bit operations, 
elwidth overrides are a sane way to get them (in bulk i.e. vectorised).

remember that it's perfectly possible to call SV.SETVL within a VBLOCK.  or 
CSRRWI SUBVL.

also, that setting SUBVL in a VBLOCK will set it globally.  in the example 
you gave, i believe that the entirety of the individual "SET SUBVL" marks 
may be replaced with one single global SET SUBVL, right at the top of the 
function.

can you think of an example where SUBVL would need to change hugely and 
frequently, *and* where it would be sub-optimal to use VBLOCK-SVP 
"grouping" of opcodes (when compared to it being mandatory to call 
SVMODES-clear at the end of a function *or* before actually *calling* a 
function)?

> > 128 bits worth of context-saving...
>
> not too bad since, if we design it right, saving/restoring SVMODES can 
> be skipped for most system calls, it would only need to be 
> saved/restored for context switches between processes. 
>
>
to give some context (haha), the reason why i designed VBLOCK in the first 
place was because i considered 128 bits worth of CSR setup in SV-Orig to be 
far too much.

in addition, if the assumption is that system calls will use registers that 
are not involved in the cascade, that's a *really* dangerous assumption.  
it requires an entire recompilation of pretty much every single distro's 
source code to make sure that it _is_ a correct assumption, and/or setting 
certain boundaries (such as not utilising registers x1-x31) for VBLOCKs, 
which then limits the entire purpose of the exercise.

keeping things "isolated" to a single function (or to isolated functions in 
the same source code file that cannot be called externally) is the safest 
and simplest thing to do, and that means *not* spilling the cascade-context 
outside of the places where it's used.

think about it: if one register happens to be left marked as "vectorised" 
when a function that is supposed to be scalar runs, it'll destroy the entire 
function by using registers as vectors that were never intended to be used 
as such.
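
to make that concrete, here is a minimal sketch, assuming a toy SV-style loop in which a tagged register steps to the next register number for each element and an untagged one does not.  `sv_add`, `run_scalar_callee`, the register numbers and the VL of 4 are all hypothetical, chosen only for illustration.

```python
# Toy model only: sv_add and its tagging rule are illustrative assumptions,
# not the SV specification. A tagged destination makes the operation loop
# VL times over consecutive register numbers.

def sv_add(regs, tags, vl, rd, rs1, rs2):
    """Add rs1 and rs2 into rd, looping over elements if rd is tagged."""
    count = vl if rd in tags else 1
    for i in range(count):
        a = regs[rs1 + i] if rs1 in tags else regs[rs1]
        b = regs[rs2 + i] if rs2 in tags else regs[rs2]
        regs[rd + i] = a + b

def run_scalar_callee(stale_tags, vl=4):
    """A callee that believes 'add x10, x11, x12' is a plain scalar add."""
    regs = [0] * 32
    regs[11], regs[12], regs[13] = 5, 7, 99
    sv_add(regs, stale_tags, vl, rd=10, rs1=11, rs2=12)
    return regs[10:14]  # x10..x13 after the single "scalar" instruction

clean = run_scalar_callee(set())   # no tags leaked into the callee
leaked = run_scalar_callee({10})   # x10 left tagged by the caller
```

in the clean case only x10 changes.  with the stale tag, the same instruction also overwrites x11, x12 and x13, none of which the scalar code expected to be touched.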

it's just not safe, jacob, and expecting the entirety of the GNU/Linux 
world to recompile 30,000 packages is not a reasonable expectation, 
either.  as a hybrid processor, it has to be "compatible" within the 
confines of the UNIX Platform Spec.

> >> on return from an exception, SVMODES would be restored (by copying 
> >> back from the saved csr or by switching which SVMODES csr is used). 
> >> 
> >> There would be a separate instruction that clears SVMODES to all zero, 
> >> to allow calling scalar code quickly. 
> > 
> > 
> > ok so here's where the VBLOCK concept has a clear advantage: that extra
> > instruction is not needed.  once the VBLOCK context is exited, the
> > tear-down is automatic.
>
> Actually, I think VBLOCK not being able to work on more than a single 
> basic-block at a time is a disadvantage compared to SVMODES. 

i'd imagine it covering between three and maybe eight instructions... then 
setting up a new VBLOCK context on another set of three to maybe eight 
instructions.  the VBLOCK setup overhead is small enough that it'd be ok, 
and it would still be compacting instructions down.

remember that the cascade rules go *well beyond* the initial one/two 
instructions.  the source and destination register(s) of *both* those first 
instructions (up to a maximum of eight, which is a *lot*) can cause pretty 
much all of the ongoing instructions to end up being vectorised as well.

my feeling is that the ripple effect will make VBLOCK-SVP be extremely 
efficient, making the need for marking more than a couple of instructions 
moot.
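
a rough sketch of that ripple effect, assuming (for illustration only) that an instruction counts as vectorised when its destination or any source is already tagged, and that its destination is then tagged too.  the six-instruction block and the register names are hypothetical; the real VBLOCK-SVP cascade rules are not spelled out in this message.

```python
# Toy model only: the rule "an instruction is vectorised if it touches any
# tagged register, and then tags its destination" is an assumption made for
# this sketch, not the actual VBLOCK-SVP definition.

def vectorised_instructions(block, marked):
    """Return one flag per instruction: True if it ends up vectorised."""
    tagged = set(marked)
    flags = []
    for dest, srcs in block:
        is_vector = dest in tagged or any(s in tagged for s in srcs)
        if is_vector:
            tagged.add(dest)
        flags.append(is_vector)
    return flags

# Hypothetical block in instruction order, as (dest, [sources]).
block = [
    ("x10", ["x8"]),          # reads x8, explicitly marked
    ("x11", ["x9"]),          # reads x9, explicitly marked
    ("x12", ["x10", "x11"]),  # picks up the tag from x10/x11
    ("x13", ["x12", "x5"]),   # picks up the tag from x12
    ("x14", ["x6", "x7"]),    # touches nothing tagged: stays scalar
    ("x15", ["x13", "x14"]),  # picks up the tag from x13
]

flags = vectorised_instructions(block, {"x8", "x9"})
vectorised_count = sum(flags)
```

only two registers are marked explicitly, yet most of the block ends up vectorised through the cascade.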

> the SVMODES-clear instruction would mostly only be used when calling code
> that is not SVMODES-aware.
>

yes.  understood.  still don't like the overhead (and that misses the 
additional point, which makes the overhead of SVMODES-clear moot: the 
actual *setup* is more costly than VBLOCK-SVP).

>
> > * SVMODEs uses CSRs, which is an inherent code-size penalty compared to
> > VBLOCK-SVPrefix.
>
> You're missing that SVMODES is changed by each instruction (basically, 
> SVMODES values follow values in registers), 

i don't understand.

i'd expect that the SVMODEs would be set up, a few instructions called, 
then SVMODEs-clear is called.  anything else is dangerous, costly, complex, 
and has far too many undesirable consequences.

exceptions to that would be static functions within the same source code 
file.

> so it would basically only need to be explicitly written on a
> context-switch.

[and to safely tear down the context so that the entirety of the GNU/Linux 
software base does not require total recompilation to suit this scheme].

now, if we were google, there would be no problem: propose an entirely new 
architecture, recompile the OS (Chromium, Android) to suit it, do what we 
like.  unfortunately, that's not the case, so stepping outside of certain 
boundaries (certain knock-on consequences) isn't ok.

>
> svp.lw x32(vector), (a0), SUBVL=3 # encoded using 8 in the register 
> field; sets SVMODES[x8] to SUBVL=3, vector, unpredicated 
>

> ret # no tear-down instructions, following code just has to use a
> SVPrefix instruction if it uses an uninitialized register, most of the
> time a register will be written to the first time it's used, so
> SVPrefix is not required for those instructions

that is rreeeallly unsafe to do.  it's fine if the example was a static 
function, used exclusively by other functions in the same source code file, 
where the compiler can safely determine that there will be no bleed-out of 
the vectorisation cascade.

l.