[libre-riscv-dev] [isa-dev] Re: SV / RVV, marking a register as VL.

Mon Sep 2 09:52:02 BST 2019

On Monday, September 2, 2019 at 3:32:01 AM UTC+1, Jacob Lifshay wrote:

> What about taking the vector/scalar propagation idea and just apply it 
> to normal instructions (outside of VBLOCK): 
>

i like it.  so let's do a comparison of the resource utilisation of each, 
and how it would work.

the premise of the VBLOCK-Prefix idea: up to 2 SVPrefixes apply to the 
first (and second) instruction.  the prefixes do not specify the register 
*numbers* (unlike SVOrig), they specify "src1 is a scalar/vector" and the 
actual register number is picked up from the [first] instruction.

reminder: the cascade basically says that any register number which is 
"marked" as vectorised by the first [and second] instruction *remains* 
marked as vectorised: if the same register number is used again (as a src) 
in subsequent instructions, it is still treated as a vector.  furthermore, 
the result register of each such subsequent instruction *also gets marked 
as vectorised*, and so on.  hence a "cascade".

The Mill Architecture also implements this concept, so it is not a new 
idea.  The Mill however starts from LDs.  only "LD" instructions actually 
specify the "type" and bit-width of the operand!  all arithmetic operations 
are subsequently polymorphic (there is no ADD.W, there is only ADD.  there 
is no FADD, there is only ADD), and consequently the instruction encoding 
is extremely compact.  it's very cool.

as branches [except back to the start] and function calls are prohibited 
within the VBLOCK, the compiler is guaranteed that the set of registers 
which cascade through, and the order in which they get marked, is 
completely static.
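
as a rough illustration of the cascade over a straight-line block (no 
branches, as in a VBLOCK), here is a minimal python sketch.  the function 
name, the tuple format and the register choices are all hypothetical, and 
it assumes the simplest possible rule: a result is marked as a vector if 
any of its sources is marked, and a marking is never removed.  the real 
scheme also lets the prefix mark the first instruction's operands 
individually, which is not modelled here.

```python
def cascade(seed_vectors, instructions):
    """Propagate 'vector' markings through a straight-line instruction list.

    seed_vectors: register names marked as vectors by the prefixes
                  (hypothetical input format).
    instructions: list of (rd, [srcs]) tuples in program order.
    Returns the set of registers marked as vector after the last instruction.
    """
    marked = set(seed_vectors)
    for rd, srcs in instructions:
        # Assumed rule: if any source is a marked vector, the result is too.
        if any(src in marked for src in srcs):
            marked.add(rd)
    return marked


# Three made-up instructions; only x3 is marked by the prefix.
block = [
    ("x5", ["x3", "x4"]),   # add x5, x3, x4   -> x5 becomes a vector
    ("x6", ["x5", "x7"]),   # add x6, x5, x7   -> x6 becomes a vector
    ("x8", ["x9", "x10"]),  # add x8, x9, x10  -> x8 stays scalar
]
result = cascade({"x3"}, block)
```

because the block is straight-line, a single pass in program order is 
enough, which is the "completely static" property mentioned above.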

if the same idea was applied *outside* of a VBLOCK, there would be no such 
guarantee.  hmm, so that would need to be thought through.

> there would be a 128-bit CSR (64-bit on RVE) with a 4-bit field for 
> each register field encoding [1], tentatively named SVMODES. It is 
> allocated to normal CSRs in the same manner as the time CSR -- by 
> splitting into XLEN-sized chunks. 
>

oh hang on - 128/4 is 32, are you suggesting a table-entry *per register*? 
the way that VBLOCK-SVPrefix works is that the register numbers are taken 
from the following instructions, and cascade onwards.  which saves a huge 
amount of CSR space.

i would be extremely surprised if more than 4, possibly 8, sets of prefixing 
were needed, given the cascading.  if the *entire* set of registers is 
marked in advance, there is no need for the cascade (at all).

reading further below i am guessing you're probably not.  this would reduce 
the required size of the CSR significantly.

let's assume a max of 4 regs: rd, rs1, rs2, rs3.  that would be 4x4=16 
bits.  let's assume say... 4 sets of those (up to 4 instructions in 
sequence may be "marked").  that's 64 bits.  if you wanted to cover up to 8 
instructions, yes that would be 128 bits.
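
the arithmetic above, written out as a small python sketch.  the constant 
and function names are made up for illustration; the numbers (4-bit fields, 
4 register operands, 32 registers for the per-register variant) come from 
the discussion.

```python
BITS_PER_FIELD = 4          # one 4-bit mode field per register operand
REGS_PER_INSTRUCTION = 4    # rd, rs1, rs2, rs3


def csr_bits(num_instructions):
    """CSR bits needed to mark the operands of num_instructions instructions."""
    return BITS_PER_FIELD * REGS_PER_INSTRUCTION * num_instructions


four_instructions = csr_bits(4)       # 4 x 4 x 4
eight_instructions = csr_bits(8)      # 4 x 4 x 8
per_register = BITS_PER_FIELD * 32    # one field for each of 32 registers
```

so covering 8 instructions costs the same 128 bits as a table with one 
entry per register.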

so the cost (number of instructions) needed to set up the CSRs would be:

* li x3, immediate # 64-bit immediate: a LD?
* CSRRW CSR1, x3
* li x3, immediate # another 64-bit immediate LD
* CSRRW CSR2, x3

if LD is used for the load-immediate, that's still a whopping *128* bits 
worth of setup, and other aspects of SVPrefix (extending the register 
numbers beyond 32, being able to use different predicate registers on 
different instructions) still need to be included.

by contrast, the VBLOCK prefix is only 16 bits, and whilst only up to 2 
instructions may be "marked", only between 16 and 64 bits are used to do 
so, *and* that covers setting of VL/MVL, targeting multiple predicate 
registers, and so on.

> [1]: not just register number, other extensions may have a register 
> field encoding to register number translation table, such as SVorig. 
>
> each field would be encoded as follows: 
> bits 0-1: 
>     00: SUBVL=1 
>     01: SUBVL=2 
>     10: SUBVL=3 
>     11: SUBVL=4 
>

mmm... SUBVL is supposed to be "global", however... what would be implied 
here is that SUBVL would be applied *per register*.  exactly how that would 
work would need to be thought through.

[oh, ok, you cover it, below].

>
> bits 2-3: 
>     00: scalar 
>     01: vector unpredicated 
>     1x: vector (predication -- TBD) 
>

element-width overrides are also necessary (2 bits to specify 
default/8/16/32).  that would i feel be better than changing the semantics 
of SUBVL (from a global to a per-register).
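
for reference, the 4-bit field as quoted above (SUBVL in bits 0-1, 
scalar/vector in bits 2-3) could be decoded roughly as follows.  this is 
only a sketch: the function names are hypothetical, and placing field n at 
bits 4n..4n+3 of SVMODES is an assumption, since the ordering of fields 
within the CSR is not specified in the proposal.

```python
def decode_field(field):
    """Decode one 4-bit SVMODES field into (subvl, mode)."""
    subvl = (field & 0b11) + 1          # bits 0-1: 00 -> SUBVL=1 ... 11 -> SUBVL=4
    mode_bits = (field >> 2) & 0b11     # bits 2-3: scalar/vector selection
    if mode_bits == 0b00:
        mode = "scalar"
    elif mode_bits == 0b01:
        mode = "vector-unpredicated"
    else:
        mode = "vector-predicated-TBD"  # 1x: predication not yet defined
    return subvl, mode


def field_for(svmodes, index):
    """Extract field `index` from the CSR value (assumed layout: bits 4n..4n+3)."""
    return (svmodes >> (4 * index)) & 0xF


# Made-up example: field 5 set to 0b0110, everything else zero.
svmodes = 0b0110 << (4 * 5)
decoded = decode_field(field_for(svmodes, 5))
```

swapping the SUBVL bits for an elwidth override, as suggested above, would 
only change the interpretation of bits 0-1 in `decode_field`.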

> the 4-bit field corresponding to rd would be written with the vector 
> mode of the result of each instruction, which is calculated from the 
> instruction and the vector modes as follows: 
> 1. If the instruction specifies scalar/vector mode in the encoding 
> (like via SVPrefix), then the mode specified by the instruction is 
> used. 
> 2. Otherwise, 
>     a. the scalar/vector mode is calculated for each of the source 
> operands by reading from the SVMODES field corresponding to the 
> register field encoding for that source operand. 
>

... and cascades from there.

>     b. If the SUBVL modes from the source operands don't match and the 
> instruction doesn't specifically handle differing SUBVL modes, then 
> trigger an illegal instruction trap (swizzle can handle different 
> SUBVL modes, most other instructions can't)

this is where it gets wasteful.  the number of permutations that raise 
illegal instruction traps is so high that it suggests that the encoding is 
not a good one.  i feel it would be better to use the 2 bits for elwidth.
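
to put a number on that, here is a small python sketch that counts the 
mismatching combinations.  it assumes every source operand carries its own 
2-bit SUBVL value, that the values are compared regardless of the 
scalar/vector bits, and that the instruction is not one (like swizzle) 
that handles mismatches.  the function name is hypothetical.

```python
from itertools import product

SUBVL_VALUES = (1, 2, 3, 4)


def trapping_combinations(num_sources):
    """Return (trapping, total) SUBVL combinations for num_sources operands."""
    combos = list(product(SUBVL_VALUES, repeat=num_sources))
    # A combination traps when the source SUBVL values are not all equal.
    trapping = [c for c in combos if len(set(c)) > 1]
    return len(trapping), len(combos)


two_src = trapping_combinations(2)    # (12, 16)
three_src = trapping_combinations(3)  # (60, 64)
```

under those assumptions, three quarters of the two-source combinations and 
all but 4 of the 64 three-source combinations end in a trap.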

>     2. The result predication is TBD (probably selected similarly 
> to SUBVL selection). 

twin-predication also needs to be thought through.  i think it would work 
though.

> There would be a separate SVMODES csr used for context switching, on a 
> trap, the old value would be saved (by switching which SVMODES csr is 
> used or by copying to a different csr), then the SVMODES csr would be 
> cleared to all zero (all registers scalar with SUBVL=1). 
>

128 bits worth of context-saving... 

>
> on return from an exception, SVMODES would be restored (by copying 
> back from the saved csr or by switching which SVMODES csr is used). 
>
> There would be a separate instruction that clears SVMODES to all zero, 
> to allow calling scalar code quickly. 
>

ok so here's where the VBLOCK concept has a clear advantage: that extra 
instruction is not needed.  once the VBLOCK context is exited, the 
tear-down is automatic.

> Since using SVMODES instead of all SVPrefix instructions just makes 
> the code smaller and faster, but doesn't allow using any more 
> instructions than before, it could be used like a more complicated way 
> to compress instructions and it would be possible to limit all SVMODES 
> handling to after the register allocator in the compiler, similar to 
> how RVC instructions can be substituted entirely in the assembler, 
> without the compiler needing to know (though knowing allows selecting 
> more optimal code at the expense of a more complex compiler). 
>
>
this is the logic behind VBLOCK-SVPrefix.  a compiler could (conceivably) 
simply output SVPrefix instructions, and a second-phase optimiser simply 
goes through them, spots which registers share the same prefixes, works out 
how to cascade them and *replaces* multiple SVPrefix instructions with a 
cascading VBLOCK-SVPrefix instead.

so comparing the two:

* SVMODES uses CSRs, which is an inherent code-size penalty compared to 
VBLOCK-SVPrefix.
* SVMODES requires tear-down instructions (another code-size penalty)

these are, for me, the "killers" if the focus is to be on reducing code 
size (and thus I-Cache usage and thus power consumption).  the overhead of 
the CSR setup sequence was why i came up with VBLOCK in the first place: 
VBLOCK-SVPrefix simply continues to take advantage of that opportunity.

l.