[libre-riscv-dev] SV / RVV, marking a register as VL.

Fri Aug 30 17:55:24 BST 2019

On Thursday, 29 August 2019 at 10:23:16 UTC+2, lkcl wrote:
>
>
>
> On Thursday, August 29, 2019 at 8:26:30 AM UTC+1, Rogier Brussee wrote:
>>
>> First of all see Bruce Hoult's remark: the whole issue may be moot and 
>> yet another layer of redirection seems meh.
>>
>
> shaving one instruction off of a 12-instruction loop is not to be sneezed 
> at, rogier!  and in SV, it's something like a reduction of 3 in 13, which 
> is a whopping 20% reduction!  one of those is on the loop-critical-path (an 
> 11% reduction) and the others are on the clean-up path.
>

For this example.

>  
>

> if the design principles of RISC and RISC-V are to be respected and 
> followed, small reductions in code size are significant, and big reductions 
> even more so.
>
>
See the "may" above. It is a matter of weighing cost and benefit.

>
> ideas:
>> * I could imagine CSRRA[I] (CSR read and add [immediate]) instructions 
>> complementing the "bitwise" CSR instructions. The problem is of course 
>> where to put them, because the CSR number is big. There seems to be room 
>> in the CSR funct3 == 0b100 minor opcode for an immediate version, but the 
>> privileged spec is a heavy user of CSR funct3 == 0b000 (albeit all with 
>> rd = x0), which makes it a bit awkward to also have a CSRRA 
>> instruction  :-(.
>>
>
> "here be dragons"... if you have one CSR being allowed this kind of 
> special treatment (arithmetic) pretty soon there will be calls for yet more 
> arithmetic operations.  at that point the ISA has a duplication of the 
> *entire* suite of arithmetic operators.
>

CSRRA would be allowed on all CSRs, just as CSRRS and CSRRC are allowed 
on all CSRs; it would just not necessarily be useful on each of them. I 
assume here that adds are more useful than any other operation beyond what 
is available now. Also, if you go down the road of squeezing it into the 
CSR/privileged opcode, you have no room left for anything else :-(.
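
To make the comparison concrete, here is a minimal Python sketch of the 
read-modify-write behaviour being discussed. csrrs and csrrc follow the 
usual "read old value, then OR / AND-NOT" semantics; csrra is the proposed 
instruction and does not exist in any spec. XLEN = 64 and the function 
names are assumptions made purely for illustration.

```python
# Toy model of CSR read-modify-write instructions on an XLEN-bit CSR.
# csrra is HYPOTHETICAL: it models the "read and add" idea from this thread.
XLEN = 64                      # assumed register width
MASK = (1 << XLEN) - 1

def csrrs(csr, rs1):
    """Return (rd, new_csr): read the old value, then set the bits in rs1."""
    return csr, (csr | rs1) & MASK

def csrrc(csr, rs1):
    """Return (rd, new_csr): read the old value, then clear the bits in rs1."""
    return csr, csr & ~rs1 & MASK

def csrra(csr, rs1):
    """Hypothetical: read the old value, then add rs1, wrapping at XLEN bits."""
    return csr, (csr + rs1) & MASK

vl = 12
old, vl = csrra(vl, MASK - 3)  # MASK - 3 is -4 in two's complement
print(old, vl)                 # 12 8
```

The only difference between the three is the combining operator, which is 
the sense in which an add would sit alongside the existing set and clear.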

In the slightly crazy idea category: I seem to recall that exactly what 
CSRRS and CSRRC do depends on the CSR, and that they are allowed to do 
"strange things" like nothing at all, or setting bits that are implied by 
others. After all, the semantics of a CSR instruction depend strongly on 
which CSR it targets. So you may be allowed (standard-wise, but more 
importantly, as far as toolchains are concerned) to think of CSRRS and 
CSRRC as just two (atomic) operations that you can fill in to your heart's 
desire, and if for your VL CSR you think it is more useful for CSRRS and 
CSRRC to mean (atomic) ADD and SUB rather than (atomic) OR and ANDN, more 
power to you.

(What seems to be standard WARL usage of a CSR is:

CSRRCI rd MVL  -1 # rd = platform dependent max vector length,  MVL = 0: 
 to indicate vectorisation is _on_ 

CSRRSI zero MVL -1  #  MVL = platform dependent max vector length. 

)
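
A rough Python sketch of the WARL ("write any, read legal") idea behind 
those two instructions: any value may be written, but only a legal one 
reads back. The class, the clamping rule and PLATFORM_MVL = 8 are all 
assumptions for illustration; a real implementation is free to legalise 
values differently. The all-ones mask stands in for the "-1" above.

```python
PLATFORM_MVL = 8  # assumed platform maximum vector length, illustrative only

class WarlCsr:
    """Toy WARL CSR: any value may be written, only legal values read back."""

    def __init__(self, value=0):
        self.value = self._legalize(value)

    @staticmethod
    def _legalize(value):
        # Assumed legalisation rule: clamp into the range 0..PLATFORM_MVL.
        return max(0, min(value, PLATFORM_MVL))

    def set_bits(self, mask):
        """CSRRS-style: return the old value, OR in mask, then legalise."""
        old = self.value
        self.value = self._legalize(old | mask)
        return old

    def clear_bits(self, mask):
        """CSRRC-style: return the old value, clear mask bits, then legalise."""
        old = self.value
        self.value = self._legalize(old & ~mask)
        return old

mvl = WarlCsr()
mvl.set_bits(0xFFFFFFFF)            # write all ones, read back the maximum
print(mvl.value)                    # 8
old = mvl.clear_bits(0xFFFFFFFF)    # old value is returned, CSR becomes 0
print(old, mvl.value)               # 8 0
```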

> CSRs were never intended for this kind of close-knit arithmetic tie-in.  
> you set them up, you maybe clear a bit or two, do lots of operations, and 
> then maybe set or clear a bit or two again.
>
>
Right. But what is the fundamental difference between atomically 
setting/clearing a bit and atomically adding and subtracting?

> VL *completely* breaks that rule, right from the SETVLI implementation 
> (VL=MIN(rs1, MVL)), and fail-on-first even more so.  fail-on-first not only 
> has a read-dependency on the VL CSR, it has a *write* dependency as well.
>
> this is the core of the argument for special-case treatment of VL (and 
> making it an actual scalar register): as a CSR its use goes well beyond 
> that for which CSRs were originally designed.
>
> whereas... if SETVLI is modified to set up a *pointer* to a scalar 
> register, *now* the VL CSR is more along the lines of how CSRs were 
> intended to be used.  set them up once to change the behaviour (and leave 
> them alone), do some tightly-dependent arithmetic work, then reset them.
>
>  
>
>> * As above, but just have an R-type instruction that only adds to the VL 
>> CSR. 
>>
>
> again, i'd be concerned at the special treatment.  once you want ADD, 
> someone else will want MUL.  and DIV.  and... etc. etc.
>
>  
>
>> *If you could mmap the CSR file,  you could use the AMO-ops to manipulate 
>> them, in particular use add and subtract (and max and min!). 
>>
>
> interesting.  i've mulled over the idea of mapping the CSR regfile SRAM 
> into the actual global memoryspace before.   the architectural implications 
> (and power consumption due to the load on the L1 cache) had me sliiightly 
> concerned.
>
> mind you, for 3D, we need separate pixel buffer memory areas and so on so 
> it's a problem that has to be solved.
>
> worth thinking through, some more, i feel.
>
>
>> * Ditch the idea that a VL CSR has to specify a VL register, and simply 
>> use one register for VL by convention (t1 = x6 or t2 = x7 ???) and use it 
>> implicitly, just like sp and ra are used implicitly in the C instructions, 
>> allowing the VL register to be specified in the 64(?) bit wide "allow to 
>> specify everything" version of your instructions. This, of course, 
>> requires specifying that you are in vector mode in ways other than 
>> VL != 1 if you want to use implicit vectorisation.
>>
>  
> i kinda like it, however mentally i am rebelling at the lack of 
> orthogonality.  allocating one register to VL means it's effectively 
> removed from use in all other circumstances...
>
>
It sort of depends on how you look at it. You are perfectly free to use, 
say, t1 (the vl register) for any other purpose as long as you don't do SV 
stuff, just like you are perfectly free to use ra for any other purpose; 
but of course if you C.JAL it gets implicitly trashed, and you have to save 
and restore the value on the stack if you call a function. In particular 
there is no compatibility issue. 

> ... and if one register is allocated, you still have to have the 
> dependency-tracking on that (one) scalar register, and if you have 
> dependency-tracking on one scalar register (as a "hidden" VL) you might as 
> well go the whole hog and go orthogonal.
>
>
I don't quite see what you mean here by dependency tracking, but you seem 
to have already decided that there is no dependency. My mental model is 
that an instruction "expands" to an instruction with an explicit vl 
register input which just happens to have a conventional number (just like 
sp in C.LDSP).
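
A small Python sketch of that mental model, assuming t1 = x6 as the 
conventional vl register. VecOp, expand and reads are made-up names for 
illustration and do not correspond to any actual SV encoding: the short 
form simply fills in the conventional register, after which vl is an 
ordinary register input like any other.

```python
from dataclasses import dataclass

VL_REG = 6  # t1 = x6, the conventional choice floated in this thread

@dataclass(frozen=True)
class VecOp:
    """Hypothetical long-form instruction with an explicit vl register."""
    opcode: str
    rd: int
    rs1: int
    rs2: int
    vl_reg: int

def expand(opcode, rd, rs1, rs2, vl_reg=None):
    """Expand a short (implicit-vl) form into the long explicit form."""
    return VecOp(opcode, rd, rs1, rs2, VL_REG if vl_reg is None else vl_reg)

def reads(op):
    """Registers the instruction reads, including the vl register."""
    return {op.rs1, op.rs2, op.vl_reg}

short = expand("add", 10, 11, 12)             # vl register implied
long_ = expand("add", 10, 11, 12, vl_reg=28)  # vl register spelled out
print(short.vl_reg, sorted(reads(short)))     # 6 [6, 11, 12]
print(long_.vl_reg)                           # 28
```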

> that said: from what i saw of the statistical analysis of register-usage by 
> gcc that WD did, many of the registers x1-x31 have near-zero percentage 
> utilisation, so something at the high end of the regfile numbering probably 
> wouldn't be missed.
>

I don't know exactly how you have arranged things, but if you address 
registers in blocks, having a block of 16 registers that can alternatively 
be used as scalar registers with the standard instructions is useful, and 
that is made more difficult if one of those registers is used as vl, 
whereas a scalar use of t1 can be replaced with a temporary in x16-x31 
(e.g. t3) without any problem. This is why t1 or t2, which are just as hard 
or easy to use as x31, are still better than x31.

>
> however if you do that (x31 for example), use of RVC instructions is out 
> of the question.  and if you _do_ allocate one of the registers accessible 
> by RVC (x8-15) you just took out a whopping 12.5% of the available 
> registers for use by RVC.
>
>
It would seem the important instructions to use with vl are C.ADDI, C.LI, 
C.MV, B.MAX and perhaps occasionally C.LDSP, C.SDSP, and C.ADD and SUB. 
Each of these RVC instructions uses a 5-bit register number.

> with all these things in mind - the VL CSR using the CSR regfile for ways 
> in which it was never originally designed being the most crucial - is the 
> idea of having VL be a pointer-to-a-scalar-reg starting to make more sense?
>
>
No, because even if redirection is free, once the vector length is in a 
register and not in a CSR, I don't really see what it buys you to be able 
to set _different_ registers as the vl register (even if you pass the 
vector length as a function parameter, just doing mv t1 a1 is shorter). 
I think such a CSR would (almost?) always point to a conventional register 
like t1. Also, it is a CSR's worth of state that has to be saved on context 
switches. Being able to make the vl register dependency explicit by making 
it part of the long "specify everything" version of your instructions 
seems "the right thing" though.
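
For reference, a minimal Python sketch of the strip-mined loop this whole 
discussion is about, with the vector length held in an ordinary variable 
standing in for t1. setvl uses the VL = MIN(rs1, MVL) rule quoted earlier; 
MVL = 4 and the function names are assumptions for illustration only.

```python
MVL = 4  # assumed maximum vector length, illustrative only

def setvl(requested):
    """VL = MIN(requested, MVL), as quoted earlier in the thread."""
    return min(requested, MVL)

def vec_add(a, b):
    """Strip-mined element-wise add; `vl` plays the role of t1."""
    out = []
    i, n = 0, len(a)
    while i < n:
        vl = setvl(n - i)  # elements handled in this pass
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl            # plain scalar arithmetic on the vl value
    return out

print(vec_add([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]))
# [11, 22, 33, 44, 55, 66]
```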

This cannot be uncoupled from a calling convention.

Anyway I was just giving you food for thought which it seems to have done 
:-).

Rogier

> l.
>
>