[libre-riscv-dev] SV / RVV, marking a register as VL.
luke.leighton at gmail.com
Fri Aug 30 19:25:05 BST 2019
On Saturday, August 31, 2019 at 12:55:25 AM UTC+8, Rogier Brussee wrote:
> shaving one instruction off of a 12-instruction loop is not to be sneezed at, rogier! and in SV, it's something like a reduction of 3 in 13, which is a whopping 20% reduction! one of those is on the loop-critical-path (an 11% reduction) and the others are on the clean-up path.
> For this example.
Yes. Which is "the" canonical example of sequential data-dependent fail-on-first parallel processing.
In SV we're down to 8 RVC opcodes within the loop and 2 on the exceptional exit path. Using VBLOCK, which contains the SETVL and the Vectorisation Context, costs 64 bits of overhead.
[Yes, SV can vectorise RVC opcodes]
> CSRRA would be allowed on all CSRs, just as CSRRS and CSRRC are allowed on all CSR registers; they would just not necessarily be useful. I assume here that adds are more useful than anything else except what is available now. Also, if you go down the road of trying to squeeze into the CSR/privileged opcode you have no room left for anything else :-(.
> CSRs were never intended for this kind of close-knit arithmetic tie-in. you set them up, you maybe clear a bit or two, do lots of operations, and then maybe set or clear a bit or two again.
> Right. But what is the fundamental difference between atomically setting/clearing a bit and atomically adding and subtracting?
That's slightly missing the point: the scalar registers are what you're supposed to do arithmetic on, and CSRs are what you use to change the behaviour of the engine: you set them, run a bunch of arithmetic ops, then switch the behaviour off again.
The CSRs are supposed to be "pushed" at the ALUs in a one-way fashion. Things like setting the FP CSR for example. Setting a mode for arithmetic saturation and so on.
The only time you are "supposed" to read CSRs is in unusual circumstances, such as context switches.
Vector data-dependent fail-on-first completely throws that rule out the window.
*In the middle of the loop*, VL is modified by the vbff opcode, giving it read-modify-write dependency requirements, and tying in the ALUs closely with instruction issue *and* CSR management in the process.
Before (without vbff), those dependencies on the VL CSR were read-only, so complexity _was_ greatly reduced. It isn't any more.
Clearly This Is Bad :) or, could be better.
Given that data paths and dependency tracking already exist between Vector and scalar instructions, making VL "be" a scalar register quite dramatically simplifies the microarchitectural implementation rather than complicates it.
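To make the read-modify-write point concrete, here is a minimal sketch of a data-dependent fail-on-first loop. It is a toy model, not the SV or RVV specification: the function names are hypothetical, and it assumes that the failing element itself is excluded from the new VL.

```python
def fail_first_nonzero(elements, vl):
    """Return a new VL: the count of leading non-zero elements among the first vl.

    Assumption (not from the spec): the element that fails the test is excluded.
    """
    for i in range(vl):
        if elements[i] == 0:
            return i
    return vl


def copy_until_zero(src, max_vl):
    """Copy elements in chunks of up to max_vl, stopping before the first zero."""
    dst = []
    pos = 0
    while pos < len(src):
        vl = min(max_vl, len(src) - pos)         # SETVL-like step: VL is written here
        chunk = src[pos:pos + vl]
        new_vl = fail_first_nonzero(chunk, vl)   # VL is read *and* rewritten mid-loop
        dst.extend(chunk[:new_vl])               # later ops must see the *new* VL
        if new_vl < vl:                          # truncated: take the exit path
            break
        pos += vl
    return dst
```

Every operation after `fail_first_nonzero` depends on the value it wrote, which is exactly the kind of hazard a scalar register file already tracks.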
> *Ditch the idea that a VLCSR has to specify a VL register, but simply use one register for VL by convention (t1 = x6 or t2 = x7 ???) and use it implicitly, just like sp and ra are used implicitly in the C instructions,
> It sort of depends on how you look at it. You are perfectly free to use say t1 as vl register for any other purpose as long as you don't do SV stuff,
Ah. We've switched contexts. I was trying to keep this mostly to RVV. However the idea applies to both.
Ok for SV, as it uses standard scalar opcodes, there is a lot more pressure on the scalar register number allocation. Even though the regfiles have been extended to a whopping 128 each, they are only accessible by specifying "context" that modifies the 5 bit field in *scalar* opcodes (or 3 bits in RVC).
Thus if one of those is allocated hardwired to VL it has much more serious consequences for SV than for RVV.
> > ... and if one register is allocated, you still have to have the dependency-tracking on that (one) scalar register,
> I don't quite see what you mean here with dependency tracking
Read and write hazards on registers.
> but you seem to have already decided that there is no dependency.
Ah, if you are referring to my last message, I concluded that just reading the *pointer* of course does not affect the *data* (in this case a scalar reg),
Which now sounds obvious :)
The point was then not that the *data* dependencies are gone (clearly if the reg-that-VL-points-to is modified, you just changed VL so there is definitely a hazard there), the point was that there are no data dependencies by use of a *CSRR* instruction, which is subtly different.
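A small sketch of that distinction, as a toy model only. The class, the `vl_ptr` field and the read log are hypothetical illustration names, not part of any spec: reading the pointer touches no scalar register, whereas using VL reads the register it points to.

```python
class Machine:
    """Toy model: a scalar regfile plus a hypothetical 'VL pointer' CSR."""

    def __init__(self, nregs=128):
        self.regs = [0] * nregs
        self.vl_ptr = 0        # hypothetical CSR: number of the scalar reg holding VL
        self.reg_reads = []    # log of scalar regfile reads (stand-in for hazard tracking)

    def csr_read_vl_ptr(self):
        # CSRR-like access: returns the pointer and touches no scalar register.
        return self.vl_ptr

    def read_vl(self):
        # Using VL reads the pointed-to scalar register, so it is logged as a dependency.
        self.reg_reads.append(self.vl_ptr)
        return self.regs[self.vl_ptr]
```

For example, with `vl_ptr = 6` and `regs[6] = 4`, `csr_read_vl_ptr()` returns 6 and leaves the read log empty, while `read_vl()` returns 4 and records a read of register 6. Writing `regs[6]` afterwards changes what `read_vl()` returns, which is the hazard on the data.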
> My mental model is that an instruction "expands" to an instruction with an explicit vl register input which just happens to have a conventional number (just like sp in C.LDSP).
Yehh the case for C.LDSP was I feel slightly different. Hilariously (and OT), SV can actually redirect x2 (and override the element width), making C.LDSP useful for purposes beyond its original intent.
> I don't know exactly how you have arranged things, but if you address registers in blocks, having a block of 16 registers that can be used alternatively as scalar registers with the standard instructions is useful and made more difficult if one of those registers is used as vl,
Not at all. We have a Dependency Matrix logic block, which tracks all read and write register hazards. No pipeline *ever* needs to stall. evvvvverrrrr. Once data goes in, the pipeline *knows* that there will be a place for the result to go.
The VL dependency tracking, which had to be in there anyway, is now no longer a "special case", it's now just another scalar register.
Now, that happens to be a "hidden extra operand" to all opcodes, but that's exactly how VL has to be thought of anyway: a hidden operand that is implicitly added to every [vector] instruction.
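As a rough illustration of "VL is just another source operand", here is a sketch of a read-after-write check. Everything in it is an assumption for illustration: the `Op` record, the choice of x6 as the VL register, and the simplification that each operand is a single register number rather than a range.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

VL_REG = 6  # hypothetical: the scalar register currently acting as VL


@dataclass(frozen=True)
class Op:
    name: str
    dest: Optional[int]        # destination register number, or None
    srcs: Tuple[int, ...]      # explicit source register numbers
    vectorised: bool = False   # True if the op is marked as a vector op


def sources(op, vl_reg=VL_REG):
    """Explicit sources, plus VL as a hidden extra source for vectorised ops."""
    return op.srcs + ((vl_reg,) if op.vectorised else ())


def raw_hazard(earlier, later, vl_reg=VL_REG):
    """True if `later` reads a register that `earlier` writes."""
    return earlier.dest is not None and earlier.dest in sources(later, vl_reg)
```

With this model, an `addi` that writes x6 followed by a vectorised `add` on unrelated registers is flagged as a hazard, while the same `add` left scalar is not. No VL-specific logic is needed beyond appending the hidden operand.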
> whereas a scalar use of t1 can be replaced with a temporary in the x16-x31 (e.g. t3) without any problem.
See above. SV is under a lot more register allocation pressure.
> It would seem the important instructions to use with vl are C.ADDI, C.LI, C.MV, B.MAX and perhaps occasionally C.LDSP, C.SDSP, and C.ADD and SUB. Each of the RVC instructions use a 5 bit register number.
It's not the VL-related-arithmetic ops that worry me, it's that because SV uses *all* scalar opcodes and contextually marks them as "vectorised", the *vector* operations are under pressure if one of the regs is hard coded to VL.
It leaves compiler writers with far less flexibility. Whatever reg is picked, it creates a hole: the scalar regs on either side of it *cannot be used in a vector* that would span it.
It is just a route I do not want to go down.
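The size of that hole can be sketched numerically. This is a simplified model that assumes a vector of length n starting at register r occupies r..r+n-1 and ignores element-width overrides; the function name and the 32-entry default are illustrative only.

```python
def max_vector_len(start, reserved=None, nregs=32):
    """Longest run of contiguous registers beginning at `start` that avoids `reserved`.

    Assumes a vector of length n starting at r occupies registers r..r+n-1.
    """
    if reserved is None or start > reserved:
        return nregs - start       # nothing in the way up to the end of the regfile
    if start == reserved:
        return 0                   # the start register itself is taken
    return reserved - start        # the run must stop just before the reserved reg
```

With x6 hardwired as VL, a vector starting at x4 can be at most 2 elements long, where without the reservation it could run to 28.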
> with all these things in mind - the VL CSR using the CSR regfile for ways in which it was never originally designed being the most crucial - is the idea of having VL be a pointer-to-a-scalar-reg starting to make more sense?
> No because even if redirection is free, once the vector length is in a register and not in a CSR, I don't really see what it buys you to be able to set _different_ registers as the vl register,
I believe I explained that above: it is down to the unique design properties of SV, where the scalar regfile *is* the Vector regfile.
By overloading multi-issue execution semantics, SETVL tells the instruction issue phase how many sequentially numbered contiguous *scalar* operations are to be thrown at the *scalar* ALUs.
Making those actually SIMD ALUs is a microarchitectural optional optimisation.
Even having any parallelism *at all* is an optional performance optimisation.
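A minimal sketch of that issue-phase behaviour, assuming a plain ADD and no element-width or predication context (the function name and tuple format are hypothetical): one vectorised opcode becomes VL scalar operations on sequentially numbered registers.

```python
def issue_vector_add(regs, rd, rs1, rs2, vl):
    """Expand one vectorised ADD into vl scalar ADDs on sequentially numbered regs.

    Returns the list of scalar operations issued, as (name, rd, rs1, rs2) tuples.
    """
    issued = []
    for i in range(vl):
        # Each iteration is an ordinary scalar add; hardware may run them in parallel.
        regs[rd + i] = regs[rs1 + i] + regs[rs2 + i]
        issued.append(("add", rd + i, rs1 + i, rs2 + i))
    return issued
```

The loop here is purely sequential, which matches the point that parallelism is optional: a conforming implementation could execute these one at a time, or hand groups of them to SIMD ALUs.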
3D is a real pain, because the core ops involve 4x FP32 or 3x, and the pressure on a standard 32 entry regfile is ridiculous. The power consumption penalty of extra LD/ST is just completely unacceptable, so large regfiles are critical.
Normal 3D ISAs go with special 64 bit instructions. This instead creates pressure on executable size and the I-cache in a hybrid CPU / GPU.
A proprietary GPU wouldn't give a monkey's, the I cache and D cache and entire memory arrangement is radically different.
Being able to use RVC and being able to "prefix" RVC scalar opcodes to create one-off (non-VBLOCK) 32 bit vector opcodes is I feel really important for the ultra low power GPU space.
> Also, it is a CSR worth of state that has to be saved on context switches.
It has to be saved anyway.
> Being able to make the vl register dependency explicit by making it explicitly part of the long "specify everything version" of your instructions seems "the right thing" though.
On balance... yeah.
> Anyway I was just giving you food for thought which it seems to have done :-).
Yes, for which I am very grateful.
I'd like to see RVV be similarly improved through public transparent discussions, for the benefit of all implementors and of the RISC-V Vector community.