[libre-riscv-dev] [isa-dev] SV / RVV, marking a register as VL.

Thu Aug 29 08:53:53 BST 2019

On Wednesday, August 28, 2019 at 11:00:12 PM UTC+1, Bruce Hoult wrote:
>
> Have a register (or CSR) contain some sort of pointer to another 
> register? Just: no way. Micro-architectural nightmare. 
>

that's what i thought, initially: it's why i paused for a long time before 
raising the idea.  then it occurred to me that

(a) there's only one of them (VL is global) so the contents may be cached 
and
(b) in an efficient OoO design the CSRs *are* a register file which 
requires dependency-management anyway and
(c) the implication of the CSR-register-containing-a-pointer is just 
another dependency hazard

in addition, both predication and MV.X (regs[rd] = regs[regs[rs1]]) require 
pretty much exactly the same microarchitectural dependency hardware to be 
in place.  in the case of "CSR-register-is-a-pointer", the actual vector 
length is obtained via "regs[CSRregs[VL]]" which is near-identical to MV.X

[MV.X is the scalar equivalent of the vector-indexed move operation]

so a good vector engine will already *have* the required concepts / 
hardware in place and/or have to solve near-identical microarchitectural 
design issues anyway.
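
to make the comparison concrete, here is a minimal python sketch of the two lookups.  it is not taken from any spec: the names regs, mv_x, csr_vl_ptr and read_vl are hypothetical, and real hardware would do this at register-read time, not in software.  the only point is that both are a register read whose index comes from somewhere else.

```python
# illustrative model only: every name here is hypothetical, not from RVV or SV.

def mv_x(regs, rd, rs1):
    """MV.X: regs[rd] = regs[regs[rs1]] (register-indirect scalar move)."""
    regs[rd] = regs[regs[rs1]]

def read_vl(regs, csr_vl_ptr):
    """VL-as-pointer: the VL CSR holds a register *number*; the actual
    vector length is whatever that scalar register currently contains."""
    return regs[csr_vl_ptr]

regs = [0] * 32
regs[5] = 7           # x5 holds a register number (7)
regs[7] = 16          # x7 holds the value of interest

mv_x(regs, 10, 5)     # x10 = regs[regs[5]] = regs[7]
csr_vl_ptr = 7        # hypothetical: the VL CSR points at x7
vl = read_vl(regs, csr_vl_ptr)

regs[7] = 3           # a later write to x7 is, by construction, a write to VL
vl_after = read_vl(regs, csr_vl_ptr)
```

the last two lines are the hazard in question: a write to the pointed-to scalar register has to be ordered against anything that subsequently reads VL.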

in in-order systems, the one-stop-shop solution to everything is of course 
"stall, stall, stall"... :)

> The scalar instructions in, for example, this strncpy loop do not take 
> significant time. In a real version of the code they would be 
> interleaved with vector instructions rather than all at the end,

that's *if* the vector engine is a separate one from the scalar engine.  
some embedded low-cost solutions may not have a separate ALU, for example 
in embedded 3D.  you'll meet some of the people for whom such a 
microarchitectural design decision will be critical, tomorrow.

that having been said...

> and 
> would on almost all machines be completed long before the preceding 
> vector instruction is. In particular the move from the VL CSR would 
> happen soon after the vlbff.v and the increments to the pointers soon 
> after that.
>

...ok, great: so in an in-order system, clashes (dependency hazards) would 
be long gone by the time the CSR-pointing-to-the-register had been 
established.

as long as the code had been arranged so that the VL CSR pointer-setup was 
well in advance of its use.

> Maybe something like: 
>
>
i like this example.  it's really elegant.

strncpy: 
 mv a3, a0 # Copy dst 
loop: 
 setvli x0, a2, vint8 # Vectors of bytes.

ok so here there's a dependency: VL has a read dependency on a2. x0 is not 
written to, so there's no write dependency created.

vlbff.v v1, (a1) # Get src bytes 

this instruction has both a read *and* write dependency. v1 has a read 
dependency on VL, and because VL is written to it creates a second ongoing 
write-dependency.

vseq.vi v0, v1, 0 # Flag zero bytes

... which occurs here. so here, VL's (new value) creates a read dependency 
on both v0 and v1.

csrr t1, vl # Get number of bytes fetched

here, VL's new value from vlbff creates a read dependency on the scalar 
register, t1.  so there's one potential cycle's "grace" in an in-order 
system where stall would not occur.  as these are not complex 
operations i'd be really surprised if significant latency was required, no 
matter what the microarchitecture.

the rest of the assembly code is straightforward apart from the 
modification to a2 and looping back to where a2 is used...

vmfirst a4, v0 # Zero found? 
 add a1, a1, t1 # Bump src pointer 
 vmsif.v v0, v0 # Set mask up to and including zero byte. 
 sub a2, a2, t1 # Decrement count. 
 vsb.v v1, (a3), v0.t # Write out bytes 
 add a3, a3, t1 # Bump dst pointer 
 bgez a4, exit # Done 
 bnez a2, loop # Anymore? 

... here - and it was set up (written to) over 5 instructions ago as far as 
the entrance to the next loop iteration is concerned.  that's still a 
write-dependency, however, which in a seriously-fast out-of-order design 
may result in tripping the dependency hardware.
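
as a rough check on that distance, a small python sketch (illustrative only: wrap_distance is a hypothetical helper, and it assumes the branch is taken and one instruction per issue slot) counts how many instructions are issued from the write of a2 to the setvli at the top of the next iteration.

```python
# mnemonics transcribed from the loop in this mail, in program order.
loop = ["setvli", "vlbff.v", "vseq.vi", "csrr", "vmfirst", "add a1",
        "vmsif.v", "sub a2", "vsb.v", "add a3", "bgez", "bnez"]

def wrap_distance(body, writer, reader):
    """instructions issued after `writer` up to and including `reader`,
    following the loop back-edge if the reader sits earlier in the body."""
    n = len(body)
    d = (body.index(reader) - body.index(writer)) % n
    return d if d else n

# sub a2 writes a2; setvli reads it on the next trip round the loop.
gap = wrap_distance(loop, "sub a2", "setvli")
```

by this count the setvli is the fifth instruction issued after the sub, with vsb.v, add, bgez and bnez in between.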

so, let's go over it again, this time with the hypothetical 
VL-points-to-a-scalar-reg augmentation.

strncpy: 
 mv a3, a0 # Copy dst 
loop: 
 setvli t1, a2, vint8 # Vectors of bytes.

note that t1 is now the target.  this says - hypothetically - that t1 *is* 
VL.

so here there's a dependency: VL has a read dependency on a2. *t1* has a 
write dependency created on whatever is going to use it in the near future

vlbff.v v1, (a1) # Get src bytes

this instruction has both read *and* write dependencies. v1 has a read 
dependency not on VL, but on *t1*, and because *t1* is written to it 
creates a second ongoing write-dependency... *on t1*.

vseq.vi v0, v1, 0 # Flag zero bytes

... which occurs here. so here, t1's (new value) creates a read dependency 
on both v0 and v1.

# NO LONGER NEEDED csrr t1, vl # Get number of bytes fetched

t1 has *already* been set up with the required value [this is the (one) 
instruction in the loop that is saved, reducing the instruction count of 
the RVV loop by around... 8% or so (1 of 12)].
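
the arithmetic behind that figure, as a short python sketch.  it counts static instructions in the loop body exactly as listed in this mail, nothing more: it says nothing about cycles or dynamic behaviour.

```python
# loop body as listed in this mail (current RVV version), in program order.
rvv_loop = ["setvli", "vlbff.v", "vseq.vi", "csrr", "vmfirst", "add a1",
            "vmsif.v", "sub a2", "vsb.v", "add a3", "bgez", "bnez"]

# hypothetical VL-points-to-a-scalar-reg version: the csrr is dropped.
sv_loop = [m for m in rvv_loop if m != "csrr"]

saving = (len(rvv_loop) - len(sv_loop)) / len(rvv_loop)
print(f"{len(rvv_loop)} -> {len(sv_loop)} instructions, {saving:.1%} fewer")
# prints: 12 -> 11 instructions, 8.3% fewer
```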

again: the rest of the assembly code is straightforward apart from the 
modification to a2 and looping back to where a2 is used...

vmfirst a4, v0 # Zero found? 
 add a1, a1, t1 # Bump src pointer 
 vmsif.v v0, v0 # Set mask up to and including zero byte. 
 sub a2, a2, t1 # Decrement count. 
 vsb.v v1, (a3), v0.t # Write out bytes 
 add a3, a3, t1 # Bump dst pointer 
 bgez a4, exit # Done 
 bnez a2, loop # Anymore?
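
for anyone who wants to step through it, here is a behavioural python sketch of the modified loop.  it is a model under assumptions, not an implementation: MAXVL is a made-up hardware limit, the fault-first load is modelled as simply running off the end of the src buffer, and the a3 pointer bump is modelled by appending to dst.  the register names in the comments map onto the assembly above.

```python
MAXVL = 4  # hypothetical hardware limit on elements per vector op

def strncpy_model(src: bytes, n: int) -> bytes:
    """behavioural model of the modified loop, where t1 *is* VL."""
    dst = bytearray()
    a1, a2 = 0, n                          # src offset, remaining count
    while True:
        t1 = min(a2, MAXVL)                # setvli t1, a2, vint8
        v1 = src[a1:a1 + t1]               # vlbff.v v1, (a1)
        if t1 and not v1:
            # a fault on the very first element is a real trap
            raise RuntimeError("fault on first element")
        t1 = len(v1)                       # fault-first shortens VL, i.e. t1
        a4 = v1.find(0)                    # vseq.vi + vmfirst (-1 if no zero)
        keep = t1 if a4 < 0 else a4 + 1    # vmsif.v: up to and incl. the zero
        a1 += t1                           # add a1, a1, t1
        a2 -= t1                           # sub a2, a2, t1
        dst += v1[:keep]                   # vsb.v v1, (a3), v0.t + add a3
        if a4 >= 0 or a2 == 0:             # bgez a4, exit / bnez a2, loop
            return bytes(dst)

out = strncpy_model(b"hello\x00world", 16)
```

note that no step corresponding to csrr appears anywhere in the model: the shortened length is already sitting in t1 when the pointer bumps need it.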

again, t1 is all "read" here (not written to) so again, the only concern is 
that a2 had been written to 5 instructions up which, on looping back (on 
very fast systems), will create a write hazard back at the setvli, just as 
with the current revision of RVV.

so, honestly i'm not seeing anything unsurmountable, here.  if i haven't 
missed anything, my feeling is that a good dependency-tracking system will 
have the necessary hardware in place, and an in-order system is going to be 
using stall, stall, stall anyway.

does that look reasonable?

l.

> exit: 
>     ret 
>
> On Tue, Aug 27, 2019 at 12:45 PM lkcl <luke.l... at gmail.com> wrote: 
> > 
> > https://libre-riscv.org/simple_v_extension/appendix/#strncpy 
> > 
> https://libre-riscv.org/simple_v_extension/specification/sv.setvl/#questions 
>
>