[libre-riscv-dev] [isa-dev] [RFC] SV branch behaviour: augmentation to store results of conditional tests

Wed Sep 25 06:24:21 BST 2019

On Tuesday, September 24, 2019 at 8:00:17 PM UTC+1, Bruce Hoult wrote:
>
> On Tue, Sep 24, 2019 at 6:36 AM lkcl <luke.l... at gmail.com> 
> wrote: 
> > 
> > i wanted to run by people an augmentation of standard scalar RISC-V 
> branch, which is a little complex to explain. 
> > https://libre-riscv.org/simple_v_extension/appendix/#standard_branch 
> > 
> > with SV, there are no vector opcodes, only scalar ones that are 
> "augmented" with a hardware-level for-loop, and may - if the hardware 
> implementor chooses - be parallelised. 
> > 
> > with two possible registers in a branch operation to provide "tag" 
> context (src1 and src2), a lot can be done to provide augmentation options. 
>  the general idea is: if there are going to be multiple elements being 
> compared, then, well: 
> > 
> > (a) make them predicated and 
> > (b) store the results of the comparisons and 
> > (c) change the decision on whether to "branch" to be dependent on *all* 
> of the comparisons and 
> > (d) add fail-on-first data-dependency which can terminate the 
> comparisons early 
> > 
> > where (c) can be modified to be one of 4 decisions: 
> > 
> > * all-tests-zero (NAND) 
> > * all-tests-one (AND) 
> > * at-least-one-test-is-zero (NOR) 
> > * at-least-one-test-is-one (OR) 
>
> You appear to have reversed NAND and NOR. 
>

ah, thank you - logic dyslexia kicking in.
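
for reference, a minimal Python sketch (a model, not SV or RVV code) of the four decisions, with the NAND and NOR labels swapped back as Bruce points out. the function name `decide` and the mode strings are made up for illustration, and it assumes only the active (predicated-in) test results are passed in:

```python
def decide(tests, mode):
    """Reduce per-element test results (0/1) to a single branch decision.

    `decide` and the mode strings are illustrative names, not part of SV.
    """
    tests = [bool(t) for t in tests]
    if mode == "AND":     # all tests are one
        return all(tests)
    if mode == "OR":      # at least one test is one
        return any(tests)
    if mode == "NOR":     # all tests are zero
        return not any(tests)
    if mode == "NAND":    # at least one test is zero
        return not all(tests)
    raise ValueError("unknown mode: %r" % (mode,))


results = [1, 0, 1]
for mode in ("AND", "OR", "NOR", "NAND"):
    print(mode, decide(results, mode))
```

with the results [1, 0, 1], only OR and NAND give a "branch" decision.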

> Google "consensual branches". This is a standard feature in SIMT 
> architectures, though seldom publicly documented. Yunsup's thesis is a 
> good reference. 
>
> https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-117.pdf

page 22, the "cbranch.ifnone" instruction.  it says "they're not publicly 
described" (in NVIDIA GPUs - nothing really is!  i can't find *anything* on 
the internal NVIDIA ISA!)

yunsup mentions that popcount followed by scalar branches is equivalent, as 
is "thread-level predication" (was that the technique that has gotten 
Samsung's GPU into a bit-of-a-mess, can you recall what Mitch said about 
that, on comp.arch?)

holy cow, a search for "cbranch.ifnone" turns up almost *nothing*!  this was 
the only discussion i could find:
http://lists.llvm.org/pipermail/cfe-dev/2015-January/041098.html

ok, section 5.5 of yunsup's thesis says that only "cbranch.ifnone" and 
"cbranch.ifall" exist, and that, for lack of a "cbranch.ifany", *two* branches 
in a row are needed: a cbranch.ifnone followed by an unconditional jump.

"A new instruction would improve performance as it would decrease 
instruction count and make unrolling easier by eliminating a branch 
instruction from the middle of an unrolled region"
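
a rough Python model of that two-instruction sequence, compared with a single (hypothetical) cbranch.ifany. the arrangement assumed here, where cbranch.ifnone hops over the unconditional jump, is a guess at how the pair would be laid out; the function names are illustrative only:

```python
def cbranch_ifnone(mask):
    """Taken only when no element of the mask is set."""
    return not any(mask)


def ifany_via_ifnone(mask):
    """Emulate 'branch to target if any element is set'.

    Assumed sequence:
        cbranch.ifnone skip
        j target
      skip:
    Returns (reached_target, branch_instructions_executed).
    """
    executed = 1                    # cbranch.ifnone skip
    if cbranch_ifnone(mask):
        return False, executed      # hopped over the jump, carry on at skip
    executed += 1                   # j target
    return True, executed


def ifany_direct(mask):
    """Hypothetical single-instruction cbranch.ifany."""
    return any(mask), 1


for mask in ([0, 0, 0], [0, 1, 0], [1, 1, 1]):
    print(mask, ifany_via_ifnone(mask), ifany_direct(mask))
```

both versions reach the target for exactly the same masks; the emulated one executes a second branch instruction whenever the target is reached.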

so it looks like, by accident / design, i'm describing / advocating the 
same thing.

> When you convert SIMT to vectors it becomes a test on predicate masks. 
>
> In RVV I expect this would turn into using VPOPC and then comparing 
> against 0 or VL.
>

hmm that sounds like it would work, except it also sounds expensive, and, 
also, loses information at the point at which it was generated.

also, if the comparison itself was predicated, then it would be necessary 
to compare not against VL, but against VPOPC(mask):

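mcount = VPOPC(mask)                  # number of active (predicated-in) elements
v.cmp m0, v1, v2, mask                # predicated compare, results go into m0
rcount = VPOPC(m0)                    # number of active elements that passed
BNE/BEQ mcount, rcount, @branchpoint  # BEQ: all active passed, BNE: not all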

where in SV it's _literally_ just "BGE" [vector-context-setup aside].  
side-note: the same trick can't be deployed using C.BNEZ or C.BEQZ because 
SV needs *two* context-tagged registers to pull off this trick (one for the 
mask, one to store the comparison results).
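
a small Python model of the four-step sequence above, to show why the count has to be compared against VPOPC(mask) rather than VL once the comparison is predicated. `vpopc` and `predicated_cmp` are stand-ins, not real RVV intrinsics, and the model assumes masked-out elements produce 0 in the result mask:

```python
def vpopc(mask):
    """Count the set bits of a mask (stand-in for VPOPC)."""
    return sum(1 for bit in mask if bit)


def predicated_cmp(v1, v2, mask, test):
    """Element-wise compare where the mask is set.

    Assumption: masked-out elements give 0 in the result mask.
    """
    return [1 if m and test(a, b) else 0 for a, b, m in zip(v1, v2, mask)]


def all_active_pass(v1, v2, mask, test):
    """The four-step sequence: True if every active element passed."""
    mcount = vpopc(mask)                      # mcount = VPOPC(mask)
    m0 = predicated_cmp(v1, v2, mask, test)   # v.cmp m0, v1, v2, mask
    rcount = vpopc(m0)                        # rcount = VPOPC(m0)
    return mcount == rcount                   # BEQ mcount, rcount


v1 = [1, 2, 3, 4]
v2 = [1, 2, 0, 4]
mask = [1, 1, 0, 1]        # element 2 is predicated out


def eq(a, b):
    return a == b


m0 = predicated_cmp(v1, v2, mask, eq)
print(vpopc(m0) == len(v1))                # comparing against VL
print(all_active_pass(v1, v2, mask, eq))   # comparing against VPOPC(mask)
```

here every active element passes the test, yet the comparison against VL (4) fails because only 3 elements are active; comparing against VPOPC(mask) gives the intended answer.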

from what yunsup is saying, although it's not commonly known, 
consensual branches could improve performance *if* the full set of four is 
available.

to that end, would it be worthwhile explicitly adding them to RVV?  (two 
reasons why i ask: 1. because the team doing RVV analysis have far more 
expertise and resources to hand and 2. if there is direct equivalence in SV 
then our team has a hell of a lot less work to do).

to achieve direct equivalence, the cbranch.ifnone/all/any/some operation 
would itself need to be (optionally) predicated.  interestingly, it would 
need to take a vector mask register as src1, rather than an int/FP vector, 
because the vector integer/FP comparison instructions write to a destination 
mask register.
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-integer-comparison-instructions

the cbranch operation could then be macro-fused with vms*** and vmf*** 
operations.