[libre-riscv-dev] [isa-dev] [RFC] SV branch behaviour: augmentation to store results of conditional tests
lkcl
luke.leighton at gmail.com
Wed Sep 25 06:24:21 BST 2019
On Tuesday, September 24, 2019 at 8:00:17 PM UTC+1, Bruce Hoult wrote:
>
> On Tue, Sep 24, 2019 at 6:36 AM lkcl <luke.l... at gmail.com>
> wrote:
> >
> > i wanted to run by people an augmentation of standard scalar RISC-V
> branch, which is a little complex to explain.
> > https://libre-riscv.org/simple_v_extension/appendix/#standard_branch
> >
> > with SV, there are no vector opcodes, only scalar ones that are
> "augmented" with a hardware-level for-loop, and may - if the hardware
> implementor chooses - be parallelised.
> >
> > with two possible registers in a branch operation to provide "tag"
> context (src1 and src2), a lot can be done to provide augmentation options.
> the general idea is: if there are going to be multiple elements being
> compared, then, well:
> >
> > (a) make them predicated and
> > (b) store the results of the comparisons and
> > (c) change the decision on whether to "branch" to be dependent on *all*
> of the comparisons and
> > (d) add fail-on-first data-dependency which can terminate the
> comparisons early
> >
> > where (c) can be modified to be one of 4 decisions:
> >
> > * all-tests-zero (NAND)
> > * all-tests-one (AND)
> > * at-least-one-test-is-zero (NOR)
> > * at-least-one-test-is-one (OR)
>
> You appear to have reversed NAND and NOR.
>
ah, thank you - logic dyslexia kicking in.
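For reference, the four decisions with the NAND/NOR names swapped back can be sketched as reductions over the (optionally predicated) comparison results. This is only a minimal Python model: the function name `branch_decisions` and the list-of-bools representation are assumptions made for illustration, not anything defined by SV.

```python
def branch_decisions(tests, mask=None):
    """Reduce per-element comparison results to the four branch decisions.

    tests: list of bools, one per element comparison.
    mask:  optional list of bools (predicate); masked-off elements are ignored.
    Both the name and the representation are hypothetical.
    """
    if mask is None:
        mask = [True] * len(tests)
    # Keep only the results of elements that the predicate enables.
    active = [t for t, m in zip(tests, mask) if m]
    return {
        "NOR": not any(active),   # all tests zero
        "AND": all(active),       # all tests one
        "NAND": not all(active),  # at least one test is zero
        "OR": any(active),        # at least one test is one
    }
```

With no active elements, "AND" and "NOR" both come out true in this model; what real hardware should do in that case is a separate design question.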
> Google "consensual branches". This is a standard feature in SIMT
> architectures, though seldom publicly documented. Yunsup's thesis is a
> good reference.
>
> https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-117.pdf
page 22. "cbranch.ifnone" instruction. says "they're not publicly
described" (in NVIDIA GPUs - nothing really is! i can't find *anything* on
the internal NVIDIA ISA!)
yunsup mentions that popcount followed by scalar branches is equivalent, as
is "thread-level predication" (was that the technique that has gotten
Samsung's GPU into a bit-of-a-mess, can you recall what Mitch said about
that, on comp.arch?)
holy cow, a search "cbranch.ifnone" provides almost *nothing*! this was
the only discussion i could find:
http://lists.llvm.org/pipermail/cfe-dev/2015-January/041098.html
ok section 5.5 of yunsup's paper says that only "cbranch.ifnone" and
"cbranch.ifall" exist, and that, for lack of "cbranch.ifany", *two* branches
in a row are needed: a cbranch.ifnone followed by an unconditional jump.
"A new instruction would improve performance as it would decrease
instruction count and make unrolling easier by eliminating a branch
instruction from the middle of an unrolled region"
so it looks like, by accident / design, i'm describing / advocating the
same thing.
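The two-branch workaround can be modelled roughly as below. The sketch compares a hypothetical single "cbranch.ifany" against the emulation (a cbranch.ifnone that skips over an unconditional jump) and counts the control-flow instructions each executes. The exact code layout of the emulation and both function names are assumptions for illustration.

```python
def ifany_native(mask):
    """Hypothetical single cbranch.ifany: one control-flow instruction."""
    destination = "target" if any(mask) else "fallthrough"
    return destination, 1

def ifany_emulated(mask):
    """Assumed layout:  cbranch.ifnone skip ; j target ; skip: ..."""
    executed = 1  # the cbranch.ifnone itself
    if not any(mask):
        # No element set: ifnone is taken and hops over the jump.
        return "fallthrough", executed
    executed += 1  # the unconditional jump to the real target
    return "target", executed
```

Both versions reach the same destination for every mask. The emulated one executes an extra instruction whenever the branch is taken.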
> When you convert SIMT to vectors it becomes a test on predicate masks.
>
> In RVV I expect this would turn into using VPOPC and then comparing
> against 0 or VL.
>
hmm that sounds like it would work, except it also sounds expensive, and,
also, loses information at the point at which it was generated.
also, if the comparison itself was predicated, then it would be necessary
to compare not against VL, but against VPOPC(mask):
    mcount = VPOPC(mask)
    v.cmp m0, v1, v2, mask
    rcount = VPOPC(m0)
    BNE/BEQ mcount, rcount, @branchpoint
where in SV it's _literally_ just "BGE" [vector-context-setup aside].
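A small Python model of the popcount sequence shows why the count has to be compared against VPOPC(mask) rather than VL once the comparison is predicated. Two things here are assumptions: the comparison is taken to be ">=" (matching the BGE case), and masked-off elements are assumed to write 0 into the result mask. All function names are made up for the sketch.

```python
def popcount(bits):
    """Count the set bits in a mask (models VPOPC)."""
    return sum(1 for b in bits if b)

def masked_compare(v1, v2, mask):
    """Model of a predicated compare; assumes masked-off elements write 0."""
    return [m and (a >= b) for a, b, m in zip(v1, v2, mask)]

def all_pass_vs_mask(v1, v2, mask):
    """Branch condition comparing against VPOPC(mask)."""
    mcount = popcount(mask)
    m0 = masked_compare(v1, v2, mask)
    rcount = popcount(m0)
    return rcount == mcount

def all_pass_vs_vl(v1, v2, mask):
    """Branch condition comparing against VL; wrong when elements are masked off."""
    m0 = masked_compare(v1, v2, mask)
    return popcount(m0) == len(v1)
```

For v1 = [5, 1, 7], v2 = [3, 9, 2], mask = [True, False, True], every enabled element passes, so the mask-based test says "all pass" while the VL-based test does not.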
side-note: the same trick can't be deployed using C.BNEZ or C.BEQZ because
SV needs *two* context-tagged registers to pull off this trick (one for the
mask, one to store the comparison results).
from what yunsup is saying, although it's not commonly-known,
consensual-branches could improve performance *if* the full set of four is
available.
to that end, would it be worthwhile explicitly adding to RVV? (two reasons
why i ask: 1. because the team doing RVV analysis have far more expertise
and resources to hand and 2. if there is direct-equivalence in SV then our
team has a hell of a lot less work to do).
to achieve direct equivalence, the cbranch.ifnone/all/any/some operation
would need to itself be (optionally) predicated. interestingly, it would
need to take a vector mask register as src1, rather than an int/FP vector,
because vector/FP comparison instructions write to a destination mask
register.
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-integer-comparison-instructions
the cbranch operation could then be macro-fused with vms*** and vmf***
operations.