[libre-riscv-dev] SV Prefix questions

Wed Jun 26 03:32:10 BST 2019

On Tue, Jun 25, 2019 at 9:48 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Tue, Jun 25, 2019, 08:29 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > https://libre-riscv.org/simple_v_extension/sv_prefix_proposal/discussion/
> >
> > jacob: the compiler knows MAXVL at compile-time, however the engine does
> > not, and must be told what it is.  then, loops like the DAXPY one from
> > "SIMD Considered Harmful" will work (even on SV):
> >
> > # a0 is n, a1 is pointer to x[0], a2 is pointer to y[0], fa0 is a
> >   0:  li t0, 2<<25
> >   4:  vsetdcfg t0             # enable 2 64b Fl.Pt. registers
> > loop:
> >   8:  setvl  t0, a0           # vl = t0 = min(mvl, n)
> >   c:  vld    v0, a1           # load vector x
> >   10:  slli   t1, t0, 3        # t1 = vl * 8 (in bytes)
> >   14:  vld    v1, a2           # load vector y
> >   18:  add    a1, a1, t1       # increment pointer to x by vl*8
> >   1c:  vfmadd v1, v0, fa0, v1  # v1 += v0 * fa0 (y = a * x + y)
> >   20:  sub    a0, a0, t0       # n -= vl (t0)
> >   24:  vst    v1, a2           # store Y
> >   28:  add    a2, a2, t1       # increment pointer to y by vl*8
> >   2c:  bnez   a0, loop         # repeat if n != 0
> >   30:  ret                     # return
> >
> > in the above, let's say that MAXVL is 8 and a0 is 1,000.  SETVL goes "vl =
> > t0 = min(8, 1000)" and sets both vl and t0 to 8 as a result.  clearly the
> > next loop will do the same.
> >
> > only when a0 starts to drop *below* 8 will both vl and t0 be set to below 8
> > (MVL).  in RVV, the implementor is allowed to start "throttling" by
> > changing VL to 1/2 the value - this to prevent the "concertina" effect,
> > apparently:
> > https://en.wikipedia.org/wiki/Accordion_effect
> >
> > the thing is: i do not believe that the rules that you set would allow the
> > same behaviour.  i did try creating something like it, a year ago:
> > everything i could think of required either 2 to 3 additional instructions
> > (inside the loop) because there was no way to get the same "vl = rd =
> > min(rs, MVL)" behaviour as a single opcode, or it required a THIRD extra
> > argument to SETVL - addition of the MAXVL value as a parameter (a constant
> > or an immediate).
> >
> Just use setvli as the base instruction, the immediate has 11 bits --
> plenty for MAXVL:
> https://github.com/riscv/riscv-v-spec/blob/master/vcfg-format.adoc
> if we don't want to conflict with the V extension, we could add a custom 32
> or 48-bit instruction.

 conflict is resolved with isamux/isans.  that's not the problem: the
problem is, "going through this exercise exhaustively *at all*".

> >
> > that in turn would require the addition of an extra opcode, which is "off
> > the table" as a hard requirement for SV.
> >
> it may be off the table for SVorig, but I think we should add it to
> SVprefix -- it removes one CSR and reduces a common instruction sequence to
> 1 instruction rather than 2.

 at the cost of us having to spend a week, potentially longer, doing
a full and comprehensive evaluation of pseudo-assembly, compared to
riding off the back of a *lot* of work by a *lot* of extremely
experienced people in the field of Vectorisation.

 if we are to achieve the goal, taking shortcuts by recognising the
expertise and time taken by other experts, and "riding the wave" of
similar work, is absolutely crucial.

> For SVorig, adding a MAXVL CSR seems like the only option.

 aside from the time implications of having to go through the
pseudo-assembly, the implications are too far-reaching, and break a
paradigm that's a fundamental tenet of SV: *no new instructions*.

 to even add *one* instruction as the fundamental basis (i do not mean
"scalar instructions that become parallelised such as if we add
xBitManip") is one too many.

 the reason is very simple: add one instruction, and the *immediate*
requirement is to start adding compiler support and binutils support
for it.

 i tried adding a setvl to binutils, and within minutes of starting i
realised that the consequences - the cost - of doing so would be far
too high.

 it means that we now have to add "maintain a hard fork of gcc
indefinitely" to the already extremely long list of tasks.

 binutils is only updated for debian on a major stable release cycle,
which is every 18 months or so.  i tracked how long it took getting
risc-v support into binutils: it was something like 2 years.

 i appreciate that the Vulkan engine will be independently developed:
the thing is, SV is not just for use in Vulkan.  it's going to have to
go into gcc and binutils, and ffmpeg (for video) and other
general-purpose software, including the linux kernel, are going to get
recompiled with it.

> > setting alternative rules, it's *essential* to come up with working demo
> > pseudo-assembly.  after several weeks of thinking about this i gave up and
> > decided to go with something that was very close to RVV.  this not just to
> > save time and effort at the *design* phase, it's also to save on conceptual
> > explanations and also on compiler development.
> >
> The semantics I had defined before you removed the setvl instruction from
> SVprefix

 there was a setvl instruction?  that wasn't clear, at all, sorry.

> (putting a note that setvl was considered for removal would have
> been better, since that way people could still read the definition)

 true.

> are the
> same as the V extension when rs1 is not x0 except that I had the MAXVL <
> rs1 < MAXVL * 2 case set VL to floor(rs1 / 2) instead of ceil(rs1 / 2), I
> think ceil is a better option. MAXVL is the immediate operand. When rs1 is
> x0, the instruction loads VL with MAXVL.

i can't understand or follow that.  as in: i literally can't think
through - at all - how it works or would be used.  these are the rules
from the original:

1. Trap if imm > XLEN.
2. If rs1 is x0, then
    1. Set VL to imm.
3. Else If regs[rs1] > 2 * imm, then
    1. Set VL to XLEN.
4. Else If regs[rs1] > imm, then
    1. Set VL to regs[rs1] / 2 rounded down.
5. Otherwise,
    1. Set VL to regs[rs1].
6. Set regs[rd] to VL.

and i still can't think it through.  they're far too complicated.  why
is the test being divided by 2? why does rule 3 exist? what is the
point and purpose of setting VL to regs[rs1] / 2? why bother testing
against imm at all?  it just makes absolutely no sense at all.
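
for reference, the six rules above can be transcribed literally into a
few lines of python, which makes it easier to plug in numbers.  this is
only a sketch: XLEN = 64 is an assumption (the rules do not say), the
trap is modelled as an exception, and the function name and the
rs1_is_x0 flag are made up for illustration.

```python
XLEN = 64  # assumption: a 64-bit target; the rules do not state XLEN


def setvl_quoted_rules(rs1_is_x0, rs1_value, imm):
    """Return the new VL (also written to rd) per the six rules as quoted.

    Hypothetical helper: a literal transcription, not an official definition.
    """
    if imm > XLEN:                 # rule 1: trap
        raise ValueError("trap: imm > XLEN")
    if rs1_is_x0:                  # rule 2: rs1 is x0
        vl = imm
    elif rs1_value > 2 * imm:      # rule 3, exactly as written
        vl = XLEN
    elif rs1_value > imm:          # rule 4: halve, rounded down
        vl = rs1_value // 2
    else:                          # rule 5: request fits as-is
        vl = rs1_value
    return vl                      # rule 6: regs[rd] = VL


if __name__ == "__main__":
    for request in (1000, 16, 12, 9, 8, 3):
        print(request, setvl_quoted_rules(False, request, 8))
```

with imm = 8, a request of 12 gives VL = 6, a request of 8 gives
VL = 8, and a request of 1000 falls under rule 3 as written and gives
VL = XLEN.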

vl = rd = MIN(rs, MVL) is simple, really clear, very straightforward,
and, crucially, the RVV team have thought it through and given several
worked examples.
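
as a rough illustration of why the MIN form is easy to reason about,
here is the DAXPY loop quoted earlier, modelled in python.  it is a
sketch under assumptions: plain lists stand in for memory, an index
stands in for the a1/a2 pointer increments, and the names setvl and
daxpy are chosen here for illustration only.

```python
def setvl(n, mvl):
    """vl = rd = min(rs, MVL)."""
    return min(n, mvl)


def daxpy(a, x, y, mvl=8):
    """Strip-mined y = a*x + y; returns (new y, list of vl values used)."""
    n = len(x)
    y = list(y)                # work on a copy, leave the caller's list alone
    i = 0                      # stands in for the a1/a2 pointers
    vls = []
    # the assembly tests n at the bottom of the loop; testing at the top
    # does the same element work (none at all when n == 0)
    while n != 0:
        vl = setvl(n, mvl)             # setvl  t0, a0
        for j in range(i, i + vl):     # vld / vfmadd / vst over vl elements
            y[j] = a * x[j] + y[j]
        i += vl                        # add a1, a1, t1 ; add a2, a2, t1
        n -= vl                        # sub a0, a0, t0
        vls.append(vl)
    return y, vls


if __name__ == "__main__":
    result, vls = daxpy(2.0, [1.0] * 20, [3.0] * 20)
    print(vls)
```

with n = 20 and MVL = 8 the loop runs three times with vl = 8, 8, 4:
vl only drops below MVL on the final pass, and the loop body needs no
extra instructions to handle the tail.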

can you take a **SHORT** amount of time to write out some pseudo-code
loops: DAXPY and strncpy are two good canonical ones (see appendix).
this is the only way to actually understand the full implications.

 don't for goodness sake take 2 to 7 days on this.  if you have to
spend more than an hour on it, it should be a red flag.

 we *have* to speed up, by thinking through the consequences of
decisions, and drastically chop off avenues that would result in
massive amounts of work.

> The instruction I defined happened
> to share the same mnemonic as the V extension's setvl, but is a separate
> instruction.

 that wasn't made clear, sorry.

 i'll add it to the discussion page.

l.