[libre-riscv-dev] SV Prefix questions

Wed Jun 26 08:50:36 BST 2019

On Wed, Jun 26, 2019 at 7:39 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Tue, Jun 25, 2019 at 10:57 PM Luke Kenneth Casson Leighton
> <lkcl at lkcl.net> wrote:
> >  in addition, there is absolutely no way to use a single P48 or P64
> > instruction to cover LD/ST.MULTI.  how can you, if the value requested
> > to be set into VL is not the *actual* amount that is stored, there?
> Not the case, notice that the value is only different if VL is bigger
> than VLMAX.

 hmmm, it still seems extremely dodgy.

> Note the typo fix lower down.

 got it.  ok so say we have that DAXPY algorithm, the sequence goes
(assume MVL=64):

 a0 = 194 --> VL becomes 64.  scalar code reduces a0 by 64
 a0 = 130 --> VL becomes 64.  ditto
 a0 = 66  --> VL becomes ceil((66+1)/2) = 34??  next loop reduces by 34
 a0 = 32  --> VL becomes 32, loop bnez a0 ends the loop.
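
 as a rough sketch, that sequence can be reproduced in a few lines of
python. the rule inside next_vl() is only an assumption read off the
numbers above (including the "??" step for a0=66): it is not taken from
any spec, and the names next_vl and stripmine are made up for
illustration.

```python
def next_vl(avl, mvl):
    """VL chosen for a requested element count avl (a0), with maximum mvl.

    Assumed rule, read off the sequence in this email, not from a spec:
      avl <= mvl         -> VL = avl
      mvl < avl < 2*mvl  -> VL = ceil((avl + 1) / 2)   (the "??" step)
      avl >= 2*mvl       -> VL = mvl
    """
    if avl <= mvl:
        return avl
    if avl < 2 * mvl:
        return (avl + 2) // 2  # integer form of ceil((avl + 1) / 2)
    return mvl


def stripmine(a0, mvl=64):
    """Return the (a0, VL) pairs seen by each pass of the loop."""
    trace = []
    while a0 != 0:            # the loop's "bnez a0"
        vl = next_vl(a0, mvl)
        trace.append((a0, vl))
        a0 -= vl              # scalar code reduces a0 by VL
    return trace


for a0, vl in stripmine(194):
    print(f"a0 = {a0:3d} --> VL becomes {vl}")
```

 with MVL=64 and a0 starting at 194 this walks through the same four
steps as listed above.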

 i'll be absolutely honest: what complete ludicrous lunacy!  i totally
get it, in a traditional supercomputer design, where there's heavy
latency between vector reads, vector loads and so on: the accordion
effect *would* have an impact there:
https://en.wikipedia.org/wiki/Accordion_effect

 Mitch explained it well when he outlined that way to turn FMACs from
12R4W into 4x "laned" 1R1W by lengthening the pipeline from 5 stages
to *EIGHT*. each FMAC is only allowed to read its scalar operands
(r1, r2, r3) one at a time, in *sequence*, because in that "lane" there
is only the one read slot and only the one write slot.

 the problem was: to exploit the parallelism properly you needed a
whopping *SIXTEEN* element operations to be outstanding in the (4)
pipelines.

 in such architectures, yeah, the allocation becomes a serious problem
that results in architectural "stripmining" at the end of the loop
(not SIMD stripmining which is explicit instructions a la "SIMD
considered harmful").

 we're doing a multi-issue *out-of-order* design that happens to have
a vector front-end.  we do *not* have this problem, because the
multi-issue engine just keeps chugging on whatever operations are
thrown at it.

  in fact, by trying to allocate the last two sets of operations to
the exact same (scalar) register numbers, it could actually
potentially cause problems for the "Q-Table History" feature i am
working on (as in: it would become overloaded).

> So, for example, on RV64:
> if a0 is <= 64 (the max since VL > XLEN won't work due to predication)
> then
> sv.setvl a1, a0, 64
> sets VL to a0 without modification (by V spec rule 1).
>
> sv.setvl a1, x0, 64
> sets VL to 64 (since x0 is treated as rs1=infinity)
>
> in general, to set VL to an immediate value:
> sv.setvl rd, x0, NEWVL
>
> in general, to set VL using a register (when rs1 is smaller than 64):
> sv.setvl rd, rs1, 64

 mmmm ok - it works: i still don't like it.
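
 a minimal python model of the cases quoted above, as i read them. it
only covers what the examples actually state (rs1=x0 is treated as
infinity, and rs1 <= N passes through unmodified). the rs1 > N case is
deliberately left unimplemented because the examples do not pin it
down. sv_setvl and X0 are hypothetical names for this sketch.

```python
X0 = None  # hypothetical stand-in for "rs1 is x0", treated as infinity


def sv_setvl(rs1_value, n):
    """Model of `sv.setvl rd, rs1, N`: return the VL that is set.

    Only the cases spelled out in the quoted examples are modelled.
    """
    if rs1_value is X0:
        return n              # x0 as rs1: VL is set to the immediate N
    if rs1_value <= n:
        return rs1_value      # VL is set to rs1 without modification
    # rs1 > N "does different stuff"; not specified by the examples.
    raise NotImplementedError("rs1 > N is not covered by this sketch")


print(sv_setvl(X0, 64))   # sv.setvl a1, x0, 64
print(sv_setvl(17, 64))   # sv.setvl a1, a0, 64 with a0 = 17
```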

> in general, when the allocated registers are of size N:
> to set VL to a variable (only does different stuff when rs1 > N):
> sv.setvl rd, rs1, N
> to set VL to a constant:
> sv.setvl rd, x0, N
> >
> >  instead, LD/ST.MULTI is now forced to be a loop.
> not the case

 yep got it.

> to store all registers using sv.setvl:
>
> // assume VL is already saved somehow

 ... or does not need to be

> sv.setvl x0, x0, 64 // set VL to 64 and ignore the result
> svp.sd.vs x64, 512(a0) // store x64-x127 to *(a0 + 64 * sizeof(u64))

 can be done as a single P64 SD as well, btw.  that's more suitable
for function calls (to save local registers)

 ok so it works.
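
 to make the address arithmetic in that store explicit, here is a small
python sketch. it assumes (this is not stated in the thread) that
consecutive registers go to consecutive 8-byte slots, i.e. a unit-stride
store of 64-bit elements. store_addresses is a made-up helper name.

```python
def store_addresses(base, offset=512, first_reg=64, vl=64, elwidth=8):
    """Map each stored register name to its byte address.

    Assumption: unit-stride layout, one elwidth-byte slot per register.
    offset=512 is the immediate in `svp.sd.vs x64, 512(a0)`,
    i.e. 64 * sizeof(u64).
    """
    return {
        f"x{first_reg + i}": base + offset + i * elwidth
        for i in range(vl)
    }


addrs = store_addresses(base=0x1000)   # 0x1000 is an arbitrary example a0
print(hex(addrs["x64"]), hex(addrs["x127"]))
```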

> >  again, to reiterate: i do *not* believe it is a good idea to add
> > actual instructions to SVP.
> Weren't we going to add SVP instructions anyway for the 32-bit
> compressed versions of the 48-bit and 64-bit instructions?!!
> We are also adding all the P48 and P64 instructions.

 (explained separately: SVPrefix should not be considered "an
instruction".  it's an *embedding* format, which *embeds* 32-bit
opcodes and gives them "Vector state").

> > >     svp.setvl a3, a2, 48 // VLMAX is 48, since we have space for 48 registers
> >
> > i'd like to be able to suggest using the P64 encoding, here, however,
> > annoyingly, it's the 3-arg case, and the 3-arg case doesn't fit.
> 3-arg fits just fine, even 4-arg fits (fmadd).
> > which is why i split it out into 2 CSRs.
> I'm perfectly fine changing the encoding, I thought I'd just suggest
> one that is available.

 i meant, sorry, i didn't clarify: svp.setvl is the 3-arg case, which
cannot (obviously) fit into a 2-arg CSRRW or a 2-arg CSRRWI.  i was
looking at VLtyp and going "this could be used here in the svp.setvl
opcode, argh no it can't because it can't cover the 3-arg case, you
need an I-Type for that".

> >
> > what, exactly, is "wrong" with having one instruction to set MVL and
> > one to set VL?  yes it's one more instruction, what's wrong with that?
> >  it's not inside the loop.
> >
> > breaking the paradigm "there are no new opcodes" is *really* not to be
> > taken lightly.
> ok, that's fine for SVorig, however SVprefix is all about adding new
> instructions that don't require extensive setup sequences and compiler
> pain. Just because something isn't desired for SVorig doesn't mean
> that we should leave it out of SVprefix.

i view SVP as "a single-instruction vector-context-prefixing system"
where VBLOCK is "a multi-instruction vector-context-prefixing system".

this perspective not only drastically cuts the amount of "thinking" to
be done [there is little to no need to explore all permutations or
even to consider "designing" the P48 or P64 opcode space: it's done.
it's RV 32-bit, plus "prefix". end of story.], it also cuts the amount
of actual hardware design needed *and* creates a much simpler decode
engine.

thus, if an opcode gets added, it's a scalar opcode (32-bit
standard/custom, 16-bit RVC standard/custom), and thus gets "embedded"
in *either* SVP *or* VBLOCK.

thus, also, if we do decide to add sv.setvl, it should be sv*.setvl
*not* svp.setvl.

following on from what you found (and suggested just using RVV OP-V):

Formats for Vector Configuration Instructions under OP-V major opcode

 31 30         25 24      20 19      15 14   12 11      7 6     0
 0 |        zimm[10:0]      |    rs1   | 1 1 1 |    rd   |1010111| vsetvli
 1 |   000000    |   rs2    |    rs1   | 1 1 1 |    rd   |1010111| vsetvl
 1        6            5          5        3        5        7

let me create a page to explore it: gmail's rubbish (even converting
to html doesn't help).

l.