[libre-riscv-dev] [isa-dev] 3D Matrix-style operations / primitives

Wed Sep 18 09:16:10 BST 2019

On Wed, Sep 18, 2019, 01:06 lkcl <luke.leighton at gmail.com> wrote:

> On Wednesday, September 18, 2019 at 3:41:49 PM UTC+8, Jacob Lifshay wrote:
>
> > mv.x with 8-bit indexes sounds like a good idea.
>
> Yehyeh.  I wonder...
>
> >
> > assuming a vector of 4x4 matrices is stored as 4 separate vectors with
> > subvl=4 in struct-of-array-of-struct form (the form I've been planning on
> > using):
> > using standard (4+4) -> 4 swizzle instructions with 2 input vectors with
> > subvl=4 and 1 output vector with subvl=4, a vectorized matrix transpose
> > operation can be done in 2 steps with 4 instructions per step to give 8
> > instructions in total:
>
> How about 5?
>
> ldimm x4, 0x0004080c # transposition indices, packed 8bit
> {SVP.VL=4} MV.X x8, x4, elwidth=8
> {SVP.VL=4} MV.X x9, x4, elwidth=8
> {SVP.VL=4} MV.X x10, x4, elwidth=8
> {SVP.VL=4} MV.X x11, x4, elwidth=8
>
> You remember the idea we had 8 months ago to make the offsets relative to
> rd?
>
> How about 2? :)
>
> ldimm x4, 0x0004080c # transposition indices, packed 8bit
> {SVP.VL=4,SUBVL=4} MV.X x8, x4, elwidth=8
>
> Would that even work? Hm, it would work if there were a special bit or
> opcode that applies the offsets to SUBVL looping but not to VL.
>

that could work, but it would still take multiple clock cycles per matrix
due to the sheer number of registers read. also, we'd need some serious hw
optimizations in mv.x to make it faster than the 8 cycles/matrix of the
swizzle-based transpose (something like bin-packing or a register-read
cache, which sounds like a huge amount of area and power).
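
to make the register-read count concrete, here is a minimal Python model of
an indexed-move transpose. it is only a sketch under assumptions: the function
name mvx_transpose, the flat 16-element layout, and the rule "base index from
the SUBVL loop plus row offset from the VL loop" are all hypothetical and are
not taken from any mv.x specification. the base indices 0, 4, 8, 12 are the
four bytes packed into the constant in the quoted example.

```python
def mvx_transpose(flat, base=(0, 4, 8, 12)):
    """Model a 4x4 transpose done as an indexed move (gather).

    flat : 16 elements, row-major 4x4 matrix.
    base : per-column source indices (hypothetical packed 8-bit indices).
    Returns (transposed_flat, number_of_element_reads).
    """
    if len(flat) != 16:
        raise ValueError("expected a flat 4x4 matrix (16 elements)")
    out = [None] * 16
    reads = 0
    for row in range(4):          # stands in for the VL loop (assumption)
        for col in range(4):      # stands in for the SUBVL loop (assumption)
            src = base[col] + row  # indexed source element
            out[4 * row + col] = flat[src]
            reads += 1            # every output element costs one indexed read
    return out, reads


if __name__ == "__main__":
    matrix = list(range(16))
    transposed, reads = mvx_transpose(matrix)
    print(transposed)
    print("element reads:", reads)
```

in this model every one of the 16 output elements needs its own indexed read,
which is the cost being pointed at above.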

i personally think that we should just have a transpose
instruction/micro-instruction with only as much hw as it needs. it could
just expand to that swizzle sequence internally.
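
for reference, a two-step, eight-swizzle transpose can be modelled in Python
as below. this is a sketch, not the exact instruction sequence from the
earlier message (which is not reproduced in this thread): the helper name
swizzle2 and the selector patterns are assumptions, borrowed from the common
interleave-then-combine approach used for 4x4 transposes on SIMD hardware.

```python
def swizzle2(a, b, sel):
    """(4+4) -> 4 swizzle: pick 4 elements from the 8-element pool a + b."""
    pool = list(a) + list(b)
    return [pool[i] for i in sel]


def transpose4x4(r0, r1, r2, r3):
    """Transpose a 4x4 matrix given as four rows, using 8 swizzles."""
    # step 1: interleave pairs of rows (4 swizzles)
    t0 = swizzle2(r0, r1, (0, 4, 1, 5))  # r0[0] r1[0] r0[1] r1[1]
    t1 = swizzle2(r0, r1, (2, 6, 3, 7))  # r0[2] r1[2] r0[3] r1[3]
    t2 = swizzle2(r2, r3, (0, 4, 1, 5))  # r2[0] r3[0] r2[1] r3[1]
    t3 = swizzle2(r2, r3, (2, 6, 3, 7))  # r2[2] r3[2] r2[3] r3[3]
    # step 2: combine halves into output rows (4 swizzles)
    o0 = swizzle2(t0, t2, (0, 1, 4, 5))  # column 0 of the input
    o1 = swizzle2(t0, t2, (2, 3, 6, 7))  # column 1 of the input
    o2 = swizzle2(t1, t3, (0, 1, 4, 5))  # column 2 of the input
    o3 = swizzle2(t1, t3, (2, 3, 6, 7))  # column 3 of the input
    return o0, o1, o2, o3


if __name__ == "__main__":
    rows = ([0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15])
    for row in transpose4x4(*rows):
        print(row)
```

each swizzle reads only two subvl=4 inputs and writes one subvl=4 output, so
a dedicated transpose op that expands to these eight steps would not need any
indexed register reads.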

Jacob