[libre-riscv-dev] [Bug 139] Add LD.X and ST.X? Strided

Mon Oct 7 10:48:53 BST 2019

http://bugs.libre-riscv.org/show_bug.cgi?id=139

--- Comment #49 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
Apologies, I hadn't realised quite how important swizzling really is.

https://libre-riscv.org/simple_v_extension/vblock_format/#swizzle_format

I have been looking at the PLX 3D paper and it contains an algorithm for 4x4
matrix times 4x1 vector.

That algorithm is:

fmul f2, f1.xxxx, f10
fmac f2, f1.yyyy, f11, f2
fmac f2, f1.zzzz, f12, f2
fmac f2, f1.wwww, f13, f2
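
As a rough illustration of what those four instructions compute, here is a
minimal Python sketch. It assumes f1 holds the 4x1 vector, that f10..f13 each
hold one 4-element column of the matrix, and that a swizzle such as .xxxx
broadcasts one lane of f1 across all four lanes. The function names are
hypothetical and not taken from the paper or the spec.

```python
def swizzle(vec, pattern):
    """Pick elements of a 4-vector according to a pattern such as "xxxx"."""
    lanes = {"x": 0, "y": 1, "z": 2, "w": 3}
    return [vec[lanes[c]] for c in pattern]


def matvec_swizzle(f1, cols):
    """Model of the fmul + 3x fmac sequence.

    f1   : the 4-element vector (assumed layout)
    cols : [f10, f11, f12, f13], assumed to be the matrix columns
    """
    # fmul f2, f1.xxxx, f10
    f2 = [a * b for a, b in zip(swizzle(f1, "xxxx"), cols[0])]
    # fmac f2, f1.yyyy/zzzz/wwww, f11/f12/f13, f2
    for pattern, col in zip(("yyyy", "zzzz", "wwww"), cols[1:]):
        f2 = [a * b + acc for a, b, acc in zip(swizzle(f1, pattern), col, f2)]
    return f2


if __name__ == "__main__":
    columns = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    print(matvec_swizzle([1, 1, 1, 1], columns))
```

Each instruction needs a different swizzle on the same source register, which
is why four separate swizzle settings are required.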

VBLOCK swizzle table format can cope with this in a single block by setting a
swizzler onto four registers that are *redirected* to f1, each with a different
swizzle setting.

Macro-op fusion would result in *doubling* the number of instructions.

Neither option is ideal.

For this particular case however I am inclined to review the decision to put
the REMAP CSR on the back burner.

https://libre-riscv.org/simple_v_extension/remap/

These were intended for Matrices; however, I forgot about them after deciding
that Vector Mul was not as high a priority.

Swizzle looks to be extremely awkward and costly, making the REMAP CSRs
attractive by comparison.

With the right REMAP, setting

* SHAPE1 to operate on a 4-element continuous loop, attached to f2
* SHAPE2 to wait 4 elements before incrementing by 1, attached to f1

the Matrix Multiply is LITERALLY reduced to 2 instructions, one of which is to
clear out f2 to 4 zeros, the other is an FMAC with a VL of 16 (no SUBVLs).
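
The element schedule described above can be sketched in Python as follows.
This is only one reading of the two SHAPEs, not the REMAP specification:
it assumes SHAPE1 turns the element index i into i % 4 for f2, SHAPE2 turns
it into i // 4 for f1, and the matrix operand is left un-remapped so that it
walks linearly through 16 consecutive elements (f10..f13, taken here to be
the four columns). The function name is hypothetical.

```python
def matvec_remap(f1, matrix_flat, vl=16):
    """Model of "clear f2, then one FMAC with VL=16" under assumed REMAPs.

    f1          : the 4-element vector
    matrix_flat : 16 elements, f10..f13 laid end to end (assumed columns)
    """
    # Instruction 1: clear f2 to four zeros.
    f2 = [0.0] * 4
    # Instruction 2: a single FMAC, iterated over VL elements by the hardware.
    for i in range(vl):
        dst = i % 4    # assumed SHAPE1: 4-element continuous loop on f2
        src = i // 4   # assumed SHAPE2: hold for 4 elements, then step by 1
        f2[dst] = f1[src] * matrix_flat[i] + f2[dst]
    return f2


if __name__ == "__main__":
    flat = list(range(1, 17))  # columns [1..4], [5..8], [9..12], [13..16]
    print(matvec_remap([1, 1, 1, 1], flat))
```

Under these assumptions the loop accumulates f1[j] times column j into f2 for
each j, which is the 4x4 matrix times 4x1 vector product.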

VL could be set with an SVP-64 instruction, no need to set up a VBLOCK.

The alternative is to add REMAP to VBLOCK.
