[libre-riscv-dev] [isa-dev] 3D Matrix-style operations / primitives

lkcl luke.leighton at gmail.com
Wed Sep 18 09:34:13 BST 2019


On Wednesday, September 18, 2019 at 4:16:25 PM UTC+8, Jacob Lifshay wrote:
> On Wed, Sep 18, 2019, 01:06 lkcl <luke.l... at gmail.com> wrote:
> On Wednesday, September 18, 2019 at 3:41:49 PM UTC+8, Jacob Lifshay wrote:
> 
> > mv.x with 8-bit indexes sounds like a good idea.
> 
> Yehyeh.  I wonder...
> 
> > assuming a vector of 4x4 matrixes is stored as 4 separate vectors with subvl=4 in struct-of-array-of-struct form (the form I've been planning on using):
> > using standard (4+4) -> 4 swizzle instructions with 2 input vectors with subvl=4 and 1 output vector with subvl=4, a vectorized matrix transpose operation can be done in 2 steps with 4 instructions per step to give 8 instructions in total:
> 
> How about 5?
> 
> ldimm x4, 0x0004080c # transposition indices, packed 8bit
> {SVP.VL=4} MV.X x8, x4, elwidth=8
> {SVP.VL=4} MV.X x9, x4, elwidth=8
> {SVP.VL=4} MV.X x10, x4, elwidth=8
> {SVP.VL=4} MV.X x11, x4, elwidth=8
> 
> You remember the idea we had 8 months ago to make the offsets relative to rd?
> 
> How about 2? :)
> 
> ldimm x4, 0x0004080c # transposition indices, packed 8bit
> {SVP.VL=4,SUBVL=4} MV.X x8, x4, elwidth=8
> 
> Would that even work? Hm, it would work if there was a special bit or opcode to apply the offsets to SUBVL looping but not VL.
> 
> that could work, but would still take multiple clock cycles per matrix due to the sheer number of registers read.

One of the downsides of MV.X (and vector shuffle in general).

> also, we'd need some serious hw optimizations to mv.x to have it be faster than the 8 cycles/matrix of the swizzle-based transpose (we'd need something like bin-packing or a register-read cache, sounds like huge area and power).

It would be 2R1W per element rather than the usual 1R1W per element.
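
To make the port counts concrete, here is a rough Python model. It is only a sketch, not the hardware or the spec pseudocode. It assumes MV.X behaves as a gather, rd[i] = regs[base + regs[ridx + i]], keeps one index per register slot (ignoring the elwidth=8 packing), and the names `RegFile`, `mv` and `mv_x` are made up for illustration.

```python
class RegFile:
    """Toy register file that counts port accesses (illustration only)."""

    def __init__(self, size=128):
        self.regs = [0] * size
        self.reads = 0
        self.writes = 0

    def _check(self, n):
        if not 0 <= n < len(self.regs):
            raise IndexError("register number out of range")

    def read(self, n):
        self._check(n)
        self.reads += 1
        return self.regs[n]

    def write(self, n, value):
        self._check(n)
        self.writes += 1
        self.regs[n] = value


def mv(rf, rd, rs, vl):
    """Plain vector MV: one read and one write per element (1R1W)."""
    for i in range(vl):
        rf.write(rd + i, rf.read(rs + i))


def mv_x(rf, rd, base, ridx, vl):
    """Assumed MV.X semantics: rd[i] = regs[base + regs[ridx + i]] (2R1W)."""
    for i in range(vl):
        offset = rf.read(ridx + i)      # first read: the index
        value = rf.read(base + offset)  # second read: the indexed source
        rf.write(rd + i, value)         # single write


rf = RegFile()
rf.regs[8:24] = list(range(100, 116))   # 4x4 matrix, row-major, in r8..r23
rf.regs[4:8] = [0, 4, 8, 12]            # indices that pick out column 0
mv_x(rf, rd=40, base=8, ridx=4, vl=4)
# rf.regs[40:44] is now [100, 104, 108, 112]; rf.reads == 8, rf.writes == 4
```

Running `mv` over the same four elements costs 4 reads and 4 writes, so under these assumptions the indexed form doubles the read traffic per element.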

In the Dependency Matrix OoO design it would be much more like how predication has to work:

* allocate the MVs to Function Units
* have a shadow per MV.X waiting on the rs1 read
* when the read gets the offset out of the regfile, pass it to the MV.X operation
* release the shadow, or raise illegal instruction if the offset was out of range of the regfile
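
The four steps above can be caricatured sequentially in Python. This is a sketch only: real Dependency Matrices do this in parallel hardware, and `MvxFunctionUnit`, `IllegalInstruction` and `issue_mv_x` are hypothetical names. It again assumes rd[i] = regs[base + regs[ridx + i]] with one index per register slot.

```python
class IllegalInstruction(Exception):
    """Stand-in for the illegal-instruction trap."""


class MvxFunctionUnit:
    """One Function Unit handling one element of an MV.X."""

    def __init__(self, rd, base):
        self.rd = rd
        self.base = base
        self.shadowed = True   # may not commit while the shadow is held
        self.src = None

    def offset_arrived(self, offset, regfile_size):
        """Called when the index read completes."""
        src = self.base + offset
        if not 0 <= src < regfile_size:
            raise IllegalInstruction("offset outside the regfile")
        self.src = src
        self.shadowed = False  # offset is good: release the shadow


def issue_mv_x(regs, rd, base, ridx, vl):
    # Step 1: allocate one Function Unit per element.
    fus = [MvxFunctionUnit(rd + i, base) for i in range(vl)]
    # Steps 2-4: each FU waits on its index read, then releases or traps.
    for i, fu in enumerate(fus):
        fu.offset_arrived(regs[ridx + i], len(regs))
    # Nothing is written until every shadow has been released.
    values = [regs[fu.src] for fu in fus]
    for fu, value in zip(fus, values):
        regs[fu.rd] = value


regs = [0] * 64
regs[8:24] = list(range(16))
regs[4:8] = [0, 4, 8, 12]
issue_mv_x(regs, rd=40, base=8, ridx=4, vl=4)
```

The point of the shadow in this model is that an out-of-range offset raises before any destination register has been written, so the trap leaves the regfile untouched.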

Actually now that I think about it, SUBVL offsets need only be read once.
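
A quick sketch of why, under one possible reading of the VL=4, SUBVL=4 form: the offsets are indexed by the SUBVL loop only, so the same 4 offsets get reused on every VL iteration. The semantics assumed here, out[v*SUBVL + s] = regs[base + v + offsets[s]], are a guess for illustration, as is the function name `mv_x_subvl`.

```python
def mv_x_subvl(regs, base, offsets_reg, vl, subvl):
    """Gather with offsets applied to the SUBVL loop but not the VL loop.

    Returns the gathered elements and the number of index reads performed.
    """
    index_reads = 0
    offsets = []
    for s in range(subvl):               # each offset is read exactly once
        offsets.append(regs[offsets_reg + s])
        index_reads += 1

    out = []
    for v in range(vl):                  # VL loop: no further index reads
        for s in range(subvl):
            out.append(regs[base + v + offsets[s]])
    return out, index_reads


regs = [0] * 64
regs[8:24] = list(range(16))             # 4x4 matrix, row-major
regs[4:8] = [0, 4, 8, 12]                # transposition offsets
out, index_reads = mv_x_subvl(regs, base=8, offsets_reg=4, vl=4, subvl=4)
# out holds the transposed matrix; index_reads is 4 rather than vl * subvl
```

The source reads are still VL * SUBVL, so this only removes the repeated index reads, not the cost of fetching the elements themselves.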

Thinking about it some more, this is so similar to LD address-index mode that it might be possible to share the same hardware.

> i personally think that we should just have a transpose instruction/micro-instruction that has just as much hw as needed. it could just expand to that swizzle sequence internally.

It's a biig frickin pipeline and it's atomic (unless a CSR State reg is used, which then always has to be context-switched).

Will record the sequence above.


> 
> Jacob


