[libre-riscv-dev] [llvm-dev] RFC: First-class Matrix type

Thu Oct 11 10:11:29 BST 2018

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Thu, Oct 11, 2018 at 9:49 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Thu, Oct 11, 2018 at 1:27 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> > [off llvm-dev list, link/ref here:
> > http://lists.llvm.org/pipermail/llvm-dev/2018-October/126873.html]
> >
> > On Thu, Oct 11, 2018 at 8:59 AM Jacob Lifshay <programmerjake at gmail.com>
> > wrote:
> > >
> > > This sounds like it would be really useful for 3D Graphics APIs as
> > SPIR-V (the Vulkan/OpenCL2.1 intermediate representation) has matrices as
> > first-class types. This will promote round-trip-ability. I have also heard
> > that RISC-V's V extension wants to support matrices (not sure if it's as an
> > additional extension on top of V or as part of V proper).
> >
> >  if you can pseudo-code an example instruction, or load/store, that
> > would be great.
> >
> an example from SPIR-V would be OpMatrixTimesMatrix, on p. 158 of
> https://www.khronos.org/registry/spir-v/specs/unified1/SPIRV.pdf
> I would have to refresh my memory on the exact order of operations in a
> matrix multiplication in order to write pseudo-code.
>
> Here's an implementation of 4x4 by 4x4 matrix multiply from the glm library
> (almost-canonical math library for 3D graphics in C++):
> https://github.com/g-truc/glm/blob/6f6f4d3ae8df98f0145a575f383e0387f03b8626/glm/detail/type_mat4x4.inl#L630

 ok cool.  i did A-level maths, i remember matrix-multiply.

 this algorithm's more obvious (explicit), from here
 https://www.programiz.com/python-programming/examples/multiply-matrix

# iterate through rows of X
for i in range(len(X)):
   # iterate through columns of Y
   for j in range(len(Y[0])):
       # iterate through rows of Y
       for k in range(len(Y)):
           result[i][j] += X[i][k] * Y[k][j]

 also, at university i did an algorithm for being able to do massive
matrix multiply in restricted memory sizes.  it kinda cheated: just
use brute-force pre-analysis based on assumption that load/store is
much more expensive than calculating.

 well.... in assembler it would be possible to use the
vector-pointer-variant of LD, with an offset.  set up an array of
pointers to the start of each row, then use the offset as the column.
two 4x4 matrix multiplications is 16+16+16 values, which if they're
32-bit, that's 24 RV64 regs.  with 128 regs, now, that's doable.

 i *might* have thought of a way to do it, by allowing a "single-step"
mode on the parallel-hardware-loop, and flipping between four of those
"STATE" CSRs i mentioned.  a "single-step" mode would, instead of
completing the entire loop, just run once and only once, update the
element offsets in the current "STATE" and leave it at that.

 the single-step mode could be for the outer loops.  i believe by
reordering (reversing the order) of the two inner loops it will be
possible to do a vectorised version (i.e. parallelise that
result[i][j] +=)

i *think*.... i *think*... a combination of the "reordering" idea, the
vector-pointer-LD *and* the single-stepping might do the trick.

> >
> >  i did consider last night adding a "remap" option to the register
> > file.  so whereas normally you do this:
> >
> >  VL=3
> >
> >  add r10, r20, r30
> >
> > =>
> >
> >  add r10, r20, r30
> >  add r11, r21, r31
> >  add r12, r22, r33
> >
> > it would be (assuming elwidth=8):
> >
> > for i in range(VL):
> >    r10[i] = r11[i] + r12[i]
> >
>
> You would have to explain the semantics of r10[i] in more detail as I don't
> know what you meant by that.

 sorry, it would mean assuming - in c - a declaration:

 uint8_t[8] r10;

 and if elwidth is 16-bit, it would be

 uint16_t[4] r10;

 so, better:
    union reg_t {
        uint8_t[8] b;
        uint16_t[4] u;
        uint32_t[2] w;
        uint64_t[1] v;
   }

  reg_t r10, 11, 12

  if (elwidth == 8)
     for i in range(VL):
        r10.b[i] = r11.b[i] + r12.b[i]
 elif elwidth == 16)
     for i in range(VL):
        r10.u[i] = r11.u[i] + r12.u[i]
 etc.

l.