[libre-riscv-dev] [llvm-dev] RFC: First-class Matrix type

Thu Oct 11 11:10:58 BST 2018


On Thu, Oct 11, 2018 at 10:11 AM Luke Kenneth Casson Leighton
<lkcl at lkcl.net> wrote:

> > Here's an implementation of 4x4 by 4x4 matrix multiply from the glm library
> > (almost-canonical math library for 3D graphics in C++):
> > https://github.com/g-truc/glm/blob/6f6f4d3ae8df98f0145a575f383e0387f03b8626/glm/detail/type_mat4x4.inl#L630
>
>  ok cool.  i did A-level maths, i remember matrix-multiply.
>
>  this algorithm's more obvious (explicit), from here
>  https://www.programiz.com/python-programming/examples/multiply-matrix
>
> # iterate through rows of X
> for i in range(len(X)):
>    # iterate through columns of Y
>    for j in range(len(Y[0])):
>        # iterate through rows of Y
>        for k in range(len(Y)):
>            result[i][j] += X[i][k] * Y[k][j]

 ok so i *think* that by transposing Y (using LD-pointer), so that the
last line reads Yt[j][k] rather than Y[k][j], you would have rows
indexed by k for both X and Yt, and the inner loop would clearly be a
straight vector-row-multiply.
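
 a minimal sketch of that transposed form, in python to match the
snippet quoted above.  dot_row and matmul_transposed are hypothetical
names (not from glm or any ISA spec), and the transpose is a plain copy
standing in for whatever a transposing load would do in hardware:

```python
def dot_row(a, b):
    # element-wise multiply of two equal-length rows, then sum:
    # the part that would map onto one vector-row multiply
    return sum(x * y for x, y in zip(a, b))

def matmul_transposed(X, Y):
    # Yt[j][k] == Y[k][j]; a plain copy here, standing in for the transposing load
    Yt = [list(col) for col in zip(*Y)]
    # result[i][j] is the dot product of row i of X with row j of Yt
    return [[dot_row(X[i], Yt[j]) for j in range(len(Yt))]
            for i in range(len(X))]

X = [[1, 2, 3], [4, 5, 6]]
Y = [[7, 8], [9, 10], [11, 12]]
result = matmul_transposed(X, Y)
print(result)
```

 both operands of dot_row are walked in the same direction (along k),
which is the property the transposition is meant to buy.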

 basically, i think having an actual vector multiply operator, or
having 2D or 3D support, is something to look at in future, and that
it's possible to do as long as transposing the 2nd matrix (with the
associated load/stores) is acceptable.

 if we have a separate scratch RAM (which was planned anyway) then the
hit on L1/L2 cache is not an issue.

> > You would have to explain the semantics of r10[i] in more detail as I don't
> > know what you meant by that.
>
>  sorry, it would mean assuming - in c - a declaration:
>
>  uint8_t r10[8];
>
>  and if elwidth is 16-bit, it would be
>
>  uint16_t r10[4];
>
>  so, better:
>     union reg_t {
>         uint8_t  b[8];
>         uint16_t u[4];
>         uint32_t w[2];
>         uint64_t v[1];
>     };
>
>   union reg_t r10, r11, r12;
>
>   if (elwidth == 8)
>      for (i = 0; i < VL; i++)
>         r10.b[i] = r11.b[i] + r12.b[i];
>   else if (elwidth == 16)
>      for (i = 0; i < VL; i++)
>         r10.u[i] = r11.u[i] + r12.u[i];
>   etc.

 i realised afterwards that this is actually what's planned to be implemented :)
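
 for illustration only, the same elwidth-dependent add can be modelled
in python by treating each register as a 64-bit integer.  the names
split_lanes, join_lanes and vec_add are hypothetical, and three things
are assumptions rather than anything stated in this thread: lane 0
sits in the low bits, adds wrap on overflow (as C unsigned arithmetic
would), and lanes at index >= VL keep the destination's old contents:

```python
REG_BITS = 64  # assumed register width, matching the 8-byte union above

def split_lanes(reg, elwidth):
    # break a 64-bit value into REG_BITS // elwidth lanes, lane 0 in the low bits
    mask = (1 << elwidth) - 1
    return [(reg >> (i * elwidth)) & mask for i in range(REG_BITS // elwidth)]

def join_lanes(lanes, elwidth):
    # inverse of split_lanes
    reg = 0
    for i, lane in enumerate(lanes):
        reg |= lane << (i * elwidth)
    return reg

def vec_add(rd, rs1, rs2, elwidth, vl):
    # add the first vl lanes; lanes at index >= vl keep rd's old contents
    mask = (1 << elwidth) - 1
    d, a, b = (split_lanes(r, elwidth) for r in (rd, rs1, rs2))
    for i in range(vl):
        d[i] = (a[i] + b[i]) & mask  # wrap on overflow (assumption)
    return join_lanes(d, elwidth)

r11 = 0x00FF_0001_0002_00FF
r12 = 0x0001_0001_0001_0001
r10_8 = vec_add(0, r11, r12, elwidth=8, vl=8)    # eight 8-bit lanes
r10_16 = vec_add(0, r11, r12, elwidth=16, vl=4)  # four 16-bit lanes
print(hex(r10_8), hex(r10_16))
```

 the same two source registers give different results at the two
widths because a carry out of an 8-bit lane is dropped, whereas at
16-bit it lands in the upper half of the same lane.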

i believe i was thinking along the lines of skipping every other
element, i.e. having a 2D version of VL.  VL1 and VL2 in effect.

l.