[Libre-soc-isa] [Bug 567] Allow transparent scalar loads and stores to/from registers allocated as vectors

Tue Jan 5 18:50:56 GMT 2021

https://bugs.libre-soc.org/show_bug.cgi?id=567

Luke Kenneth Casson Leighton <lkcl at lkcl.net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lkcl at lkcl.net

--- Comment #2 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Alexandre Oliva from comment #1)
> My recommendation doesn't involve changing loads or stores in any way.
> 
> My recommendation is that the svp64 iteration over sub-register vector
> elements obeys the in-memory array/vector layout.

ok in standard vector terminology this is named "unit strided", and the
pseudocode is as follows (i am drastically simplifying, taking out elwidths,
predication, LE/BE, everything, stripping it back to fundamental basics):

    function op_ld(rd, rs) # LD not VLD!
      for (int i = 0, int j = 0; i < VL && j < VL;):
        # unit stride mode
        srcbase = ireg[rs] + i * 8; // assume 64 bit elwidth here
        ireg[rd+j] <= mem[srcbase + imm_offs];
        i++;
        j++;

note that that is not:

    function op_ld(rd, rs) # LD not VLD!
      for (int i = 0, int j = 0; i < VL && j < VL;):
        # unit stride mode "LDR" mode
        #                    vvvvv element-inversion done here
        srcbase = ireg[rs] + ((VL-1)-i) * 8;
        ireg[rd+j] <= mem[srcbase + imm_offs];
        i++;
        j++;

that would be NEON LDR as described here:
https://llvm.org/docs/BigEndianNEON.html
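
to make the difference between the two loops concrete, here is a minimal python sketch (not from the spec: the function names and the 0x1000 base address are made up for illustration) that only computes the sequence of addresses each variant would read, assuming 8-byte elements:

```python
def unit_stride_addrs(base, vl, elbytes=8, imm_offs=0):
    # element i is read from base + i * elbytes: ascending memory order
    return [base + i * elbytes + imm_offs for i in range(vl)]


def inverted_addrs(base, vl, elbytes=8, imm_offs=0):
    # element-inverted variant (the one that is NOT being done):
    # element i is read from the mirrored index (vl-1)-i
    return [base + ((vl - 1) - i) * elbytes + imm_offs for i in range(vl)]


if __name__ == "__main__":
    print([hex(a) for a in unit_stride_addrs(0x1000, 4)])
    print([hex(a) for a in inverted_addrs(0x1000, 4)])
```

the first list ascends through memory, the second descends: same set of addresses, opposite element order.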

>  I.e., if you load a
> vector with smaller-than-64-bit elements into a register with a dword load,
> and iterate over them with svp64, you visit them in the same order you would
> if iterating over array elements in memory, and in the same order you would
> if the register held a struct with an array, and you iterated over the
> elements in it.

this example - 64-bit loads followed by placement into smaller-width elements -
would result in truncation of the data, picking only the lower numbered bytes,
context being that the regfile SRAM is defined as being LE numbered/ordered
from 0 upwards, and the elements being indexed and ordered as LE from 0
upwards.

the pseudocode will therefore be as follows (assume src_elwidth=8 to indicate
64-bit reads):

    function op_ld(rd, rs, brev) # LD not VLD! (ldbrx if brev=True)
      for (int i = 0, int j = 0; i < VL && j < VL;):

        # unit stride mode, compute the address
        srcbase = ireg[rs] + i * src_elwidth;

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs 8-byte swap (because src_elwidth=8)
        if (bytereverse):
            memread = byteswap(memread, src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(rd, dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
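
the pseudocode above can be turned into a small runnable python model.  this is a sketch under assumptions, not the actual simulator: the regfile is modelled as a flat LE-numbered bytearray with 8 bytes per register, the base address is passed in directly rather than read from ireg[rs], and the names op_ld / set_polymorphed_reg are borrowed from the pseudocode purely for readability.

```python
def set_polymorphed_reg(regfile, rd, dest_bytes, j, value):
    # truncate to the destination element width, then store element j
    # in LE byte order, counting elements upwards from register rd
    offs = rd * 8 + j * dest_bytes
    trunc = value & ((1 << (dest_bytes * 8)) - 1)
    regfile[offs:offs + dest_bytes] = trunc.to_bytes(dest_bytes, "little")


def op_ld(regfile, mem, rd, base, vl, src_bytes=8, dest_bytes=4,
          brev=False, msr_le=True, imm_offs=0):
    # brev XNOR MSR.LE: for booleans, XNOR is simply equality
    bytereverse = (brev == msr_le)
    for i in range(vl):
        j = i  # no predication: src and dest indices advance together
        addr = base + i * src_bytes + imm_offs  # unit stride
        memread = int.from_bytes(mem[addr:addr + src_bytes], "little")
        if bytereverse:
            # byteswap across the full source element width
            swapped = memread.to_bytes(src_bytes, "little")
            memread = int.from_bytes(swapped, "big")
        set_polymorphed_reg(regfile, rd, dest_bytes, j, memread)


if __name__ == "__main__":
    mem = bytes(range(16))    # byte at address n holds the value n
    regfile = bytearray(32)   # four 8-byte registers
    op_ld(regfile, mem, rd=1, base=0, vl=2)
    print(list(regfile[8:16]))
```

passing msr_le=False (or brev=True with msr_le=True) exercises the byteswap path.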

> This means that while in data LE mode the iteration goes as you wrote under
> "One choice is", whereas in data BE mode the iteration goes as you wrote
> under "Another choice is".

ah.  this is incorrect (stemming likely from a misunderstanding of how
unit-striding works).  the description that Cesar gave of "Another choice is"
is in fact an 8-byte hard-coded BE numbering-reordering/meaning of the regfile
SRAM (which we are not doing).  we are doing it as:

* LE-ordered elements
* LE-numbered bytes in the underlying SRAM
* LSB0

taking Cesar's original elements, and giving LE "meaning" to the bytes *in*
each element, by adding b0 and upwards to indicate them, this comes out as, if
i can put it in tabular form:

  byte: 0      1       2     3       4     5       6     7
  bit:  0-7    8-15    16-23 24-31   32-39 40-47   48-55 56-63

 8-bit  L0     L1      L2    L3      L4    L5      L6    L7      8x 8bit
16-bit  {L0.b0 L0.b1} {L1.b0 L1.b1} {L2.b0 L2.b1} {L3.b0 L3.b1}  4x 16bit
32-bit  {L0.b0 L0.b1   L0.b2 L0.b3} {L1.b0 L1.b1   L1.b2 L1.b3}  2x 32bit
64-bit  {L0.b0 L0.b1   L0.b2 L0.b3   L0.b4 L0.b5   L0.b6 L0.b7}  1x 64bit

where within each element, b0 is the LSByte, and the meanings of b1, b2, b3
(each the next more significant byte) follow self-evidently from there.

this is how NEON is encoded: this is how we are doing it.
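
the table boils down to one formula: byte b of element j sits at byte offset j * (elwidth/8) + b within the 64-bit register.  a minimal python sketch (helper names are hypothetical, purely for illustration) that regenerates the rows of the table:

```python
def element_byte_offset(elwidth_bits, j, b):
    # byte offset, within one register, of byte b of element j
    elbytes = elwidth_bits // 8
    if not 0 <= b < elbytes:
        raise ValueError("byte index outside element")
    return j * elbytes + b


def layout(elwidth_bits, reg_bytes=8):
    # label each register byte 0..reg_bytes-1 with its element and byte
    elbytes = elwidth_bits // 8
    return ["L%d.b%d" % (k // elbytes, k % elbytes)
            for k in range(reg_bytes)]


if __name__ == "__main__":
    for width in (8, 16, 32, 64):
        print("%2d-bit" % width, " ".join(layout(width)))
```

note that for the 8-bit row this prints L0.b0, L1.b0 and so on, where the table simply writes L0, L1.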

what Cesar describes under "the other choice" (the one that we are NOT doing)
is that the numbering of each of the 8 bytes is inverted (BE-encoded):

  byte: 7      6       5     4       3     2       1     0
  bit:  56-63  48-55   40-47 32-39   24-31 16-23   8-15  0-7

(and the elements remain in the exact same positions: i.e. merely the top-level
bit-numbering changes from LE to BE).

or, another way to view it: the byte/bit columns have stayed the same, but it
is the ordering of the L* that has swapped (hard-coded anything in column 0 has
moved to column 7 and vice-versa, 1 swapped with 6, 2 with 5 and 3 with 4).

let us call this (the one we are not doing) "MSByte0" or more specifically
"MSByte0 ordering of 8x byte values" because in each batch of 8 bytes in the
underlying SRAM, under "The other choice", it's the lowest-numbered byte that
is the MSByte.

note that that is NOT:

        63-56 ......                              7-0

that is termed "MSB0 numbering" and we are NOT doing MSB0 numbering either
("upto" in VHDL terminology).

now.  let us take the case where memory is LDed.  this is going to be a bit
challenging, let me think how to do it... ok i think i got it.  we'll define
memory as byte-ordered:

   M0 M1 M2 M3 M4 M5 M6 M7.... Mnn

then, let us do a 64-bit src-width LD, unit strided, VL=2, and set
dest-width=32.  let us also use a LD operation, and set MSR.LE=1.  this will
DISABLE memory-level byteswapping (as expected).  our assignments then occur
(using the way we *are* going to be doing the regfile) as:

* L0.b0 = M0
* L0.b1 = M1
* L0.b2 = M2
* L0.b3 = M3

(that covers element 0, where the 64-bit data M0-M7 was truncated as a
LE-converted quantity, and stored into a 32-bit element - L0 - in LE byte
order).

next:

* L1.b0 = M8
* L1.b1 = M9
* L1.b2 = M10
* L1.b3 = M11

this is element 1, where, again, the 64-bit memory read M8-M15 was truncated
then stored into a 32-bit element - L1 - in LE byte order.

now, if you used "The Other choice", that would end up as:

* L0.b0 = M11
* L0.b1 = M10
* L0.b2 = M9
* L0.b3 = M8

* L1.b0 = M3
* L1.b1 = M2
* L1.b2 = M1
* L1.b3 = M0

which will, if we choose this meaning (MSByte0), become utterly confusing: far
worse than IBM choosing MSB0.  Cesar describes this as "wiring", which is
technically correct.  however it is a wiring that we know, from experience of
dealing with MSB0, is absolutely dreadful.  every access to the SRAM will
require an insertion of hard-coded "7-idx" in front of it, or a hard-coded
8-byte reversal Cat(list.reverse()).
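
to illustrate what that hard-coded "7-idx" does, here is a small python sketch (plain lists of labels rather than nmigen signals; the helper names are made up).  it starts from the register contents of the LE example (L0 = M0..M3, L1 = M8..M11) and applies the reversal:

```python
def msbyte0_view(reg_bytes):
    # the hard-coded "7-idx" reversal over one 8-byte register
    if len(reg_bytes) != 8:
        raise ValueError("expected exactly 8 bytes")
    return [reg_bytes[7 - idx] for idx in range(8)]


def elements(reg_bytes, elbytes):
    # split the register bytes into elements, each listed b0 upwards
    return [reg_bytes[k:k + elbytes]
            for k in range(0, len(reg_bytes), elbytes)]


if __name__ == "__main__":
    le_reg = ["M0", "M1", "M2", "M3", "M8", "M9", "M10", "M11"]
    print("chosen (LE) :", elements(le_reg, 4))
    print("MSByte0     :", elements(msbyte0_view(le_reg), 4))
```

the second line of output has both the element order and the byte order within each element flipped, which is the L0.b0 = M11 ... L1.b3 = M0 assignment listed above.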

there is no need for any of that in the HDL if we simply pick LE array and
arithmetic ordering, because that's what nmigen does naturally.

now, we *could* pick this numbering as one of the following:

* just a numbering designation (just some wires) but it's such hell to
  understand that we would be effectively smashing our heads against concrete
  walls and then trying to think (i.e. impossible)

* **ACTUAL** inversion - **ACTUAL** byte-inversion.  this would be, frankly,
  too stupid to even contemplate, as it would literally change the definition
  of the VL for-loop depending on the elwidths.

either way, neither of these things is happening.

leaving that aside, let us do BE on our chosen (NEON-like) regfile arrangement.
let us set:

* ld, LE=0 (BE mode), keep everything else the same: srcwid=64, dest=32, VL=2

in the way that we are doing the regfile - the way that NEON does it - this
would be:

* L0.b0 = M7
* L0.b1 = M6
* L0.b2 = M5
* L0.b3 = M4

checking against this image from wikipedia:
https://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Big-Endian.svg/250px-Big-Endian.svg.png

(needs adjusting to 64-bit but you get the idea).

because the loads are 64-bit, M0 contains the MSByte of the 64-bit numerical
value, and M7 contains the LSByte of that same numerical value.  thus, after
truncation, L0.b0 contains M7 and so on.  *NOT* M3-M0, because those are the
HIGH word of the underlying numerical 64-bit value.

correspondingly, when we get to element 1:

* L1.b0 = M15
* L1.b1 = M14
* L1.b2 = M13
* L1.b3 = M12
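
the BE case is easiest to check numerically.  a minimal python sketch (the function name is made up, and memory is assumed to hold the value n at address n, so that Mn == n):

```python
def be_load_truncate(mem, i, src_bytes=8, dest_bytes=4):
    # read source element i as a big-endian number (M0 is the MSByte)
    chunk = mem[i * src_bytes:(i + 1) * src_bytes]
    value = int.from_bytes(chunk, "big")
    # truncation keeps the LOW part of the numerical value
    low = value & ((1 << (8 * dest_bytes)) - 1)
    # the element is stored in the regfile in LE byte order, b0 first
    return value, low, list(low.to_bytes(dest_bytes, "little"))


if __name__ == "__main__":
    mem = bytes(range(16))
    for i in range(2):
        value, low, elbytes = be_load_truncate(mem, i)
        print("L%d: value=%#018x low=%#010x bytes(b0..b3)=%s"
              % (i, value, low, elbytes))
```

for element 0 the 64-bit value is 0x0001020304050607, its low word is 0x04050607, and so b0..b3 come out as M7, M6, M5, M4.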

now we can finally see the source of the confusion about "The other choice"
(MSByte0).  can you see that the ordering of the bytes with "The Other
Choice" is absolutely *nothing* like that of the BE-ordered ld operation?

this despite the fact that the LD was a 64-bit LD.

i trust that this helps explain why NEON-like LE numbering for both elements
and the bytes in the regfile has been chosen?

if that's clear then this bugreport can be closed as INVALID rather than
DEFERRED, because there's nothing to change, no spec changes needed (only
clarification).
