[Libre-soc-isa] [Bug 560] big-endian little-endian SV regfile layout idea

Tue Jan 5 00:48:23 GMT 2021

https://bugs.libre-soc.org/show_bug.cgi?id=560

--- Comment #56 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #55)
> (In reply to Luke Kenneth Casson Leighton from comment #53)
> > (In reply to Alexandre Oliva from comment #50)
> > > do you see now why it doesn't make sense that the conversion from BE to LE
> > > (or vice versa) places the MSByte in the LSByte?
> > 
> > deep breath: it doesn't matter if it "makes sense", it's what the actual
> > code - the simulator, the HDL of microwatt and the HDL of Libre-SOC -
> > actually do.
> 
> Yes, and Alexandre and I are saying that the CPU should be changed for the
> software reasons explained previously:

the hardware is already so complex and this introduces another dimension of
complexity that it is going to be one of those things that is simply not a good
idea to continue discussing for implementation at this time.

changes involving the register files are, as you have noted many times already,
where i put my foot down and say "no" due to the inherent complexity involved
in even beginning to assess, starting from the discussion and escalating from
there.

i have also made it clear a number of times how far behind we are and how
urgent it is that fundamental architectural design changes stop being added. 
"leaf node" ideas not a problem: massive impacting core design changes like
this one, they should have been brought up 18 months ago.

we are unfortunately here hindered further in the discussion by not having
nailed down a common frame of reference.

i did a quick check: both x86 and ARM do not do this.  ARM NEON LD/ST can
perform byteswapping based on endianness: they do NOT allow the endianness
optionally to propagate to the ALUs.  x86 SIMD chose LE and that's that.

> registers should keep their contents conceptually in the data endian mode
> the cpu is currently in,

the cost to an architecture in doing that is just insane.  registers
effectively need to be "tagged" (context propagated and saved or inferred
somehow) and/or byteswapped prior to use in the ALU.

this is insane.  to cover all SIMD permutations every register port needs an
8-to-8 8-bit crossbar in front of it, and we have a huge number of regfile
ports.

that is not happening (as in: i am saying: it's not going to happen)

> all arithmetic operations should byteswap from the
> registers' endian mode to LE (or whatever endian the ALU is implemented in)
> for operations byteswapping at the element-size for the operation, then back
> to the data endian mode to store the results in registers.

no.

absolutely no way.  due to the ALUs being SIMD you're asking for full 64 bit
8x8 crossbars on either every FU or on every register file port, because it's
not just 64 bit that needs 8-byte swapping, it's *all possible permutations* of
8, 16, 32 and 64 bit data that need bytereversing.
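
to illustrate why one fixed 64-bit reversal is not enough, here is a minimal sketch that works out, for each element width, which input byte lane would have to feed each output byte lane of a 64-bit register. the function name `byteswap_lanes` and the lane numbering (lane 0 = least significant byte) are assumptions made for this illustration only: this is not taken from the Libre-SOC HDL.

```python
def byteswap_lanes(elwidth_bits, reg_bytes=8):
    """For each output byte lane of a register, return the input byte lane
    that feeds it when every element of the given width is byte-reversed
    in place.  Lane 0 is assumed to be the least significant byte."""
    n = elwidth_bits // 8              # bytes per element
    lanes = []
    for out in range(reg_bytes):
        base = (out // n) * n          # first lane of the element holding 'out'
        lanes.append(base + (n - 1 - (out - base)))
    return lanes


# one permutation per SIMD element width packed into a 64-bit register
PERMS = {w: byteswap_lanes(w) for w in (8, 16, 32, 64)}

# distinct input lanes that output lane 0 must be able to select from
LANE0_SOURCES = sorted({PERMS[w][0] for w in PERMS})

if __name__ == "__main__":
    for width, perm in PERMS.items():
        print(width, perm)
    print("lane 0 sources:", LANE0_SOURCES)
```

each width gives a different permutation, so the routing in front of the ALU would have to be selectable per operation rather than being plain wires.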

it is 100% the case that that is not going into the design or the silicon.

> In BE mode, all of those byteswaps swap between BE in the registers and LE
> for the ALUs.

no.  far too costly in gate count, assessment time, evaluation time, and
specification writing time: everything about this screams "no".

> In LE mode, all of those byteswaps swap between LE in the registers and LE
> for the ALUs -- here the byteswaps are actually no-ops.

wires.  that's how it's going to be.

> Switching between BE and LE mode by flipping the mode bit in the appropriate
> SPR will byteswap all registers at 64-bit width in order to keep their
> values for OpenPower v3.x compatibility.

if this was remotely workable (not insanely costly a gate count) i would say
"yes, and we defer it"

however due to us doing SIMD ALUs it is not even the case that a straight 64
bit mux, which either swaps the order of the 8 bytes or leaves them alone, can
do the job: it can't.

what you're asking for is a full 8-in 8-out crossbar and the gate count on such
is so enormous (10 gates per 2x2 crossbar, 3 layers, 64 bits, a total of 1920
gates *per regfile port* and we will easily end up with well north of 30-40
regfile ports) that the answer has to be no.
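
the arithmetic behind that estimate can be written out as a short sketch. the per-crossbar gate count, layer count, width and port range are the figures quoted above; the constant and function names are made up for this illustration.

```python
GATES_PER_2X2 = 10   # gates per 2x2 crossbar element (figure quoted above)
LAYERS = 3           # layers of 2x2 elements (figure quoted above)
BITS = 64            # register width in bits

# cost of one crossbar in front of one regfile port
GATES_PER_PORT = GATES_PER_2X2 * LAYERS * BITS


def total_gates(ports):
    """Total crossbar gates if every one of 'ports' regfile ports gets one."""
    return GATES_PER_PORT * ports


if __name__ == "__main__":
    print("per port:", GATES_PER_PORT)
    for ports in (30, 40):
        print(ports, "ports:", total_gates(ports))
```

with those figures, 30 to 40 ports comes to between 57,600 and 76,800 gates spent purely on byte routing.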

the reason why it is acceptable in LD/ST is because the byteswapping is
isolated to the LDST units.

*not* on the front of every single regfile port.

answer: no.  this is not going to happen.

i'll wait for alexandre, so that you have opportunity to understand here how
bytereversing works in the HDL and removes the ordering at the memory layer.

after that i would like this bugreport firmly closed as "WONTFIX" or "INVALID".

if there exists any architecture that does this type of register-exposed
byteswapping in the SIMD units then it can be reopened.

preliminary indications are that absolutely nobody does this.  which, in terms
of gcc support, makes it far more work to support such an architectural
augmentation rather than less.

ARM NEON does the exact same trick that has been described in the SV spec:

https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Application-Level-Memory-Model/Endian-support/Endianness-in-Advanced-SIMD?lang=en

note how the diagram works.

* BE memory load performs byteswapping.  data is stored in the regfile in LE

* LE memory load is straight.  data is stored in the regfile in LE.

in both cases data is stored in LE, and processed in LE.  this is the sane way
to do it.  preserving the memory order and expecting the ALUs to cope is insane
and costly (and without precedent in the industry and in gcc)
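
the two load cases above can be modelled with a minimal sketch in which memory is a byte string and a register value is a plain integer. `load_element` and its parameters are hypothetical names for this illustration, not the simulator's or the HDL's API.

```python
def load_element(mem, addr, width_bytes, big_endian):
    """Model a load: the byte order is resolved here, at the LD/ST boundary,
    so the value that reaches the regfile is the same in either mode."""
    raw = mem[addr:addr + width_bytes]
    # a BE load reverses the bytes on the way in; an LE load is straight
    return int.from_bytes(raw, "big" if big_endian else "little")


# the same 32-bit value, 0x11223344, as laid out in memory in each mode
BE_MEM = bytes([0x11, 0x22, 0x33, 0x44])
LE_MEM = bytes([0x44, 0x33, 0x22, 0x11])

REG_FROM_BE = load_element(BE_MEM, 0, 4, big_endian=True)
REG_FROM_LE = load_element(LE_MEM, 0, 4, big_endian=False)

if __name__ == "__main__":
    print(hex(REG_FROM_BE), hex(REG_FROM_LE))
    # the "ALU" just adds register values: it never sees the memory order
    print(hex(REG_FROM_BE + 1))
```

once the load has resolved the byte order, nothing downstream of the regfile needs to know which mode the data came from.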

with NEON doing SIMD regs only in LE, and with x86 doing SIMD regs only in LE,
these are precedents that allow exploration of how their support in gcc is
implemented.
