[Libre-soc-dev] 3D MESA Driver

Hendrik Boom hendrik at topoi.pooq.com
Mon Aug 10 16:28:49 BST 2020


On Sun, Aug 09, 2020 at 04:05:01PM +0530, vivek pandya wrote:
> On Sat, Aug 8, 2020 at 8:34 PM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
> 
> > hi vivek, welcome to libresoc, apologies for the (new) list issues.
> > please when replying (this applies to everyone) always cc vivek on this
> > thread, he is set up "digest" mode.
> >
> > vivek i thought it might be helpful to give a rundown of the libresoc
> > vector system, SimpleV, which is to be "applied" to the POWER9 ISA.
> >
> > the reason is because of your experience and desire to help, your input is
> > valuable as to what instructions actually go into the hardware.
> >
> > normally, as a compiler writer, you would be told by the hardware
> > engineers what was needed: we *do not* want to repeat that.  we therefore
> > would like your active input in the actual hardware, to make it efficient
> > and effective.
> >
> > so this is a really nice opportunity.
> >
> Thanks, Luke, this is very interesting. I am trying to understand.
> 
> >
> >
> >
> > SV is effectively a hardware sub-loop (subcontext) on the standard Program
> > Counter.  it really is no more complex to describe than that.  details
> > however take 7 hours to describe in full (which I did with Alain back in
> > february).
> >
> > the sub-loop which runs from 0 to VL-1 (VL is Vector Length) effectively
> > pauses the PC and issues *multiple* scalar instructions.
> >
> > example:
> >
> > * addi r5, r5, 2 (VL=4)
> >
> > which in POWER9 is "add the immediate 2 to r5" will actually issue:
> >
> > * addi r5, r5, 2
> > * addi r6, r6, 2
> > * addi r7, r7, 2
> > * addi r8, r8, 2
> >
> > and that basically really is all there is to it.  at no time do we have
> > "vector opcodes".  with few exceptions (branch, trap) all *scalar* opcodes
> > *become* vectorised inherently.
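> >
> > as a rough C-level sketch of that sub-loop (purely conceptual: the
> > regs[] array and the function name here are illustrative, not the
> > actual hardware):
> >
> > #include <stdint.h>
> >
> > static uint64_t regs[128];   /* illustrative scalar register file */
> >
> > /* one scalar "addi RT, RA, imm" with VL set: the PC pauses while
> >    the sub-loop issues VL scalar adds on consecutive registers */
> > void sv_addi(int RT, int RA, uint64_t imm, int VL)
> > {
> >     for (int i = 0; i < VL; i++)
> >         regs[RT + i] = regs[RA + i] + imm;
> > }
> >
> > sv_addi(5, 5, 2, 4) then behaves exactly like the "addi r5, r5, 2
> > (VL=4)" example above.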
> >
> > in hardware terms, at the instruction issue phase, what will actually
> > happen is that the issue engine will notice the 4 (or 8 etc) VL loop, and
> > will "batch" these scalar instructions into SIMD groups.
> >
> > however this will be done *automatically* by the hardware, precisely so
> > that you, as a compiler writer, do not have to massively complicate the
> > compiler (see "SIMD considered harmful" article).
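> >
> > a hypothetical sketch of that batching, continuing the sv_addi()
> > sketch above (the 4-wide backend and its name are invented purely to
> > illustrate the idea, not taken from the actual design):
> >
> > /* stand-in for a 4-wide SIMD backend ALU operation */
> > static void simd_add4(uint64_t *dst, const uint64_t *src,
> >                       uint64_t imm, int lanes)
> > {
> >     for (int l = 0; l < lanes; l++)
> >         dst[l] = src[l] + imm;
> > }
> >
> > /* the issue engine notices the VL loop and transparently batches
> >    the scalar element operations into SIMD groups */
> > void issue_sv_addi(int RT, int RA, uint64_t imm, int VL)
> > {
> >     for (int i = 0; i < VL; i += 4) {
> >         int lanes = (VL - i < 4) ? VL - i : 4; /* last group may be partial */
> >         simd_add4(&regs[RT + i], &regs[RA + i], imm, lanes);
> >     }
> > }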
> >
> I am not able to see much performance benefit from this idea, except for
> the following:
> 1) The PC increment will be skipped by doing the loops in hardware.
> 2) Load/store becomes interesting if we have the capability to
> read/write 128 or 256 bits of data in one cycle.
> 
> As I see it, the compiler's task is not much simplified. Register
> allocation needs to be aware of VL, and for that we need to extend the
> concept of liveness to vectors. (For LLVM we might not need to extend it,
> but the top-of-tree PPC backend can't be used as-is, so we need to modify
> it to adapt.)
> I see that the graphics API types (vec2/3/4) can be mapped directly to VL.
> It will also be necessary to modify the vectorization pass to convert
> loops into a VL of appropriate size.

I suppose it will still be possible to divide the screen into tiles, and 
have a separate C(G)PU do the graphics in each tile -- for further 
parallelism.  This would likely be accomplished entirely in software.

Would this involve expensive data transfer between CPUs, which we are 
trying to avoid by merging the CPU with the GPU?

-- hendrik


> 
> 
> > additional details:
> >
> > * SUBVL
> > * Predication (including a novel concept "twin predication")
> > * element width overrides.
> >
> > # elwidth
> >
> > element width overrides are equivalent, if we think of a register file as
> > a byte level SRAM, as being this:
> >
> > typedef union {
> >     uint64_t actual_reg; // scalar
> >     uint8_t  b[];  // conceptual: arrays deliberately overrun
> >     uint16_t h[];  // into the neighbouring registers
> >     uint32_t w[];
> >     uint64_t d[];
> > } regentry;
> >
> > regentry int_regfile[128];
> >
> > conceptually here we are relying on the fact that each "actual_reg" is
> > packed and contiguous: the arrays "overrun" and this helps us to understand
> > conceptually what is really actually going on in the hardware.
> >
> > so for scalar operations (when VL is not used) the hardware will
> > read/write to
> >
> >     int_regfile[RA].actual_reg
> >
> >
> > and when VL is active, the b, h, w or d array is accessed instead
> > (deliberately overrunning into other parts of the regfile), on that for-loop
> > from 0 to VL-1, depending on whether the 64 (or 32 bit) instruction has had
> > the "elwidth" override set to 8, 16, 32 or "default".
> >
> > (default will use the behaviour of the *instruction*.  some instructions
> > in the v3.0B PowerISA manual actually say they are 32 bit rather than 64.
> > or, they take only 32 bits from the operands, more like).
> >
> > so in this way, we do not have to invent vector instructions add8i,
> > add16i, add32i etc. we can simply use "addi" for all of them by setting
> > elwidth
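> >
> > a rough sketch of that, for elwidth=8, using the byte-level SRAM view
> > of the regfile above (again purely conceptual, not the real hardware;
> > the flat byte array is just a compilable stand-in for the overrunning
> > union arrays):
> >
> > #include <stdint.h>
> >
> > /* the integer regfile viewed as byte-level SRAM: 128 x 64-bit */
> > static uint8_t regfile_bytes[128 * 8];
> >
> > /* "addi RT, RA, imm" with elwidth=8: VL byte-sized elements starting
> >    at RA's / RT's bytes, deliberately overrunning into the following
> >    registers once more than 8 elements are accessed */
> > void sv_addi_elw8(int RT, int RA, uint8_t imm, int VL)
> > {
> >     for (int i = 0; i < VL; i++)
> >         regfile_bytes[RT*8 + i] = (uint8_t)(regfile_bytes[RA*8 + i] + imm);
> > }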
> >
> This is very interesting. I would like some examples here.
> So if I say addi R4, R4, 2 VL=4 elwidth=8,
> will it use 8 bits in each of R4, R5, R6, R7? Or will the 4 inputs and
> outputs be packed into the lower 32 bits of R4 (which would require
> pack/unpack)?
> 
> >
> >
> > # predication
> >
> > predication is done through tagging.  one scalar register is "tagged" as
> > being the predicate.  each bit is then fed to the issue engine.  if the
> > predicate was 0b0110 for the above example using addi, then only the r6 and
> > r7 addi instructions would be done.
> >
> > in reality, in the hardware, the 4-wide SIMD backend will receive 4 bit
> > "chunks" of the predicate, and this will enable/disable parts of that SIMD
> > operation.
> >
> > again, there is no need for you as a compiler writer to do that: you set
> > up the predicate (up to 64 bits at a time) and issue a single operation.
> > hardware takes care of the details.
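> >
> > continuing the sv_addi() sketch, predication is conceptually just a
> > per-element gate (bit ordering shown LSB-first here purely for
> > illustration):
> >
> > void sv_addi_pred(int RT, int RA, uint64_t imm, int VL, uint64_t pred)
> > {
> >     for (int i = 0; i < VL; i++)
> >         if ((pred >> i) & 1)           /* bit i gates element i */
> >             regs[RT + i] = regs[RA + i] + imm;
> > }
> >
> > with pred = 0b0110 and "addi r5, r5, 2 (VL=4)", only the r6 and r7
> > additions are performed, as in the example above.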
> >
> >
> > # SUBVL
> >
> > sometimes, especially for vec2/3/4, you want to do loops on vectors of
> > vec2/3/4.  this is what SUBVL is for, and effectively it is a sub-sub-loop
> > on PC, intuitively as might be expected.
> >
> > however one key thing: predicate bits do *not* extend down individually to
> > SUBVL.  they apply to the *whole* vec2/3/4.
> >
> > this saves a lot of bits when setting up predicates.  it would be
> > necessary to do bit level mask manipulation in order to expand 0b0110 into
> > 0000 1111 1111 0000 for an array of vec4 for example and that is costly.
> >
> > Here it would be better to have a concrete example.
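> >
> > a rough sketch of what such an example might look like, reusing the
> > illustrative regs[] model above (not to be taken as the actual spec):
> > with SUBVL=4 (vec4), each predicate bit covers a whole vec4 group, so
> > pred = 0b0110 over four vec4 groups selects all four lanes of groups
> > 1 and 2:
> >
> > void sv_addi_vec4_pred(int RT, int RA, uint64_t imm,
> >                        int VL, uint64_t pred)
> > {
> >     for (int i = 0; i < VL; i++) {       /* loop over vec4 groups */
> >         if (((pred >> i) & 1) == 0)
> >             continue;                    /* one bit skips the whole vec4 */
> >         for (int j = 0; j < 4; j++)      /* SUBVL sub-sub-loop */
> >             regs[RT + i*4 + j] = regs[RA + i*4 + j] + imm;
> >     }
> > }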
> 
> >
> > so that is the basics.  it is sufficient to turn any standard scalar ISA
> > into a vectorised one without actually ever having had to design a boatload
> > of vector instructions.
> >
> > more involved however is things like NORMALISE opcodes, CORDIC, and
> > CROSSPRODUCT.  these very definitely are actual vector instructions and
> > consequently are defined in terms of vec2/3/4 as appropriate.
> >
> Agreed.
> 
> >
> > whilst the preliminary work has started here on these vector ISA
> > operations, your input would be particularly welcome (when we are not under
> > time pressure for the Oct 2020 deadline).
> >
> 
> > regarding scalar IEEE754 opcodes such as SIN, COS and LOG1P: it should
> > be naturally obvious that adding these to the POWER9 ISA means they will
> > automatically also become vectorised through SV.
> >
> > in essence SV is about massively reducing the complexity of the work
> > needed by everyone: binutils, compiler writers, simulators and hardware.
> >
> > l.
> >
> >
> >
> > --
> > ---
> > crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
> >
> >
> _______________________________________________
> Libre-soc-dev mailing list
> Libre-soc-dev at lists.libre-soc.org
> http://lists.libre-soc.org/mailman/listinfo/libre-soc-dev


