[libre-riscv-dev] GPU design

Wed Dec 5 11:59:38 GMT 2018

On Wed, Dec 5, 2018 at 9:52 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> It might actually make the compiler simpler since we wouldn't have as many
> different kinds of register to allocate. I don't think you'll have any
> context thrashing, unless you have too many processes, which would thrash
> on any processor anyway. Other than the standard int and fp registers, the
> rest could all be caller-saved registers. Most of the actual code that needs
> the upper registers will be calling other code in the same shader, which we
> can optimize by using a different calling convention in the jit-compiled
> code.

 it's making me nervous, as it's a huge change to SV.  it's not
something that can be transparent, it needs to be part of the SV
specification.

 honestly i'd far rather we implemented a processor that had 128 FP
regs and only 64 INT, rather than have (or allow) overlaps, especially
dynamically.

 jacob bachmeyer already alerted me to the god-awful mess that was
intel MMX: the compiler hell (including bugs) that resulted from SIMD
overloading of the FP register file actually had to be "recovered"
from by... providing a totally new replacement (SSE), with its own
separate registers.

> You could think of it as transitioning from a disk with 2 partitions to 1
> partition, the filesystem can now just allocate any block on the whole disk
> rather than being limited to half the disk
> where the disk is the register file, the partitions are the rv-base integer
> and fp register files, and the blocks are allocatable registers that the
> compiler allocates in the register allocator.
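
The partition analogy can be put into numbers with a minimal sketch
that counts how many live values fail to fit in registers under each
scheme. The function name `spills` and the register counts and shader
sizes are made up for illustration; they are not taken from the SV
spec or from kazan.

```python
def spills(int_needed, fp_needed, int_regs, fp_regs, unified):
    """Return how many live values do not fit in registers.

    unified=False: int and fp values each get their own file
                   (the "2 partitions" case).
    unified=True:  any value may use any register
                   (the "1 partition" case).
    """
    if unified:
        return max(0, int_needed + fp_needed - (int_regs + fp_regs))
    return max(0, int_needed - int_regs) + max(0, fp_needed - fp_regs)


# hypothetical shader: few integer values, many fp values
print(spills(10, 100, 64, 64, unified=False))  # 36
print(spills(10, 100, 64, 64, unified=True))   # 0
```

This only models capacity. It says nothing about the compiler and
specification cost of allowing the overlap, which is the concern
raised in the reply.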

 if this was RVV it would be easy: VL is limited by MVL which is a
hard-coded limit, and hwacha showed it's possible to have a 1R1W SRAM
behind a vectorised front-end register-file abstraction.  MVL can i
believe therefore be adjusted dynamically to suit the available amount
of actual SRAM.

 we don't have that luxury: when VL is set, it *absolutely* has to be
set according to what's requested (whereas an RVV system may say, "you
asked for a VL of 2^31, sorry i actually only have 2!" and that's
perfectly fine).
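
 a minimal sketch of the difference, in python. this is a simplified
model, not the wording of either specification: `rvv_setvl`,
`sv_setvl` and `strip_mine` are hypothetical names, and the RVV rule
is reduced to "grant at most MVL elements".

```python
def rvv_setvl(requested, mvl):
    """RVV-style (simplified): the hardware grants at most MVL elements."""
    if mvl < 1:
        raise ValueError("MVL must be at least 1")
    return min(requested, mvl)


def sv_setvl(requested):
    """SV-style as described above: VL is exactly what was asked for."""
    return requested


def strip_mine(n, mvl):
    """Split n elements into the chunks an RVV-style loop would process."""
    chunks = []
    while n > 0:
        vl = rvv_setvl(n, mvl)  # may be smaller than n; the loop copes
        chunks.append(vl)
        n -= vl
    return chunks


print(rvv_setvl(2**31, 2))  # 2
print(sv_setvl(8))          # 8
print(strip_mine(10, 4))    # [4, 4, 2]
```

 the RVV-style loop runs correctly whatever MVL turns out to be, which
is why the backing SRAM can vary. in the SV-style case the requested
value is what the code gets, so the register usage is fixed at the
time the assembly is written.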

 so that VL gets hard-coded into the assembly.  it's a different story
for a JIT compiler (with LLVM) as opposed to say gcc, and that's one
of the reasons why i'm nervous about the shared / partitioning scheme:
gcc and other static compilers are placed at a significant
disadvantage.

> Whether or not we end up adding caching, I really like combining register
> renaming with a scoreboard and reorder buffer,

 it's just... all the other things that are missing (and completely
unclear to me still, as to how they're supposed to be added).  whereas
with TS+ROB, it took a couple of days, but i "get it": i've worked out
how to avoid the CAM power load, worked out how to do SIMD, and
predication, and we get multi-issue for free, *and* renaming, *and*
rollback on exception *and* branch mis-prediction...

 and there's no bypassing needed *and* there's no special forwarding
needed, as that's all automatic and inherent within the TS+ROB
algorithm.

> since we could split the
> register file so each alu writes to only one portion of the register file
> and we could allocate each new register from the portion associated with
> the alu creating the value. this would allow us to greatly reduce the
> number of write ports required for each register file portion.
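
One way to picture the scheme in the quoted paragraph is a rename
table with one free list per ALU, sketched below. The class name
`BankedRenamer`, the bank sizes, and the "return None to stall" policy
are assumptions made for illustration, and freeing registers on commit
is left out.

```python
class BankedRenamer:
    """Rename table where each ALU owns one slice of the physical registers."""

    def __init__(self, num_alus, regs_per_bank):
        self.regs_per_bank = regs_per_bank
        # One free list per ALU. Bank a holds physical registers
        # a*regs_per_bank .. (a+1)*regs_per_bank - 1.
        self.free = [
            list(range(a * regs_per_bank, (a + 1) * regs_per_bank))
            for a in range(num_alus)
        ]
        self.rename_map = {}  # architectural reg -> physical reg

    def bank_of(self, phys):
        """Return which bank (and so which ALU) owns a physical register."""
        return phys // self.regs_per_bank

    def allocate(self, arch_reg, alu):
        """Give arch_reg a new physical register from alu's own bank.

        Returns None when that bank is full (issue would have to stall).
        """
        if not self.free[alu]:
            return None
        phys = self.free[alu].pop(0)
        self.rename_map[arch_reg] = phys
        return phys


r = BankedRenamer(num_alus=4, regs_per_bank=32)
p = r.allocate(arch_reg=5, alu=2)
print(p, r.bank_of(p))  # 64 2
```

Because a destination is always taken from the bank of the ALU that
produces it, each bank only ever sees writes from one ALU, which is
where the saving in write ports would come from.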

 there's probably a delay, here: i think you may be catching up with
some of the messages/thoughts.  see:
 https://libre-riscv.org/3d_gpu/rob_nocam_reglanes.jpg

 in this one i combined in the "striping", to reduce the number of ports.
that scheme can still write 4 sequentially-numbered 32-bit values
(r1-MSW/r1-LSW/r2-MSW/r2-LSW) per clock cycle, and also _read_ 8
sequentially-numbered 32-bit values at the same time.

> I guess it's
> similar to tomasulo's algorithm except that the part associated with the
> alu stores the results instead of the inputs.

 ... i _think_ i get you, here :)

> >
> >  in those loops you referred to, how many 32-bit values are there, and
> > how many times are they referenced (as registers) more than once, in
> > between LD and ST?  an answer to that question will give us a clear
> > idea of how large register caches would need to be.
> >
>
> I don't recall which loops I was referring to, but I'm designing kazan so
> the entire shader is 1 iteration of the inner loop.

 ah cool.  ok, so that one, not plural :)

l.