[libre-riscv-dev] 3d gpu microarchitectural requirements review

Tue Nov 13 03:22:52 GMT 2018

On Tue, Nov 6, 2018 at 12:43 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> https://libre-riscv.org/3d_gpu/microarchitecture/
>
> i'd like to capture the requirements for the GPU microarchitecture, to
> make sure that it's capable of hitting the target of 5GFLOPs,
> 150MPix/sec, 30MTriangles/sec, within a sub 1W power budget @ 28nm for
> the GPU side, and 2W for the full GPGPU (OS+GPU) side.  VPU workload
> can also push out to 2.5 watts for 1080p60 MPEG/MP4 decode.  it's
> quite modest requirements, computationally and power-wise.
>
> whilst i would really really like to experiment with a 4-wide L1/L2
> style SMP cache coherent register file on the front of 4 standard
> scalar ALUs per core instead of a 4-wide SIMD per core, i think that
> may be a bit too much to go for in a first revision, and it's
> well-known that you get traffic "concertina" effects in the data
> throughput in large vector processors:
> https://en.wikipedia.org/wiki/Accordion_effect
>
> so i like the idea of having straight e.g. 4-wide SIMD ALUs with
> predication, and in particular (jacob) i liked the idea of having
> 1Rx1W 4-lane SRAMs for the upper-numbered registers, which you
> suggested a few weeks ago.
>
I think we'll want to go with 2Rx1W SRAMs since, from what I remember, 1Rx1W
SRAMs could cause a slow-down for shorter vectors (16 elements may be long
enough; 4 elements is definitely not).

Probably the best course of action would be to build a generic
implementation where we can adjust the SRAM kind and see whether 1Rx1W has
too large a penalty.
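
A generic model would make the read-port count a parameter. The sketch below
is a toy cycle count, not a measurement. It assumes (none of this is stated
above) that each row of `lanes` elements sits in its own bank, that an
operation reads two source operands, and that a bank with fewer read ports
than operands serves them in passes staggered across banks. The function
name and its parameters are hypothetical.

```python
from math import ceil

def operand_read_cycles(vl, lanes=4, src_operands=2, read_ports=1):
    """Toy model: cycles needed to read all source operands of one vector op.

    Assumptions (hypothetical, for illustration only):
      - each row of `lanes` elements lives in a separate SRAM bank
      - a bank needs ceil(src_operands / read_ports) passes per row
      - passes are staggered, so pass k of row r is issued at cycle r + k
    """
    rows = ceil(vl / lanes)
    passes = ceil(src_operands / read_ports)
    return rows + passes - 1

for vl in (4, 16):
    one_r = operand_read_cycles(vl, read_ports=1)
    two_r = operand_read_cycles(vl, read_ports=2)
    print(f"VL={vl}: 1Rx1W={one_r} cycles, 2Rx1W={two_r} cycles")
```

Under these assumptions a 4-element operation takes 2 cycles with 1Rx1W
against 1 with 2Rx1W, while a 16-element operation takes 5 against 4, so the
relative penalty shrinks as vectors get longer. A real evaluation would need
the actual banking and pipelining scheme.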

>
> i would like to consider a register "renaming", either on 32-bit or
> even down to 16-bit, so that e.g. if we pick 32-bit as the regfile
> granularity, an operation on e.g. scalar x30 is (when elwidth =
> default) actually passing "x30:0 & x30:1" to the ALU.
>
I don't know if that will help much here, but we can try it.
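
To make the quoted "x30:0 & x30:1" example concrete, here is a minimal
sketch of mapping an element to regfile granules. It assumes a 64-bit
register width, that the default elwidth equals that width, and that
elements of narrower widths are packed contiguously starting at the base
register. The function `element_slots` and its parameters are hypothetical
names for illustration only.

```python
def element_slots(base_reg, elem_index, elwidth_bits=64,
                  granule_bits=32, xlen=64):
    """Return the (register, granule) pairs touched by one element.

    Assumes elements are packed back to back starting at bit 0 of
    `base_reg`, and that the regfile is addressed in `granule_bits` units.
    """
    granules_per_reg = xlen // granule_bits
    start_bit = elem_index * elwidth_bits
    end_bit = start_bit + elwidth_bits - 1
    first = start_bit // granule_bits
    last = end_bit // granule_bits
    return [(base_reg + g // granules_per_reg, g % granules_per_reg)
            for g in range(first, last + 1)]

print(element_slots(30, 0))                    # scalar x30, default elwidth
print(element_slots(30, 3, elwidth_bits=16))   # fourth 16-bit element
print(element_slots(30, 2, elwidth_bits=32))   # third 32-bit element
```

With these assumptions, scalar x30 at the default elwidth maps to granules
(30, 0) and (30, 1), which corresponds to "x30:0 & x30:1" in the quoted text.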

>
> it's a simple form of register-renaming, and i believe it will help
> when it comes to routing data around for non-default elwidths,
> particularly when it comes to the "remap" phase.
>
> one area that this is definitely relevant is those 4x3 x 4 arrays of
> vectors / matrices, where the SHAPE CSRs will definitely result in
> lane-crossing.  how that's to be implemented efficiently? i honestly
> have no idea!
>
I was thinking of having an inter-lane crossbar on each of the ALU inputs.
This should allow for tighter packing, though we may save power by just
running a lane empty and not needing fancy routing to repack 3-wide
vectors. 1-wide and 2-wide vectors should be easy to repack.
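
The trade-off between the two options can be sketched by laying vectors out
across 4 lanes both ways. This is only an illustration: the helper names are
hypothetical, and it counts rows (issue slots) rather than modelling power
or crossbar cost.

```python
def padded_rows(num_vectors, width, lanes=4):
    """One vector per row; lanes beyond `width` run empty (None)."""
    return [[(v, c) if c < width else None for c in range(lanes)]
            for v in range(num_vectors)]

def packed_rows(num_vectors, width, lanes=4):
    """Pack (vector, component) pairs tightly, ignoring row boundaries."""
    elems = [(v, c) for v in range(num_vectors) for c in range(width)]
    rows = []
    for i in range(0, len(elems), lanes):
        row = elems[i:i + lanes]
        rows.append(row + [None] * (lanes - len(row)))
    return rows

def straddling_vectors(rows):
    """Count vectors whose components end up split across rows."""
    seen = {}
    for r, row in enumerate(rows):
        for slot in row:
            if slot is not None:
                seen.setdefault(slot[0], set()).add(r)
    return sum(1 for rs in seen.values() if len(rs) > 1)

for width in (1, 2, 3):
    packed = packed_rows(4, width)
    print(width, len(padded_rows(4, width)), len(packed),
          straddling_vectors(packed))
```

For four 3-wide vectors, padding takes 4 rows and tight packing takes 3, but
two of the four vectors then straddle a row boundary, which is where the
crossbar would be needed. For 1-wide and 2-wide vectors no vector straddles
a row in this layout.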

For the hardware design, do you want it to be written in VHDL or Verilog,
or should we use something more like Chisel, a higher-level language that
generates lower-level Verilog?

Jacob