[libre-riscv-dev] 3d gpu microarchitectural requirements review

Luke Kenneth Casson Leighton lkcl at lkcl.net
Tue Nov 6 08:42:58 GMT 2018


https://libre-riscv.org/3d_gpu/microarchitecture/

i'd like to capture the requirements for the GPU microarchitecture, to
make sure that it's capable of hitting the target of 5GFLOPs,
150MPix/sec, 30MTriangles/sec, within a sub 1W power budget @ 28nm for
the GPU side, and 2W for the full GPGPU (OS+GPU) side.  VPU workload
can also push out to 2.5 watts for 1080p60 MPEG/MP4 decode.  these are
quite modest requirements, both computationally and power-wise.
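
as a rough sanity check on those targets, here is a small sketch that
derives the per-pixel and per-triangle budgets they imply.  the only
inputs are the figures quoted above plus the 1080p60 frame geometry;
the variable names are purely illustrative, and comparing the decode
pixel rate against the GPU pixel rate is just a scale check, not a
claim that the two workloads are equivalent.

```python
# Target figures taken directly from the requirements above.
GFLOPS_TARGET = 5_000_000_000      # 5 GFLOPs
PIXELS_PER_SEC = 150_000_000       # 150 MPix/sec
TRIANGLES_PER_SEC = 30_000_000     # 30 MTriangles/sec

# Budget implied by the ratios of the targets.
flops_per_pixel = GFLOPS_TARGET / PIXELS_PER_SEC
pixels_per_triangle = PIXELS_PER_SEC / TRIANGLES_PER_SEC

# Raw pixel rate of a 1080p60 stream (1920 x 1080 at 60 frames/sec).
px_1080p60 = 1920 * 1080 * 60

print(f"FLOPs per pixel:      {flops_per_pixel:.1f}")
print(f"pixels per triangle:  {pixels_per_triangle:.1f}")
print(f"1080p60 pixel rate:   {px_1080p60 / 1e6:.1f} MPix/sec")
```

so the targets work out to roughly 33 FLOPs per pixel and 5 pixels per
triangle, and a 1080p60 stream is about 124 MPix/sec, which sits under
the 150MPix/sec figure.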

whilst i would really really like to experiment with a 4-wide L1/L2
style SMP cache coherent register file on the front of 4 standard
scalar ALUs per core instead of a 4-wide SIMD per core, i think that
may be a bit too much to go for in a first revision, and it's
well-known that you get traffic "concertina" effects in the data
throughput in large vector processors:
https://en.wikipedia.org/wiki/Accordion_effect

so i like the idea of having straight e.g. 4-wide SIMD ALUs with
predication, and in particular (jacob) i liked the idea of having
1Rx1W 4-lane SRAMs for the upper-numbered registers, which you
suggested a few weeks ago.

i would like to consider a form of register "renaming", at either
32-bit or even 16-bit granularity, so that e.g. if we pick 32-bit as
the regfile granularity, an operation on e.g. scalar x30 is (when
elwidth = default) actually passing "x30:0 & x30:1" to the ALU.

it's a simple form of register-renaming, and i believe it will help
when it comes to routing data around for non-default elwidths,
particularly when it comes to the "remap" phase.
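
a minimal sketch of that mapping, to make the idea concrete.  it
assumes a 64-bit default element width (inferred from x30 splitting
into two 32-bit halves), and the function names subregs and fmt are
made up for illustration; nothing here is a decided design.

```python
XLEN = 64                 # assumed default element width (inferred, see above)
REGFILE_GRANULARITY = 32  # bits per regfile slot ("if we pick 32-bit")

def subregs(regnum, elwidth=XLEN, granularity=REGFILE_GRANULARITY):
    """Return the (regnum, slot) pairs that one element of the given
    width occupies, starting at slot 0 of the register."""
    nslots = max(1, elwidth // granularity)
    return [(regnum, slot) for slot in range(nslots)]

def fmt(slots):
    """Format slots in the "xN:slot" notation used in the text."""
    return " & ".join(f"x{r}:{s}" for r, s in slots)

print(fmt(subregs(30)))                   # default elwidth, 32-bit slots
print(fmt(subregs(30, elwidth=32)))       # a 32-bit element needs one slot
print(fmt(subregs(30, granularity=16)))   # 16-bit slots: four per register
```

with 32-bit granularity the default-width case prints "x30:0 & x30:1",
matching the example above; dropping to 16-bit granularity gives four
slots per 64-bit register.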

one area where this is definitely relevant is those 4x3 x 4 arrays of
vectors / matrices, where the SHAPE CSRs will definitely result in
lane-crossing.  how is that to be implemented efficiently? i honestly
have no idea!
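
to illustrate why lane-crossing shows up, here is a small sketch.  it
is NOT the SHAPE CSR definition: it uses a plain matrix transpose as a
stand-in for a remap, and it assumes elements are striped across the 4
lanes as (index mod 4).  both of those are assumptions made only for
the demonstration, as are the function names.

```python
LANES = 4  # 4-wide SIMD, as discussed above

def lane_of(index, lanes=LANES):
    """Assumed striping: element i lives in lane i mod lanes."""
    return index % lanes

def transpose_remap(rows, cols):
    """Yield (dst, src) linear indices for transposing a row-major
    rows x cols matrix (a stand-in for a SHAPE-style remap)."""
    for r in range(rows):
        for c in range(cols):
            yield c * rows + r, r * cols + c

def lane_crossings(rows, cols, lanes=LANES):
    """List the (src, dst) moves whose source and destination lanes differ."""
    return [(src, dst) for dst, src in transpose_remap(rows, cols)
            if lane_of(src, lanes) != lane_of(dst, lanes)]

crossings = lane_crossings(4, 3)
print(f"{len(crossings)} of {4 * 3} elements change lane")
for src, dst in crossings:
    print(f"  element {src} (lane {lane_of(src)}) -> "
          f"slot {dst} (lane {lane_of(dst)})")
```

under those assumptions a 4x3 transpose moves 8 of its 12 elements to
a different lane, so even a simple remap needs some form of crossbar or
multi-cycle shuffle between lanes.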

lots to discuss.

l.


