[libre-riscv-dev] GPU design

Tue Dec 4 01:02:55 GMT 2018

On Mon, Dec 3, 2018 at 11:02 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> I created a simple diagram of what I think would work for the ALUs and
> register file for the GPU design. The diagram doesn't include forwarding or
> pipeline registers.
>
> https://salsa.debian.org/Kazan-team/kazan/blob/e4b516e29469e26146e717e0ef4b552efdac694b/docs/ALU%20lanes.svg

 nice. very clear.  thoughts: those would need to be 64 bits wide (in
order to handle up to 64-bit FP and also SIMD), so those muxes (2
each per lane, 4 inputs of 64 bits) are taking in 256 bits each. that's
512 input wires per lane, 4 lanes is 2048 wires, which seems like an
awful lot.  oh, darn: two register files (one int, one FP), so 4096 wires.

 estimated number of gates in a 4-in priority mux: abouuut... 20?  so
it would be somewhere around 80,000 gates for the lane routing.
https://www.electronics-tutorials.ws/combination/comb_2.html
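
The arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the constant names are made up for illustration, and the 20-gates figure is the rough guess from the text, not a measured number. The ~80,000 total appears to come from multiplying the 4096 input wires by 20 (an assumption about how it was derived); counting 20 gates per 1-bit 4-input mux instead gives a smaller figure, shown for comparison.

```python
# Back-of-envelope check of the wire and gate counts discussed above.
# All names are illustrative; GATES_GUESS is the rough "about 20" from the text.
OPERAND_BITS = 64      # width needed for 64-bit FP and SIMD
MUX_INPUTS = 4         # 4-in mux per source operand
MUXES_PER_LANE = 2     # two source-operand muxes per lane
LANES = 4
REGISTER_FILES = 2     # one int, one FP
GATES_GUESS = 20

wires_per_mux = OPERAND_BITS * MUX_INPUTS            # 256
wires_per_lane = wires_per_mux * MUXES_PER_LANE      # 512
wires_one_regfile = wires_per_lane * LANES           # 2048
wires_total = wires_one_regfile * REGISTER_FILES     # 4096

# Reading 1 (assumed source of the ~80,000 figure): 20 gates per input wire.
gates_per_wire_reading = wires_total * GATES_GUESS   # 81920

# Reading 2: 20 gates per 1-bit 4-input mux, one such mux per output bit.
one_bit_muxes = wires_total // MUX_INPUTS            # 1024
gates_per_mux_reading = one_bit_muxes * GATES_GUESS  # 20480

print(wires_total, gates_per_wire_reading, gates_per_mux_reading)
```

Either way the estimate is only as good as the 20-gate guess, so it is mainly useful for comparing options against each other.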

 as we've not done any comparative analysis of other options yet, i
don't know if this is relatively high or around what we'd need
regardless of which option is picked.

 the other alternative is the one mitch alsup suggested (i recorded his
advice on the microarchitecture page): you just lengthen the pipeline
by as many stages as are required to read the source operands.
really really simple.

 now, could we use a hybrid approach? possibly!  we'll find out :)

> I noticed that if we use register renaming, we can allocate the output
> registers of each of the 4 lanes in such a way that the register file can
> be split into 4 parts with each part only being written by its associated
> lane, meaning that we can get away with only a few write ports, 1 for each
> supported instruction latency. I'm planning on supporting single-cycle
> instructions (integer add, sub, xor, etc.), 3-4 cycle instructions (fadd,
> fmul, fmadd, load, etc.) and for longer instructions (fdiv, integer div,
> etc.) just stall the rest of the processor when the instructions finish in
> order to create a free slot to write, though we could add another write
> port if long instructions are too slow.
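
A toy model of the lane-partitioned renaming idea quoted above, as a minimal sketch only. It assumes the 0xC0 hardware registers mentioned later in the message are split into 4 equal banks, which the message does not state; `LaneRenamer`, `bank_of` and the free-list scheme are hypothetical names and choices, not the actual design. It also ignores the initial mapping of the architectural registers.

```python
from collections import deque

NUM_LANES = 4
NUM_PHYS_REGS = 0xC0                     # hardware register count from the message
BANK_SIZE = NUM_PHYS_REGS // NUM_LANES   # assumption: equal split, 48 per lane


def bank_of(phys_reg):
    """Return the bank (and therefore the only writing lane) of a physical register."""
    return phys_reg // BANK_SIZE


class LaneRenamer:
    """Toy rename allocator: a lane only receives destinations from its own bank."""

    def __init__(self):
        # One free list per lane; bank i owns [i * BANK_SIZE, (i + 1) * BANK_SIZE).
        self.free = [
            deque(range(lane * BANK_SIZE, (lane + 1) * BANK_SIZE))
            for lane in range(NUM_LANES)
        ]
        self.rename_map = {}  # architectural register -> physical register

    def allocate(self, lane, arch_reg):
        """Pick a destination for `arch_reg` from `lane`'s bank, or None if it is full."""
        if not self.free[lane]:
            return None  # a real design would have to stall or steer elsewhere
        phys = self.free[lane].popleft()
        self.rename_map[arch_reg] = phys
        return phys

    def release(self, phys_reg):
        """Return a physical register to the free list of the bank it belongs to."""
        self.free[bank_of(phys_reg)].append(phys_reg)


renamer = LaneRenamer()
# One vectorized operation split into 4 per-element operations, one per lane.
dests = [renamer.allocate(lane, arch_reg=5 + lane) for lane in range(NUM_LANES)]
print(dests)
```

Because every destination comes from the issuing lane's own bank, each bank only needs write ports for its own lane, which is the property the quoted paragraph relies on.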

 i'm... not totally enamoured with something that relies on stalling
the entire core to deal with a bottleneck.

> Note that there are 0xC0 hardware registers because we need 0x80 for the
> architecturally visible registers, and the other 0x40 are used for
> renaming. 0x40 spare registers should be enough because that's enough for 4
> 16-cycle instructions issued per clock.
>
> I'm planning on adding additional forwarding to skip the extra cycle needed
> to read/write the register file.
>
> Note that the GPU probably won't be a 4-wide-issue processor, those are
> just the per-element operations generated from single vectorized operations.
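
The register counts quoted above can be sanity-checked as follows. This is a sketch under one assumption: that each in-flight per-element operation holds exactly one renamed destination register until it completes. The variable names are illustrative.

```python
# Sanity check of the hardware register budget quoted above.
TOTAL_HW_REGS = 0xC0   # 192 physical registers
ARCH_REGS = 0x80       # 128 architecturally visible registers
SPARE_REGS = TOTAL_HW_REGS - ARCH_REGS   # registers left over for renaming

OPS_PER_CLOCK = 4      # per-element operations issued per clock
MAX_LATENCY = 16       # cycles an operation may stay in flight

# Assumption: one renamed destination register per in-flight operation.
max_in_flight = OPS_PER_CLOCK * MAX_LATENCY

print(hex(SPARE_REGS), max_in_flight, SPARE_REGS >= max_in_flight)
```

With these numbers the spare pool exactly covers the worst case, so anything longer than 16 cycles at full issue rate would run out of rename registers.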

 with the augmented-tomasulo i'm currently investigating, i also agree
4-wide-issue is probably far too much: it means that on every clock
cycle you need 4 simultaneous instruction-decoders and 4 simultaneous
entries-into-the-reorder-buffer.

 plus, assuming a 100% pipeline fill (unrealistic but ok for
illustrative purposes) you would also need a 4-wide Common Data Bus
(64-bit x 4): there's no point issuing 4 instructions if the
results are bottlenecked.

 not only that: every "listener" - the other ALUs, the load buffer, the
reorder buffer - needs 4-wide inputs, and the CAM entries in the
reorder buffer would also need to be 4-wide triggers.

 although it would be great for a high-performance core, we're doing
mobile, first :)  so, 2-issue would be much more sane.

l.