[libre-riscv-dev] NLNet Funded development of a software/hardware MESA driver for the Libre GPGPU

Luke Kenneth Casson Leighton lkcl at lkcl.net
Tue Jan 14 00:21:00 GMT 2020


On Tuesday, January 14, 2020, Jacob Lifshay <programmerjake at gmail.com>
wrote:

> On Mon, Jan 13, 2020 at 9:39 AM Jason Ekstrand <jason at jlekstrand.net>
> wrote:
> >
> > On Mon, Jan 13, 2020 at 11:27 AM Luke Kenneth Casson Leighton <
> lkcl at lkcl.net> wrote:
> >> jason i'd be interested to hear your thoughts on what jacob wrote, does
> it alleviate your concerns, (we're not designing hardware specifically
> around vec2/3/4, it simply has that capability).
> >
> >
> > Not at all.  If you just want a SW renderer that runs on RISC-V, feel
> free to write one.


as we know, a software-only renderer would have embarrassingly low
performance and would not be commercially viable, so, logically, we can
rule that option out :)

i don't know if you're aware of Jeff Bush's work on Nyuzi? he set out to
duplicate the work of the Intel Larrabee team (a software-only GPU
experiment) in an academic way (i.e. publishing everything, no matter how
"bad").

Jeff sought an answer to the question of why the Larrabee team were not,
ahem, "permitted" to publish GPU benchmarks for their work, despite it
having high-end supercomputer-grade Vector Processing capability.

i spent several months in discussion with him, and i really enjoyed the
conversations.  we established that if you were to deploy a *standard*
general-purpose Vector Processor ISA and engine (Nyuzi, Cray, MMX/SSE/AVX,
RISC-V RVV), with *zero* special custom hardware for 3D (so: no custom
texturing hardware, no custom z-buffers, no special tiled memory or
associated pixel opcodes), the performance/watt you would get would be a
QUARTER of that of current commercial GPUs.

in other words, you either need four times the silicon (and four times the
power consumption) just to be on par with current commercial GPUs, or you
have to try to sell (only if completely delusional) something with 25% of
the performance.

therefore, we have learned from that lesson, and will not be following that
exact route either :)

> If you want to vectorize in hardware and actually get serious performance
> out of it, I highly doubt his plan will work.  That said, I wasn't planning
> to work on it so none of this is my problem so you're welcome to take or
> leave anything I say. :-)


:)


> So, since it may not have been clearly explained before, the GPU we're
> building has masked vectorization like most other GPUs; it's just that
> it additionally supports the masked vectors' elements being 1 to 4
> element subvectors.


further: this is based on RVV (the RISC-V Vector extension), which in turn
is based on the Cray Vector system.
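
to make "subvectors" concrete, here is a rough, purely illustrative sketch
in C (toy code only: not the actual ISA, intrinsics or register layout).
an ordinary masked vector operation works on one scalar per element,
whereas in our case each element may itself be a 1-to-4-component
subvector, with the mask still applying per element:

/* toy illustration only -- not real hardware semantics or intrinsics.
 * ordinary masked vector add: one float per element. */
void masked_add(int vl, const int *mask,
                float *dst, const float *a, const float *b)
{
    for (int i = 0; i < vl; i++)
        if (mask[i])
            dst[i] = a[i] + b[i];
}

/* the same operation when each element is a subvector of `sub`
 * components (1..4, e.g. a vec3 per element): note the mask is
 * still per *element*, not per component. */
void masked_add_subvec(int vl, int sub, const int *mask,
                       float *dst, const float *a, const float *b)
{
    for (int i = 0; i < vl; i++)
        if (mask[i])
            for (int c = 0; c < sub; c++)
                dst[i*sub + c] = a[i*sub + c] + b[i*sub + c];
}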

the plan is to *begin* from that RVV base and, following the strategy
documented in Jeff Bush's 2016 paper, assess performance in terms of
pixels per clock, while also, again following Jeff's work, keeping a
Seriously Close Eye on power consumption.

(we've already added 128 registers, for example, because on GPU workloads,
which are heavily LD-compute-ST on discontiguous memory areas, you
absolutely cannot afford the power penalty of swapping out large numbers of
registers through the L1/L2 cache barrier)
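
(to illustrate the shape of those workloads, here is a hypothetical,
heavily simplified C kernel: each iteration does a discontiguous gather
load, a little arithmetic, and a store. once this is vectorised, with
several batches kept in flight to hide the load latency, every live
temporary below gets multiplied by the number of elements in flight,
which is exactly why a large register file beats spilling through the
caches)

/* hypothetical, heavily simplified LD-compute-ST kernel.  the texel
 * fetch is a gather from discontiguous addresses; all the temporaries
 * want to stay in registers for the whole loop body. */
void shade(int n, const int *texel_idx, const float *texels,
           const float *normal, const float *lightdir, float *out)
{
    for (int i = 0; i < n; i++) {
        float albedo = texels[texel_idx[i]];              /* discontiguous LD */
        float nx = normal[3*i+0], ny = normal[3*i+1], nz = normal[3*i+2];
        float lx = lightdir[3*i+0], ly = lightdir[3*i+1], lz = lightdir[3*i+2];
        float ndotl = nx*lx + ny*ly + nz*lz;              /* compute */
        out[i] = albedo * (ndotl > 0.0f ? ndotl : 0.0f);  /* clamp + ST */
    }
}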

we will be using Jeff's strategies as *iterative* guides to making
improvements, just as he did.  he actually went through seven different
designs (maybe eight, if you include the ChiselGPU triangle raster engine
he wrote).

> If it turns out that using subvectors makes the GPU slower, we can add
> a scalarization pass before the SIMT to vector translation, converting
> everything to using more conventional operations.


yes, exactly.  and that would be one of the kinds of tasks for which the
NLNet funding is available.

so that would be one very good example of something that would be assessed
using Jeff Bush's methodology.
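
for anyone following along who hasn't written compiler passes before: a
scalarisation pass of the kind jacob describes is conceptually very
simple.  the sketch below is a toy illustration over a made-up IR in C
(it is not NIR, not Kazan, and not our actual code; on the Mesa side,
nir_lower_alu_to_scalar already does this job, if i remember correctly):

/* toy sketch only: a made-up IR in which a vec3 add
 *     v2 = fadd.vec3 v0, v1
 * gets rewritten into three scalar adds, one per component. */
enum op { FADD /* , ... */ };

struct insn {
    enum op op;
    int num_components;   /* 1 for scalar, 2..4 for subvectors */
    int dst, src0, src1;  /* virtual register numbers */
};

/* expand one (possibly subvector) instruction into scalar
 * instructions, one per component; returns how many were emitted.
 * in this toy IR, component c of virtual register N lives at N + c. */
int scalarise(const struct insn *in, struct insn *out)
{
    for (int c = 0; c < in->num_components; c++) {
        out[c].op = in->op;
        out[c].num_components = 1;
        out[c].dst  = in->dst  + c;
        out[c].src0 = in->src0 + c;
        out[c].src1 = in->src1 + c;
    }
    return in->num_components;
}

the interesting question, and the one Jeff's pixels/clock methodology
answers, is whether the scalarised form actually runs faster on the
hardware than keeping the subvectors.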

what's nice about this is: it's literally an opportunity for a Software
Engineer working on MESA, instead of saying "damnit, these hardware
engineers really messed up, i feel totally powerless to fix it", to say
"this isn't good enough! i need instruction X to get better performance!",
and instead of being told "sorry, we taped out already, deal with it,
derwood", to hear "okay, great: give us 2 weeks and you can test out a new
instruction X. start writing code to use it!"

i know that there is someone out there who, on reading this, is going to go
"cool! and the actual hardware's libre too, and.. wait... i get money for
this???"

:)

so, jason, i'd like to emphasise once more just how grateful i am that you
raised the issue of subvectors, because now we can put it on the list of
things to watch out for and experiment with.

and, just to be clear: we've already had this iterative approach approved
by NLNet: start from a known-good (highly suboptimal but Vulkan Compliant)
driver, then experiment with designs (hopefully not at the
microarchitectural level), with instructions (a lot), and with changes to
the ISA (hopefully not a lot), in order, over time, to reach commercially
acceptable performance.

and it's entirely libre.  paid...and libre.  who knew _that_ would ever
happen in the GPU world?

l.




-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

