[libre-riscv-dev] Vulkanizing

Wed Feb 19 18:45:33 GMT 2020

On Wed, Feb 19, 2020 at 5:16 PM Doug Moen <doug at moens.org> wrote:
>
> The web site "libre-riscv.org/3d_gpu", first line of text, says:
>     RISC-V 3D GPU / CPU / VPU

thanks, i removed that.

> But I understand you are now using the POWER architecture?

not exactly: just the POWER ISA (ok ok and the additional pieces which
go with that)

> This is not a conventional GPU architecture, so it will have different performance characteristics from a GPU, and this is what I would like to understand.

yes.  we do get it.

> A conventional GPU uses the SIMT architecture (single instruction, multiple threads). Discrete desktop GPUs from AMD and Nvidia supports thousands of hardware threads running simultaneously, each with their own register set. An AMD Ryzen 4000 APU (CPU and GPU in the same package) supports between 320 and 512 hardware threads.

leaving the numbers aside: you're describing "single instruction,
multiple data" but gone mad.  it's been recognised in the industry -
thanks to the billions spent - that SIMT is unmanageable at the
software level.  Mitch Alsup was only a consultant on the Samsung GPU
project, and his warnings not to implement SIMT were not heeded.

> Your project uses a conventional CPU architecture. There are multiple cores, and multiple SMT threads within each core.

no, we are not going hyperthreaded (if that's what you mean).  our
design is more like MALI 400, Broadcom VideoCore IV and so on, i.e. it
is more an "embedded" GPU.. or...
CPU-with-full-general-purpose-instructions-plus-some-instructions-you'd-normally-only-find-on-a-GPU.

> There are way more transistors dedicated to each hardware thread, and therefore there are way fewer hardware threads available. It's a tradeoff. Each hardware thread supports a more general model of computation than the threads in a GPU, which makes it more versatile and easier to program, but you lose a lot of parallelism compared to a GPU.

except... because the CPU *is* the GPU, we have one *less* set of
cores to worry about, one *less* entire set of L1 caches, an entire
memory-to-memory architecture gone from the complexity, and a massive
swath of insanely complex "userspace-kernelspace-gpuspace-and-back"
inter-process communication wiped off the map.

> Because you have specialized GPU instructions, this will be faster than a software GPU (llvmpipe) running on a POWER cpu, but it will still be significantly slower than a conventional GPU for conventional GPU workloads, due to having 10x or 50x fewer hardware threads.

i did the math, and it turns out that numbers-wise and
power-performance-wise we are on-par.

therefore if we were to drop our design on top of OpenPITON, crank up
the internal bus bandwidth a notch or five and dial the multi-issue
execution up to eleven, i see no reason why we shouldn't give AMDGPU
and NVIDIA a damn good kicking.

what *does* concern me - because it is usually the largest part of a
GPU - is the FP Unit accuracy requirements that come from IEEE754
conformance.  where "competitor" products can get away with a few bits
of inaccuracy (because for 3D scenarios it really doesn't matter), we
may end up being penalised, there.

Mitch Alsup warned us that for some FP functions the hardware can
grow FOURFOLD simply from tightening a ULP error greater than 1.0 up
to IEEE754 "total accuracy" requirements, when compared to the
state-of-the-art trigonometric units used in "less accurate" 3D, for
example.
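
as a rough illustration of what "ULP error" means here, the sketch
below measures how far an approximation sits from a reference value,
in units-in-the-last-place.  everything in it is an assumption made
for illustration: the helper names (ulp_error, sin_taylor) are
hypothetical, the truncated Taylor series merely stands in for a
"cheap" hardware unit, and Python's math.sin in double precision is
treated as the reference.  it is not how any real GPU or this
project's FP unit computes SIN.

```python
import math


def ulp_error(approx: float, reference: float) -> float:
    """Distance between approx and reference, in ULPs of the reference."""
    return abs(approx - reference) / math.ulp(reference)


def sin_taylor(x: float, terms: int) -> float:
    """Truncated Taylor series for sin(x) (hypothetical 'cheap' unit)."""
    total = 0.0
    term = x  # first term of the series: x
    for n in range(terms):
        total += term
        # next term: multiply by -x^2 / ((2n+2)(2n+3))
        term *= -x * x / ((2 * n + 2) * (2 * n + 3))
    return total


if __name__ == "__main__":
    x = 0.5
    reference = math.sin(x)  # assumed "accurate enough" reference
    for terms in (3, 5, 10):
        err = ulp_error(sin_taylor(x, terms), reference)
        print(f"{terms:2d} terms: {err:.3g} ULP")
```

fewer terms means less work but a much larger ULP error, which is the
accuracy-versus-hardware trade-off being described.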

there are some tricks we can do if it really matters, but honestly i'm
not looking forward to it, given that it's taken six months to get the
IEEE754 FP unit to the point it is, now.  and we still have to do SIN,
COS, ATAN2, CORDIC etc.

> You are providing a Vulkan driver, which is great, but then people will benchmark this processor using conventional Vulkan apps, which are designed around the strengths and limitations of conventional GPUs. It will be too slow to run AAA video games.

in our first "major" chip, we're aiming for "performance and power
consumption similar to MALI 400".

> You are using the POWER architecture, which is designed for supercomputing and compute-heavy server applications. POWER is not known to be well suited for IOT, mobile or laptops (Apple transitioned the Macintosh from POWER to Intel due to power consumption issues).

yes.  it's *not known* to be well-suited to embedded use.  the key
being "not known".  google Freescale (now NXP) embedded POWER ISA
range of processors.

and if the designers at IBM didn't do their homework on the internal
microarchitecture, then their (major) customer voted correctly with
their wallet.

> So are you primarily targeting desktop computers, and competing primarily with Intel integrated graphics and AMD APUs?

no, absolutely not - not at this first stage.  walk before run.

we're aiming for something that is equivalent to the Allwinner A64, or
the Rockchip RK3288 (if the RK3288 was 64-bit) and only with a 32-bit
DDR3/4 DRAM interface.  see the target specs on the page:

https://libre-riscv.org/3d_gpu/

if you are not familiar with this style of processor, think
"smartphone", "chromebook", "netbook", "tablet", and you'll be in the
right ballpark.

*ONCE* we have hit that milestone, *THEN* we move to thousand-core behemoths.

> So what am I missing?

the "walk before run" phase.

> What use cases is this new processor being designed for?

smartphones, chromebooks, netbooks and tablets.

> What are your performance goals for Vulkan applications,

25fps @ 720p, 5-6 GFLOPs (we might have overcooked the design by a
factor of 4, though), 50-100 million triangles/sec.  which sounds
extremely modest, however it's in a power budget of 2.5 watts.
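
to put those targets in perspective, here is a small back-of-envelope
sketch that derives per-second and per-watt figures from them.  the
function name (derived_targets) is hypothetical, and 720p is assumed
to mean 1280x720; the input numbers are the ones quoted above.

```python
def derived_targets(width, height, fps, gflops, tris_per_sec, watts):
    """Derive simple ratios from the headline performance targets."""
    pixels_per_sec = width * height * fps
    return {
        "pixels_per_sec": pixels_per_sec,
        "gflops_per_watt": gflops / watts,
        "tris_per_frame": tris_per_sec / fps,
        "flops_per_pixel": gflops * 1e9 / pixels_per_sec,
    }


if __name__ == "__main__":
    # low end of the stated range: 5 GFLOPs, 50 million triangles/sec
    low = derived_targets(1280, 720, 25, 5.0, 50e6, 2.5)
    # high end of the stated range: 6 GFLOPs, 100 million triangles/sec
    high = derived_targets(1280, 720, 25, 6.0, 100e6, 2.5)
    for name, result in (("low", low), ("high", high)):
        print(name, result)
```

at the low end this works out to about 23 million pixels/sec and
2 GFLOPs per watt.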

> and how will you achieve them?

by setting "achievable" intermediary milestones of lower performance
and far less complex targets (no special opcodes initially) then
incrementally iterating and experimenting until we achieve the stated
goal.

l.