[libre-riscv-dev] Request for input and technical expertise for Systèmes Libres Amazon Alexa IOT Pitch 10-JUN-2020

Mon Jun 8 12:13:48 BST 2020

On Mon, Jun 8, 2020 at 11:17 AM Staf Verhaegen <staf at fibraservi.eu> wrote:

> Given that you currently are doing power instruction set I expect you
> will have more power consumption in decoding the power instruction set,
> so even if the GPU part would cause less overhead would you still be
> better than ARM + MALI or RISC-V + some GPU ?

the architecture of a CPU + GPU arrangement is as follows:

* CPU L1 I-Cache, D-Cache
* GPU L1 I-Cache, D-Cache
* inter-connect on-die Shared Memory Bus between CPU and GPU.

therefore, to perform even the simplest GPU-based operation - even
just one instruction - the following must be done:

* CPU must run a compiler that turns (Vulkan?) commands into GPU instructions
* CPU must "package" GPU instructions into a serial buffer.
* CPU must send the serial buffer through the CPU L1 D-Cache, onto the
Shared Memory Bus, into the GPU L1 I-Cache
* GPU must execute the instructions.
* GPU must respond indicating completion of the commands, back through
the L1 Caches, over the Shared Memory Bus.

in a hybrid architecture, the second through fifth steps above are
replaced with *DIRECT* execution of the GPU instruction.
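
as a rough illustration of the comparison above, here is a minimal
python sketch that lists the steps of each arrangement and counts how
many of them cross the Shared Memory Bus.  the step lists and the
True/False "crosses the bus" flags are my own reading of the text, not
measurements, and `bus_crossings` is a hypothetical helper name.

```python
# toy model only: each entry is (description, crosses_shared_memory_bus).
# the flags are an assumption based on the step descriptions above.
SPLIT_STEPS = [
    ("compile command into GPU instructions", False),
    ("package GPU instructions into a serial buffer", False),
    ("send buffer via CPU L1 D-Cache and bus to GPU L1 I-Cache", True),
    ("GPU executes the instructions", False),
    ("GPU signals completion back over the bus", True),
]

# in the hybrid case the last four steps collapse into direct execution.
HYBRID_STEPS = [
    ("compile command into GPU instructions", False),
    ("execute the GPU instruction directly", False),
]


def bus_crossings(steps):
    """count the steps that have to traverse the shared memory bus."""
    return sum(1 for _, crosses in steps if crosses)


for name, steps in (("split", SPLIT_STEPS), ("hybrid", HYBRID_STEPS)):
    print(name, "steps:", len(steps), "bus crossings:", bus_crossings(steps))
```

this says nothing about absolute power, only that the split
arrangement has bus traffic that the hybrid one simply does not have.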

it should be clear that a complex detailed gate-level power analysis
of this scenario is not needed, because even at a high level it should
be obvious that there is significant power consumption just by having
this type of dual-processor architecture compared to a hybrid one.

where that does *NOT* work is if design mistakes are made in the
hybrid CPU-GPU architecture such as having insufficient registers.

Jeff Bush's Nyuzi work showed very clearly that the highest power
consumption of a GPU is getting the data through the L1 cache barrier.
GPU workloads are almost exclusively:

* LOAD
* compute
* STORE

with negligible or zero cross-references between each cycle.  compare
this with general-purpose CPU workloads:

* LOAD
* partial compute
* STORE
* LOAD partial results (from multiple *different* previously
partially-computed prior sequences)
* compute
* STORE

i vaguely recall someone saying that back-referencing of prior data,
in general-purpose CPU workloads, is somewhere around 30%.  in GPU
workloads, due to the parallelism, this is **ZERO** for the majority
of data (hundreds of megabytes per second).

therefore if *any* of the GPU data has to go back through the L1 Cache
(because there are not enough registers), it can result in a whopping
50% increase in power consumption when compared to a design that does
have enough registers to keep the computation from spilling over.
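
to make the spilling argument concrete, here is a minimal python
sketch that counts L1 D-Cache accesses for a LOAD / compute / STORE
stream, with and without register spills.  it counts accesses only,
not power, and does not attempt to reproduce the 50% figure above.
the function names and the assumption that each spill costs exactly
one extra STORE plus one extra LOAD are mine.

```python
def l1_accesses(elements, spills_per_element=0):
    """count L1 D-Cache accesses for a LOAD / compute / STORE stream.

    assumption: each element costs one LOAD and one STORE, and each
    spill adds one extra STORE (spill) plus one extra LOAD (reload).
    """
    base = 2 * elements
    spill = 2 * spills_per_element * elements
    return base + spill


def extra_traffic(elements, spills_per_element):
    """fractional increase in L1 accesses caused by spilling."""
    base = l1_accesses(elements)
    return (l1_accesses(elements, spills_per_element) - base) / base


# hypothetical numbers: 1000 elements, one spill every fourth element.
print(l1_accesses(1000))          # no spills
print(l1_accesses(1000, 0.25))    # with spills
print(extra_traffic(1000, 0.25))  # fractional increase
```

the point is only that every spill adds traffic through exactly the
L1 cache barrier that dominates GPU power, on data that would
otherwise never be touched again.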

again: this does not need gate-level analysis, it's just an
architectural design fact.

another critical aspect of what we are doing is that the CPU and
GPU - being now one and the same - effectively share the exact same
I-Cache.

normally in a split CPU-GPU architecture, CPU execution would be
unaffected by GPU workload execution, because they use totally
*different* I-Caches.  because they are different, the designers of
the GPU ISA completely ignore instruction compactness, often using
VLIW (128-bit instructions) or 64-bit instructions, on the basis that
each of these instructions is executing 4 parallel tracks (4x FP
operations), and that therefore the high bang-per-buck ratio easily
justifies the large instruction size.

with a shared I-Cache, executing GPU instructions requires
context-switching the CPU instructions out.  such massively-large
instructions therefore become a serious problem for a hybrid
architecture.

therefore, to that end, we effectively designed Simple-V to be a
"hardware compression algorithm" - to compress a *Vectorised* ISA at
the *hardware* level.

instead of 64 or 128 bit Vector instructions, we have 32-bit
Vector-capable instructions and in some cases even 16-bit
Vector-capable instructions.
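
a back-of-envelope python sketch of the I-Cache footprint at each
instruction width.  the kernel size of 256 instructions is a made-up
number, and the sketch assumes the same instruction count at every
width, which will not hold exactly in practice (any Vector setup or
prefix overhead is ignored).  `icache_bytes` is a hypothetical helper.

```python
def icache_bytes(num_instructions, bits_per_instruction):
    """bytes of I-Cache occupied by a run of fixed-width instructions."""
    return num_instructions * bits_per_instruction // 8


# hypothetical inner loop of 256 Vector instructions.
KERNEL_INSTRUCTIONS = 256

for bits in (128, 64, 32, 16):
    size = icache_bytes(KERNEL_INSTRUCTIONS, bits)
    print(f"{bits:3d}-bit instructions: {size} bytes of I-Cache")
```

under those assumptions the footprint scales linearly with width, so
going from 128-bit to 32-bit instructions leaves the CPU's own
instructions four times more room in the shared I-Cache.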

this *drastically* reduces I-cache utilisation, which is a major
factor in power consumption due to context-switching.

all of these things are at the architectural level.  we are not doing
anything fancy at the gate level.  it is a matter of making
*architectural* decisions that reduce power consumption, operating
within the exact same gate-level power consumption constraints as
every other ASIC out there.

how exactly this would be communicated - in summary form - to
Executives, i have no idea.

> I don't understand the BOM advantage when CPU and GPU are integrated in one chip, e.g. ARM Cortex + MALI GPU.

i also commented that this is incorrect.

l.