[libre-riscv-dev] Introduction and Questions

Fri May 15 21:11:09 BST 2020

hi jeremy, welcome.

as you may have noticed, there are two chips.  the first is the 180nm
test ASIC, which has a tight deadline: by mid-august we need to freeze
the design and immediately move to doing layout using Coriolis2.  this
one is extremely unlikely to even have FP operations.

beyond that we have a second ASIC to do, the quad-core SMP 800mhz
dual-issue.  that's where the numbers you saw come from.

On Fri, May 15, 2020 at 8:33 PM Yehowshua <yimmanuel3 at gatech.edu> wrote:

> I know we initially were targeting 5GFLOPS for the GPU on the first tapeout.

the target was 5-6 GMACs capability for the quad-core SMP 800mhz
dual-issue (second target), which is of course 10-12 GFLOPS (FP MUL
and FP ADD in an FPMAC counting as 2).  however... *sheepish*... we
kinda drastically overshot that on the internal architecture :)

also each core - all four of them - is architected to handle a minimum
128-bit-wide LOAD/STORE data path (4x FP32), and there will be 4x FP32
FMACs per clock cycle per core, because they will be issued as 2x
64-bit operations (2x 32-bit FPs in each 64-bit) and it is dual issue.

800mhz x 2 issue x 2 FP32 per 64-bit x 4 cores = 12.8 GMACs, which x2
(because FPMUL and FPADD count as 2 ops) = 25.6 GFLOPS.

a teeny tiny bit over what we initially planned, then :)
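
as a quick sanity check of that arithmetic, here is a minimal python
sketch.  the constant and function names are purely illustrative (they
are not taken from the actual design), and the result is a theoretical
peak, assuming every issue slot is filled with FMACs on every cycle:

```python
# peak-throughput estimate for the quad-core target, using the numbers above.
# all names here are illustrative assumptions, not part of the real design.
CLOCK_HZ = 800e6        # 800mhz core clock
ISSUE_WIDTH = 2         # dual issue: 2x 64-bit operations per clock
FP32_PER_64BIT_OP = 2   # 2x FP32 packed into each 64-bit operation
CORES = 4               # quad-core SMP
FLOPS_PER_FMAC = 2      # FP MUL + FP ADD counted as 2 ops


def peak_gmacs():
    """theoretical peak FP32 multiply-accumulates per second, in units of 1e9."""
    return CLOCK_HZ * ISSUE_WIDTH * FP32_PER_64BIT_OP * CORES / 1e9


def peak_gflops():
    """same figure, counting each FMAC as two floating-point operations."""
    return peak_gmacs() * FLOPS_PER_FMAC


print("GMACs: ", peak_gmacs())
print("GFLOPS:", peak_gflops())
```

real workloads will land below this, since it ignores LOAD/STORE
bandwidth and anything else competing for issue slots.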

> Basically, the way we’re doing the GPU is the add a vector FPU along with
> accompanying instructions into the actual CPU.

by overloading "multi-issue".  a hardware for-loop at the instruction
execution phase will push multiple sequential instructions into the
OoO engine from a *single* instruction.
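
a rough software model of that hardware for-loop, as a sketch only.
the `Op` record, the `hw_for_loop` name, the `vl` (vector length)
parameter and the idea that each element simply uses the next register
number along are all assumptions made for illustration; the real
decoder and issue logic are not shown here:

```python
from typing import Iterator, NamedTuple


class Op(NamedTuple):
    """a hypothetical scalar operation as seen by the OoO engine."""
    name: str  # e.g. "fmac"
    rd: int    # destination register number
    ra: int    # first source register number
    rb: int    # second source register number


def hw_for_loop(op: Op, vl: int) -> Iterator[Op]:
    """expand ONE instruction into vl sequential element operations.

    assumption: element i uses register numbers offset by i.
    """
    for i in range(vl):
        yield Op(op.name, op.rd + i, op.ra + i, op.rb + i)


# a single "vector" fmac turns into four scalar fmacs for the OoO engine
issued = list(hw_for_loop(Op("fmac", rd=8, ra=16, rb=24), vl=4))
for element_op in issued:
    print(element_op)
```

the point is that the OoO engine only ever sees ordinary element
operations, so no separate vector pipeline is needed.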

> We would then just write the drivers that translate various shaders and drawing
> commands into vector instructions which can be scheduled on the scoreboard.

rather than having a ridiculous "Remote Procedure Call" system that
packs up Vulkan / OpenGL commands into a serial marshalled data stream
from the application, communicated to the kernel, shipped over PCIe or
other memory bus architecture, unpacked at the GPU, turned into
instructions, executed, and the results shipped *back* the other way.

total insanity.

> Realistically, for our first tapeout (we’re using Google’s shuttle service on 180nm TSMC),
> we’d do a CPU with no vector instructions (effectively no GPU) for October.

straight POWER9, however still OoO.

> Since we’re doing FOSS down to the VLSI design cells (we’re using Tim Ansell’s OpenPDK
> as an intermediate), we need somebody to do a PLL.

a Professor from LIP6 has offered to do that.  i introduced you to him
and Tim, yehowshua.

> With a FOSS PLL in place, we can get 300MHz on the first tapeout. We’re doing a single core for the
> first tapeout, with 6 stages I think…

the DIV pipeline will be... 8 (maybe more), MUL will be 2, ADD (etc.) will be 1.

* fetch: 1.
* decode: 1.
* issue and reg-read: 1 (minimum).
* execute: between 1 and 8 (maybe more).
* write: 1 (minimum).

POWER9 has update mode for LD/ST, so those would take 1 extra write.
some ALU operations also need to write to the Condition Register:
that's another 1 extra write (possibly).

etc.
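
to put rough numbers on that, here is a minimal sketch that adds up
the stage counts above.  the names are illustrative, and it assumes
the best case throughout: every "minimum" is met, DIV really is 8,
and each extra write (LD/ST update, Condition Register) costs exactly
one more stage:

```python
# best-case stage counts per operation, from the list above.
# names are illustrative assumptions; DIV may well end up longer than 8.
FRONT_END = {"fetch": 1, "decode": 1, "issue_regread": 1}
EXECUTE = {"add": 1, "mul": 2, "div": 8}
WRITE = 1


def min_stages(op, extra_writes=0):
    """minimum number of stages an operation passes through.

    extra_writes models an LD/ST update write or a Condition Register
    write, assumed here to cost one additional stage each.
    """
    return sum(FRONT_END.values()) + EXECUTE[op] + WRITE + extra_writes


for name in EXECUTE:
    print(name, min_stages(name))
print("add + CR write", min_stages("add", extra_writes=1))
```

with those assumptions a plain ADD comes out at 5 stages, MUL at 6 and
DIV at 12, and an ADD that also writes the Condition Register at 6, so
completion times already vary by more than 2x between operations.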

basically this is why we're doing OoO: coordinating all that in
an in-order system would be absolute hell.

> With no PLL, I think we’re limited to 25-50MHz.

Staf mentioned that there's no reason why we should not attempt to
drive the external CLK line at 100mhz.  it could potentially radiate
EM like stink, however it might work.  we'll see :)

> Some simple math could give a DMIPS estimate.

it's down ultimately to how wide the register paths are, between the
Regfile and the Function Units.  if we have multiple Register Data
Buses, and can implement the multi-issue part in time, it'll be coded
in a general-purpose enough way that we could consider 4-issue.
however... without the register data paths, there's no point.

it comes down to how much time we have.

l.