[libre-riscv-dev] Introduction and Questions

Jeremy Singher thejsingher at gmail.com
Fri May 15 21:50:23 BST 2020


> a teeny tiny bit over what we initially planned, then :)

Thanks Luke, these numbers are closer to what seems reasonable to me.
A simple superscalar core with 2 FPU functional units should be able
to achieve 4 GFLOPS, so a vector unit would need to go higher than
that to be worth the extra silicon.
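The 4 GFLOPS figure is easy to sanity-check. A rough sketch of the arithmetic, assuming (my numbers, not from the thread) a 1 GHz clock and each FPU retiring one fused multiply-add per cycle, with an FMA counted as 2 FLOPs (one MUL plus one ADD):

```python
# Back-of-the-envelope peak-FLOPS estimate. The assumptions here
# (1 GHz clock, 1 FMA per FPU per cycle, FMA = 2 FLOPs) are mine,
# not a measured figure for any real core.
def peak_gflops(num_fpus, clock_ghz, flops_per_fma=2):
    # peak rate = functional units * FLOPs per op * cycles per ns
    return num_fpus * flops_per_fma * clock_ghz

print(peak_gflops(2, 1.0))   # 2 FPUs at 1 GHz -> 4.0 GFLOPS
```

Any vector unit has to beat that number to justify its area.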

> by overloading "multi-issue".  a hardware for-loop at the instruction
> execution phase will push multiple sequential instructions into the
> OoO engine from a *single* instruction.

I see, so a form of micro-op fission: essentially you are decoding
complex vector (or "GPU") instructions into a sequential stream of
simple micro-ops? This seems like a pretty reasonable design point.
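To make sure I understand the scheme, here is a toy sketch of that fission step. The `fission` helper and its register-stepping convention are my invention for illustration, not the project's actual decoder:

```python
# Illustrative sketch of the "hardware for-loop": one vector
# instruction is expanded into a sequential stream of scalar
# micro-ops, which the OoO engine then schedules like any other
# instructions. Names and operand layout here are hypothetical.
from dataclasses import dataclass

@dataclass
class ScalarUop:
    opcode: str
    rd: int    # destination register
    rs1: int   # source register 1
    rs2: int   # source register 2

def fission(opcode, rd, rs1, rs2, vlen):
    """Expand one vector op over `vlen` elements, stepping each
    register operand by one per element."""
    return [ScalarUop(opcode, rd + i, rs1 + i, rs2 + i)
            for i in range(vlen)]

for uop in fission("fmadd", rd=8, rs1=16, rs2=24, vlen=4):
    print(uop)   # four sequential fmadd micro-ops
```

The nice property being that the back end never needs to know the micro-ops came from a single vector instruction.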

> the DIV pipeline will be... 8 (maybe more), MUL will be 2, ADD (etc.) will be 1.
> * fetch: 1.
> * decode: 1.
> * issue and reg-read: 1 (minimum).
> * execute: between 1 and 8 (maybe more)
> * write: 1 (minimum)

So this is just for the preliminary tapeout? I would expect an OOO
core targeting 1GHz+ to have more pipeline stages than this (10+). The
A72, Skylake, Zen, and BOOM all seem to target 10+ stages.
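Just summing the stage counts quoted above (the per-stage values are from the thread; the best/worst-case split is my reading of them):

```python
# Pipeline depth from the stage counts given in the thread.
# "execute" varies from 1 (ADD) up to 8+ (DIV).
stages = {"fetch": 1, "decode": 1, "issue/reg-read": 1,
          "execute": (1, 8), "write": 1}

min_depth = sum(v if isinstance(v, int) else v[0] for v in stages.values())
max_depth = sum(v if isinstance(v, int) else v[1] for v in stages.values())
print(min_depth, max_depth)   # 5 stages best case, 12 for a DIV
```

So only the long-latency ops get anywhere near the depth of the cores above.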

> basically this is why we're doing OoO because coordinating all that in
> an in-order system would be absolute hell.

I'm a bit confused now. In-order execution with variable-latency
functional units can be achieved with a scoreboard, and the pipeline
stages you describe remind me of an in-order core with OOO write-back.
Is this what you are doing? Or are you pursuing true OOO execution
with register-renaming?
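For clarity, this is the kind of scoreboard I mean: instructions issue in order and may complete out of order, stalling at issue on a hazard instead of renaming around it. A minimal sketch (purely illustrative, not anyone's actual design):

```python
# Minimal scoreboard sketch: in-order issue, out-of-order completion,
# no register renaming. An instruction stalls at issue if a source
# register has a write in flight (RAW) or its destination does (WAW).
class Scoreboard:
    def __init__(self):
        self.pending = set()   # registers with an outstanding write

    def can_issue(self, srcs, dest):
        raw = bool(set(srcs) & self.pending)   # source not ready yet
        waw = dest in self.pending             # dest write still in flight
        return not raw and not waw

    def issue(self, srcs, dest):
        assert self.can_issue(srcs, dest)
        self.pending.add(dest)

    def writeback(self, dest):
        self.pending.discard(dest)

sb = Scoreboard()
sb.issue(srcs=[1, 2], dest=3)       # long-latency DIV, r3 now pending
print(sb.can_issue([3, 4], 5))      # False: RAW hazard on r3
print(sb.can_issue([1, 2], 6))      # True: independent op may proceed
sb.writeback(3)
print(sb.can_issue([3, 4], 5))      # True once r3 retires
```

With renaming, the WAW check would disappear entirely, which is the distinction I'm asking about.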

> it's down ultimately to how wide the register paths are, between the
> Regfile and the Function Units.  if we have multiple Register Data
> Buses, and can implement the multi-issue part in time, it'll be coded
> in a general-purpose enough way that we could consider 4-issue.
> however... without the register data paths, there's no point.

Fair enough. Balancing the throughput of all the components in an OOO
core is tricky, and register read can certainly be a bottleneck.
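The port-count arithmetic makes the bottleneck concrete. Assuming (my simplification) a typical 3-operand op needs 2 register reads and 1 write:

```python
# Rough register-file port budget. Assumes each instruction needs
# 2 read ports and 1 write port, the common case for 3-operand ops;
# extra Register Data Buses or banking would relax this.
def ports_needed(issue_width, reads_per_op=2, writes_per_op=1):
    return issue_width * reads_per_op, issue_width * writes_per_op

print(ports_needed(2))   # (4, 2) read/write ports to sustain 2-issue
print(ports_needed(4))   # (8, 4) -- why 4-issue needs the wide paths
```

Which is exactly why "without the register data paths, there's no point."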

> We had some unfortunate run-ins with Berkeley and RISCV - and we decided to go with POWER because we have access to some important relations at IBM

Have you looked at the open RISC-V micro-architectures though? Even if
this core runs POWER, it should not be too difficult to port design
ideas from existing cores, to avoid reinventing the wheel everywhere.



On Fri, May 15, 2020 at 1:12 PM Luke Kenneth Casson Leighton
<lkcl at lkcl.net> wrote:
>
> hi jeremy, welcome.
>
> here you may have noticed, there are two chips: the 180nm test ASIC
> which we have a tight deadline to meet, mid-august we need to freeze
> the design and immediately move to doing layout using Coriolis2.  this
> one we are extremely unlikely to even have FP operations.
>
> beyond that we have a second ASIC to do, the quad-core SMP 800mhz
> dual-issue.  that's where the numbers you saw come from.
>
> On Fri, May 15, 2020 at 8:33 PM Yehowshua <yimmanuel3 at gatech.edu> wrote:
>
> > I know we initially were targeting 5GFLOPS for the GPU on the first tapeout.
>
> this was to be 5-6 GMACs capability for the quad-core SMP 800mhz
> dual-issue (second target).  however... *sheepish*... we kinda
> drastically overshot that on the internal architecture :)  it was to
> be 5-6 GMACS which is of course 10-12 GFLOPS (FP MUL and FP ADD in
> FPMAC counting as 2).
>
> also each core - all four of them - is architected to handle a minimum
> 128-bit-wide LOAD/STORE data path (4x FP32), and there will be 4x FP32
> - FMACs - per clock cycle, because they will be issued as 2x 64-bit
> operations (2x 32-bit FPs in each 64-bit) and it is dual issue.
>
> 800mhz x 2-issue x 2x FP32 per 64-bit x 4 cores = 12.8 GMACs, which
> x2 because FPMUL and FPADD count as 2 ops = 25.6 GFLOPs.
>
> a teeny tiny bit over what we initially planned, then :)
>
> > Basically, the way we’re doing the GPU is to add a vector FPU along with
> > accompanying instructions into the actual CPU.
>
> by overloading "multi-issue".  a hardware for-loop at the instruction
> execution phase will push multiple sequential instructions into the
> OoO engine from a *single* instruction.
>
> > We would then just write the drivers that translate various shaders and drawing
> > commands into vector instructions which can be scheduled on the scoreboard.
>
> rather than having a ridiculous "Remote Procedure Call" system that
> packs up Vulkan / OpenGL commands into a serial marshalled data stream
> from the application, communicated to the kernel, shipped over PCIe or
> other memory bus architecture, unpacked at the GPU, turned into
> instructions, executed, and the results shipped *back* the other way
>
> total insanity.
>
> > Realistically, for our first tapeout (we’re using Google’s shuttle service on 180nm TSMC),
> > we’d do a CPU with no vector instructions (effectively no GPU) for October.
>
> straight POWER9, however still OoO
>
> > Since we’re doing FOSS down to the VLSI design cells (we’re using Tim Ansell’s OpenPDK as an intermediate), we need somebody to do a PLL.
>
> a Professor from LIP6 has offered to do that.  i introduced you to him
> and Tim, yehowshua.
>
> > With a FOSS PLL in place, we can get 300MHZ on the first tapeout. We’re doing a single core for the
> > first tape out, with 6 stages I think…
>
> the DIV pipeline will be... 8 (maybe more), MUL will be 2, ADD (etc.) will be 1.
>
> * fetch: 1.
> * decode: 1.
> * issue and reg-read: 1 (minimum).
> * execute: between 1 and 8 (maybe more)
> * write: 1 (minimum)
>
> POWER9 has update mode for LD/ST so those would take 1 extra.  some
> ALU operations need to write to the Condition Register, that's another
> extra 1 write (possibly).
>
> etc.
>
> basically this is why we're doing OoO because coordinating all that in
> an in-order system would be absolute hell.
>
> > With no PLL, I think we’re limited to 25-50MHz.
>
> Staf mentioned that there's no reason why we should not attempt to
> drive the external CLK line at 100mhz.  it could potentially radiate
> EM like stink, however it might work.  we'll see :)
>
> > Some simple math could give a DMIPS estimate.
>
> it's down ultimately to how wide the register paths are, between the
> Regfile and the Function Units.  if we have multiple Register Data
> Buses, and can implement the multi-issue part in time, it'll be coded
> in a general-purpose enough way that we could consider 4-issue.
> however... without the register data paths, there's no point.
>
> it comes down to how much time we have.
>
> l.
>
> _______________________________________________
> libre-riscv-dev mailing list
> libre-riscv-dev at lists.libre-riscv.org
> http://lists.libre-riscv.org/mailman/listinfo/libre-riscv-dev


