[libre-riscv-dev] 3d gpu microarchitectural requirements review

Tue Nov 13 10:37:21 GMT 2018


On Tue, Nov 13, 2018 at 3:23 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> On Tue, Nov 6, 2018 at 12:43 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:

> > so i like the idea of having straight e.g. 4-wide SIMD ALUs with
> > predication, and in particular (jacob) i liked the idea of having
> > 1Rx1W 4-lane SRAMs for the upper-numbered registers, which you
> > suggested a few weeks ago.
> >
> I think we'll want to go with 2Rx1W SRAMs as (from what I remember) 1Rx1W
> SRAMs could cause a slow-down for shorter vectors (16 elements may be long
> enough, 4 elements is definitely not).

 ok.

> Probably the best course of action would be to build a generic
> implementation where we can adjust the SRAM kind and see if 1Rx1W has too
> large of a penalty.

 good idea.  noted.

> >
> > i would like to consider a register "renaming", either on 32-bit or
> > even down to 16-bit, so that e.g. if we pick 32-bit as the regfile
> > granularity, an operation on e.g. scalar x30 is (when elwidth =
> > default) actually passing "x30:0 & x30:1" to the ALU.
> >
> I don't know if that will help much here, but we can try it.

 it just depends on how important FP16 really is, and whether it's
used a lot.  if it isn't, there's not a whole lot of point breaking
down the regfile to 16-bit chunks, as it means more guide-logic on the
routing, plus extra tag bits.

 the scheme is still relevant though: for 64-bit wide regfiles, i feel
it's definitely going to be important to break down to 32-bit wide,
given that we're definitely going to be doing REMAP on 32-bit
single-precision FP.

 so while at the top level we would have a 64-bit INT/FP regfile, with
128 nominal entries, actually it would be 256 32-bit entries, where
any 64-bit ops would require passing pairs of 32-bit "real" registers
around.

 (see below) when REMAP is needed, particularly for those matrix
multiplies, remap acts on the underlying 32-bit "real" register
numbers.
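
 a minimal python sketch of the renaming described above, assuming the
simplest possible mapping (architectural register n occupies consecutive
"real" sub-registers starting at n times the number of granules per
register).  the function names and the linear mapping are illustrative
assumptions, not a settled design:

```python
REGFILE_BITS = 64      # top-level INT/FP register width (from the thread)
NOMINAL_REGS = 128     # nominal 64-bit entries (from the thread)


def real_regs(reg, op_bits=64, granule_bits=32):
    """Return the "real" sub-register numbers that one operand occupies.

    Assumption: sub-registers of architectural register `reg` are numbered
    consecutively, lowest-order granule first.
    """
    if not 0 <= reg < NOMINAL_REGS:
        raise ValueError("register number out of range")
    per_reg = REGFILE_BITS // granule_bits     # granules per architectural reg
    count = max(1, op_bits // granule_bits)    # granules this operand touches
    base = reg * per_reg
    return [base + i for i in range(count)]


def regs_per_op(op_bits, granule_bits, operands=3):
    """Count sub-registers touched by an op with src1, src2 and dest."""
    return operands * max(1, op_bits // granule_bits)


print(real_regs(30))                  # the "x30:0 & x30:1" pair
print(real_regs(30, op_bits=32))      # a 32-bit op touches one real register
print(regs_per_op(64, 32), regs_per_op(64, 8))
```

 with a 32-bit granule the 128 nominal registers become 256 real ones and
a 64-bit three-operand op touches 6 of them; shrinking the granule
multiplies that count accordingly.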

 it would be insane to take this down to the byte-level, though
(elwidth=8) - a 64-bit add would need 8x "byte" regs for src1, 8x
"byte" regs for src2, and 8x for destination.

> > it's a simple form of register-renaming, and i believe it will help
> > when it comes to routing data around for non-default elwidths,
> > particularly when it comes to the "remap" phase.
> >
> > one area that this is definitely relevant is those 4x3 x 4 arrays of
> > vectors / matrices, where the SHAPE CSRs will definitely result in
> > lane-crossing.  how that's to be implemented efficiently? i honestly
> > have no idea!
> >
> I was thinking of having an inter-lane crossbar on each of the ALU inputs,

 ok, so you know data's in one lane of the regfile (4 lanes?), you
pass the top 5 bits (bits 2-6, as we have 7-bit register
numbers) to the lane regfile selector and the bottom 2 bits (0-1) to
the crossbar.
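
 as a sketch of that split, assuming 4 lanes and 7-bit register numbers
as discussed here (the function name and the row/lane terminology are
illustrative only):

```python
REG_BITS = 7     # 7-bit register numbers (from the thread)
XBAR_BITS = 2    # bits 0-1 select the crossbar input, assuming 4 lanes


def split_regnum(regnum):
    """Split a register number into (regfile row, crossbar lane select)."""
    if not 0 <= regnum < (1 << REG_BITS):
        raise ValueError("register number does not fit in 7 bits")
    row = regnum >> XBAR_BITS                  # bits 2-6: lane regfile selector
    lane = regnum & ((1 << XBAR_BITS) - 1)     # bits 0-1: crossbar select
    return row, lane


print(split_regnum(30))    # (7, 2)
print(split_regnum(127))   # (31, 3)
```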

> this should allow for tighter packing, though we may save power by just
> running a lane empty and not needing fancy routing to repack 3-wide
> vectors. 1-wide and 2-wide vectors should be easy to repack.

 i realised after writing the question, that the REMAP simply changes
the register number(s).  so as long as the elwidth matches the
register file bitwidth, it's a no-brainer: REMAP just renames
registers, and nothing special needs to be done.

 if however the register file bitwidth is say 32 and the elwidth is
16, or the register file bitwidth is 64 and the elwidth is 32, now we
have a bit of a problem, as if the REMAP wants to split one of the
registers in half (or 1/4s) you need a full 4-in, 4-out crossbar to do
it.
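
 a behavioural sketch of what such a crossbar has to do, assuming a
64-bit regfile entry and elwidth=16 (so 4 sub-elements, hence 4-in,
4-out).  this models the data movement only, in plain python; the
function names and the "select" encoding are assumptions for
illustration, not a proposed hardware interface:

```python
def split_elements(word, regfile_bits=64, elwidth=16):
    """Break one regfile entry into its sub-elements, lowest first."""
    n = regfile_bits // elwidth
    mask = (1 << elwidth) - 1
    return [(word >> (i * elwidth)) & mask for i in range(n)]


def crossbar(word, select, regfile_bits=64, elwidth=16):
    """Route sub-elements: output slot i receives input element select[i]."""
    elems = split_elements(word, regfile_bits, elwidth)
    if len(select) != len(elems):
        raise ValueError("need one select entry per sub-element")
    out = 0
    for i, src in enumerate(select):
        out |= elems[src] << (i * elwidth)
    return out


word = 0x0004_0003_0002_0001
print(split_elements(word))                # [1, 2, 3, 4]
print(hex(crossbar(word, [3, 2, 1, 0])))   # sub-elements reversed
```

 when elwidth equals the regfile width there is only one sub-element and
the crossbar degenerates to a wire, which is the "no-brainer" case.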

 i.e. we need xBitManip!  which got me thinking, perhaps at the 8-bit
and 16-bit level, we could *use* xBitManip, transparently, by making
it one of the pipeline stages, or using a microcode engine to issue
multiple instructions to route the bytes/words through xBitManip in
the ALU... oh and *then* drop it into the destination.

 however... all that makes me nervous, and it may just be better to
punt this one to software emulation.

> For the hardware design, do you want it to be written in VHDL or Verilog,
> or should we use something more like Chisel, in that it's a higher-level
> language and it generates a lower-level Verilog version?

 my feeling is, we have a hell of a lot to get done, and it has to be
able to run a full GNU/Linux SMP distro: i really don't think it's a
good idea to start from scratch.  that means picking a base that
is already working and can already run (say) debian, 2-4 core SMP.

 there's not a lot of those out there (that are libre/open), the ones
i know of are the rocket-chip (more specifically, the freedom u540)
and lowRISC (rocket-chip again) and the IIT Madras core.  pulpino has
a different focus (it's not SMP).

IIT Madras are focussed on using bluespec: it's a proprietary license.
they're focussed on other projects: if they were willing to engage
fully i'd be really *really* happy to work with them as they're a
fantastic team.  without a firm commitment from them, we can't
possibly risk developing a *libre* SoC on a proprietary compiler...
without having a major, major focus on writing a libre replacement for
the proprietary bluespec compiler.

i really really like MyHDL: i can actually read it, as it's python.
this does exist: https://github.com/AngelTerrones/Algol - it's only
RV32 (which isn't so bad to upgrade), it's only 1.9-privspec (which is
laborious), i have no idea how to plumb it into TileLink (which would
get us the SMP cache coherency)... it's a *lot* of work just to get to
the jump-start position of being able to run a full GNU/Linux SMP
distro.

realistically, unless anything else can be found, it's looking like
something based on rocket-chip, which means Chisel3.

anyone know of an alternative RV64GC base that's SMP, 4-core, cache
coherent, capable of a minimum 800MHz, is libre-licensed, and can run
a full Debian GNU/Linux distro, right now, in an FPGA?

l.