[libre-riscv-dev] register requirements of SimpleV

Tue Oct 9 12:45:13 BST 2018

On Tue, Oct 9, 2018 at 11:20 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
> On Tue, Oct 9, 2018 at 1:59 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:

> >  yowser, ok.  i was vaguely planning to extend to 256 as a future
> > option, however by keeping things within XLEN bits, it means that
> > predication fits into a single (scalar) register as a bitfield.
> > otherwise it would be necessary to treat the predication as a
> > contiguous vector, extending across multiple registers as well.
> >
> 32/64 bits per predication register is sufficient if it is 1 bit per
> "group" where a group might be f32x4 or u8x1.

 yes.  f32x4 is the "packed simd" mode, u8x1 is "non-packed".

> If SV could be modified to
> operate in element groups where the vector length register is in units of
> groups and the group size could be set per-input-register, that would
> eliminate the predication register problem and fit more closely with the
> proposed IR form for variable-length vectors in LLVM, allowing easier code
> generation.

 i believe this is possible.  packedSIMD is a little weird as it's
basically "vectorising otherwise-standard SIMD operations".  i.e. if
you were to forget entirely that the SIMD operation is in fact a SIMD
operation, just treat it as any other opcode, then that means VL is
*not* multiplied by the total number of elements, VL is in fact = the
number of *SIMD operations*.

 if however packed SIMD is *not* set, then VL corresponds with the
number of *elements*.

 packed SIMD is a mode that i thought might be really really sensible
to have: hypothetically, without it you could have up to 64 (ok ok 63)
individually predicated u8 elements; *with* packed SIMD you could make
that a whopping 63*8 u8 elements, which is enormous.
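
 The two meanings of VL described above can be put into a few lines of
C.  This is only a sketch: sv_total_elements is a hypothetical helper
name, not something from the SV spec, and it assumes that in packed
mode each SIMD operation covers exactly one XLEN-wide register.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper (not from the SV spec): number of individual
 * elements touched by one vectorised operation.
 * Assumption: in packed mode each SIMD op covers one XLEN-bit register. */
static unsigned sv_total_elements(unsigned vl, bool packed,
                                  unsigned xlen_bits, unsigned elwidth_bits)
{
    assert(elwidth_bits != 0 && xlen_bits % elwidth_bits == 0);
    if (!packed)
        return vl;  /* non-packed: VL counts elements, 1 predicate bit each */
    /* packed: VL counts SIMD ops, 1 predicate bit per block of elements */
    return vl * (xlen_bits / elwidth_bits);
}
```

 With vl=63, xlen_bits=64 and elwidth_bits=8 this gives 63 elements in
non-packed mode and 63*8 in packed mode, matching the figures above.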

> Note that each of the 4 pixels doesn't have to be part of the
> same vector, but they do have to be in registers otherwise you will end up
> spilling for the very common case of needing derivatives for
> level-of-detail calculations when using mip-mapped textures (probably the
> most common texturing mode). It would be handy to have a group be able to
> go up to 16 elements, but I think we can get away with 4 elements.

 could you let me know if you think that will fit into either the
packed/non-packedSIMD modes?

> If you
> want to save power, we should support non-power-of-2 group sizes,
> preventing f32x3 vectors from needing to use f32x4 and ignoring the fourth
> element.

 yyyyeah i think that's where a multi-issue architecture has an
advantage over something that's SIMD-converted-to-SV.  the CSR table
for SV is extremely compact, i'm very reluctant to extend the
element-width size beyond 2 bits (default, default/2, default*2,
8-bit), and i believe - correct me if i'm wrong here - you could just
set VL=3 (or 12 to get 4 of them?), set non-packed-SIMD mode, set
elwidth=default/2 (on RV64) and do the add.  RV32 would just do
elwidth=default, still setting VL=3 (or 12 to cover 4 of them?) and
non-packed-SIMD mode.

if the application has a for-loop that means 4 of those f32x3 ops can
be done together, in a SIMD microarchitecture it's kept 100% occupied.

getting the data into and out of f32x3 is ok as well (even f16x3 would
be ok) because that's just a straight LOAD/STORE (or redirected C.LWSP
where x2 - the stack pointer - were targeted at a different
register).  C.LWSP has been overloaded to be a Vector "Unit Stride",
so you set the elwidth to 16 (or 32), and boom, VL elements get pushed
into contiguous parts of the target register(s).

> >  btw for RV32, that means that the "reach" of VL is only 32 (or, 31
> > really, as you don't vectorise x0).  RV64 has twice the available
> > "actual" bytes of register file to play with - 256 - and RV128 has
> > twice that (512 bytes).
> >
> >  so... i know you were looking to use RV32 as a base: it may be more
> > optimal to look at RV64?  (something to evaluate?)
> >
> I was thinking of going with RV64 ever since you pointed out we could use
> the GPU as the main CPU, RV64GC basically being required for common
> software compatibility.

 yehyeh.  whilst there is LLVM for RV32, there's really no proper
linux kernel support for it.  Andes is working on it, apparently.

> >
> >  also, each register CSR CAM entry currently fits into exactly 16
> > bits, and there's 16 of those (and 16 predication target CAM
> > entries... might reduce that: thinking about it) so the total number
> > of bytes to save on context-switch is 68 (only 17 32-bit words) which
> > is not as heavy as it could be, given that the standard RV64
> > register files are 256 bytes (each) and for SV as it stands they would
> > be 512 bytes (each).
> >
> Assuming we only need to switch the upper registers when switching to a new
> user-space process as we can just have the kernel not be allowed to use the
> registers that can't be accessed without SV, switching 128*8B with ram
> access bandwidth at 300MB/s gives us about 6.8us total (including write old
> state and read new state), which seems reasonable as the rendering threads
> can be pinned to particular cores and we can eliminate that entirely from
> threads that have the high registers in

 i'm reluctant to make any special-case (non-uniform) logic on the
register file, the lesson from x86 is pretty clear there :)   it's
already extremely hairy with C, restricting registers to x8-x15, and
the complete lack of any explicit mention of the fact that C.LDSP uses
*exactly* the same opcode as C.FLWSP, the former being active in RV64
mode and the latter being active in RV32 mode, i mean... for goodness sake!
spike-sv is getting really, really hairy, already.

 what i _am_ planning to do is to propose a convention where at the
start (or even during) a function call, one register is reserved as a
"i am using registers x5-7, a0, a7, t0 and t1" register.  in this way,
if ever a context-switch is required, and that "by convention"
register is used as a predicate, *only* the actual registers that are
in use at any given time will need to be saved.
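
 A rough sketch of that context-switch trick in plain C, treating the
"by convention" register as a bitmask.  The function name and the
bit-i-means-register-i mapping are assumptions made for illustration;
in SV itself this would be a predicated vector STORE rather than a
software loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration: save only the registers flagged as live.
 * Assumption: bit i of in_use set means regs[i] is in use.
 * Returns the number of registers written to out. */
static size_t save_live_regs(const uint64_t *regs, unsigned nregs,
                             uint64_t in_use, uint64_t *out)
{
    size_t n = 0;
    assert(nregs <= 64);  /* the mask is a single 64-bit scalar register */
    for (unsigned i = 0; i < nregs; i++) {
        if ((in_use >> i) & 1u)
            out[n++] = regs[i];  /* the mask acts as the predicate */
    }
    return n;
}
```

 Restoring is the mirror image: walk the same mask and read the saved
values back in order, so the mask has to be saved alongside the data.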

that of course assumes that it's ok to dedicate one reg to that
purpose.  also, i was happy with VL=XLEN (-1)=RegfileSZ because it
means a one-for-one map during that context-switch trick.  with the SV
regfile size now being 128, that slightly goes out the window....
unless elwidth is set to default*2, in which case it will work fine...
but you can only do 2-at-a-time.  which is... okay.  ish.

or, oh!  you're talking about just setting a software-only convention,
right?  that the linux kernel not be made "aware" of the existence of
the register file extension?  yeah... that would work.  horrible, but
it would work :)

> >
> > so, first thing to check: are you aware of the "packed SIMD" field,
> > and what it does?  it basically says that registers are to be treated
> > like any other SIMD/MMX/SSE register, and is where the element width
> > field really comes into play.  when "packed SIMD" is enabled,
> > individual predication bits are sent to a *block* of elements, rather
> > than one per element.
> >
> I am aware of it, however I had temporarily forgotten. My asking for 256
> registers meant 256x32bit. Going to 64-bit means 128x64.

 ok so 7-bit and assuming RV64, it's going to be enough?

> > do you think that might cover the scenarios envisaged?  4 pixels @
> > 8R,8G,8B,8A, specifying packed SIMD mode, 2 of those would fit into 1
> > RV64 register, so 16 elements would fit into 8 RV64 registers... it's
> > still quite a lot, yet not as mad as it might otherwise be.
> >
> 4 pixels @ 2x f32x4 is definitely necessary for even a bare minimum shader
> as 1 of the f32x4 is used for screen position, depth, and perspective
> correction, and the other f32x4 is used for color output.

 ... and you need to do those all at the same time, in registers,
because the processing on them is critically inter-dependent, which in
turn means if they're not in registers all at once you're hosed as
you'd have to drop some to memory, then back, then back again.

 okaaay i think i get it now.

> In almost all
> cases the fragment shaders will have floating point outputs from the user's
> shader program that are then converted as part of the fixed-functionality
> (that we will probably implement in software) to RGBA8888 before being
> written to the framebuffer.

 FCVT with support for src_width=32(16?), dest_width=8, should do the
trick, i believe?  so an FCVT with a vector of float16 as src, and
uint8 as dest, set packedSIMD and blam.

 would that do it?
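
 For reference, a scalar C sketch of the per-pixel work such a
float-to-uint8 conversion would have to do.  It assumes the usual
unsigned-normalised convention (clamp to [0,1], scale by 255, round to
nearest); the function names are made up for illustration, and whether
the hardware instruction clamps and rounds in exactly this way is an
open question, not something settled in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: one float channel to an 8-bit UNORM value.
 * Assumption: clamp to [0,1], scale by 255, round to nearest. */
static uint8_t unorm8_from_float(float x)
{
    if (!(x > 0.0f))
        return 0;      /* negative, zero and NaN all map to 0 */
    if (x >= 1.0f)
        return 255;
    return (uint8_t)(x * 255.0f + 0.5f);
}

/* Convert one f32x4 colour (R,G,B,A) to four RGBA8888 bytes. */
static void rgba8888_from_f32x4(const float src[4], uint8_t dst[4])
{
    assert(src != 0 && dst != 0);
    for (int i = 0; i < 4; i++)
        dst[i] = unorm8_from_float(src[i]);
}
```

 In packed-SIMD terms the four output bytes of one pixel share a single
predicate bit, so a masked-out pixel leaves all four untouched.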

> > also, this is straying into near-supercomputer territory.  i have a
> > specific target (from an investor / customer) of reaching MALI400 /
> > GC800 level performance, which is around 5-6 GFLOPS, 100
> > MPixels/sec (enough to do 1280x720 at 30fps), and around 30
> > MTriangles/sec, in under 2.0 watts for the entire SoC.  it's very
> > modest, and i would be concerned that extending to 256 registers would
> > easily blow away the die size and thus the power budget.
> >
> I don't know about power, however I have done some research and a 4Kbyte
> (or 16, icr) SRAM (what I was thinking of for a tile buffer) takes in the
> ballpark of 1000 um^2 in 28nm.

 ok, cool.  last time i asked on isa-dev, allen baum recommended the
trick of setting a specific memory address for the tile buffer, which,
if accessed, would put the data into that, *not* put it into main
memory.  i believe it's *in between* the L1 and L2 cache, it could
even just be right before the L1 cache (but crucially, after the
Virtual Memory TLB phase, so that the VM can control which processes
are allowed access to the scratch RAM).

 i *believe* that the shakti group have actually implemented this,
already: we can look at their source code (and also ask them).

> Using a 4xFMA with a banked register file where the bank is selected by the
> lower order register number means we could probably get away with 1Rx1W
> SRAM as the backing memory for the register file, similarly to Hwacha.

 okaaay.... sooo... we make an assumption that the higher "banks"
are pretty much always going to be "vectorised", such that, actually,
they genuinely don't need to be 6R-4W (or whatever).

 if someone really does do that, you just take a pipeline stall and be
done with it?

> I
> would suggest 8 banks allowing us to do more in parallel since we could run
> other units in parallel with a 4xFMA. 8 banks would also allow us to clock
> gate the SRAM banks that are not in use for the current clock cycle
> allowing us to save more power. Note that the 4xFMA could be 4 separately
> allocated FMA units, it doesn't have to be SIMD style. If we have enough hw
> parallelism, we can under-volt and under-clock the GPU cores allowing for a
> more efficient GPU.

 yehyeh.
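
 A small C model of the banking idea, to make the bank-conflict case
concrete.  It assumes 8 banks, bank = low-order bits of the register
number, and one read port per bank; the names are hypothetical and
nothing here reflects an actual design decision.

```c
#include <assert.h>
#include <stdbool.h>

#define SV_NUM_BANKS 8u  /* assumption: the 8 banks suggested above */

/* Bank is selected by the low-order bits of the register number. */
static unsigned sv_bank_of(unsigned regnum)
{
    assert(regnum < 128);  /* 7-bit register index */
    return regnum % SV_NUM_BANKS;
}

/* True if all n reads issued in one cycle land in distinct banks,
 * i.e. a 1R-per-bank SRAM can serve them without a stall. */
static bool sv_reads_conflict_free(const unsigned *regs, unsigned n)
{
    unsigned seen = 0;  /* one bit per bank already read this cycle */
    for (unsigned i = 0; i < n; i++) {
        unsigned bit = 1u << sv_bank_of(regs[i]);
        if (seen & bit)
            return false;  /* same bank twice: needs a pipeline stall */
        seen |= bit;
    }
    return true;
}
```

 Consecutive registers (the normal vectorised case) spread across the
banks, whereas registers that are a multiple of 8 apart all collide.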

> If we are using the GPU cores as CPU cores as well, I
> think it would be important to be able to use a faster clock speed when not
> using the extended registers (similar to how Intel processors use a lower
> clock rate when AVX512 is in use) so that scalar code is not slowed down
> too much.

 coool :)  sophisticated.  can i recommend starting a page for these
kinds of notes?

 https://libre-riscv.org/3d_gpu/microarchitecture/ i just cut/paste
the paras above

> >
> > Broadcom VideoCore IV is now very well-documented, it has only 2
> > 32-entry register files (one int, one fp).  it only has "virtual"
> > parallelism on 16-wide FP numbers, so the actual ALU takes 4 at a
> > time, and a hardware-macro-loop puts 4 batches of those at it.
> >
> > i also found some reverse-engineered notes on MALI ("midgard"):
> > https://github.com/cwabbott0/mali-isa-docs/blob/master/Midgard.md
> >
> > it says that they have 32 128-bit registers which can be overloaded to
> > break into 4 32-bit or 8 16-bit.  which tends to suggest that yes,
> > going to 7-bit (128) might be a good idea.
> >
> > *thinks*.... darn-it.  ok, i may be able to save 1 byte by separating
> > out the FP and INT CSRs into separate CAMs:
> > union sv_reg_csr_entry {
> >     struct {
> >         uint64_t     regkey : 5; // 5 bits
> >         unsigned int elwidth: 2; // 0=8-bit, 1=dflt, 2=dflt/2 3=dflt*2
> >         unsigned int type   : 1; // 1=INT, 0=FP
> >         uint64_t     regidx : 6; // yes 6 bits
> >         unsigned int isvec  : 1; // vector=1, scalar=0
> >         unsigned int packed : 1; // Packed SIMD=1
> >     } b;
> >     unsigned short u;
> > };
> >
> > becomes:
> > union sv_reg_csr_entry {
> >     struct {
> >         uint64_t     regkey : 5; // 5 bits
> >         unsigned int elwidth: 2; // 0=8-bit, 1=dflt, 2=dflt/2 3=dflt*2
> >         uint64_t     regidx : 7; // yes 7 bits
> >         unsigned int isvec  : 1; // vector=1, scalar=0
> >         unsigned int packed : 1; // Packed SIMD=1
> >     } b;
> >     unsigned short u;
> > };
> >
> Note that bitfields in a C struct may not be the best way to represent that
> as neither the C standard nor the C++ standard specifies how they are laid
> out.

 mmmm... okok :)  it's good enough to be getting on with.  appreciate
the heads-up.
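
 One way round the bitfield-layout issue is explicit shifts and masks
on a uint16_t.  The sketch below encodes the 7-bit-regidx variant; the
bit positions (regkey in the lowest bits, packed in the top bit) are an
assumption chosen to follow the declaration order, and the type and
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Unpacked view of one register CSR entry (hypothetical type name). */
typedef struct {
    uint8_t regkey;   /* 5 bits */
    uint8_t elwidth;  /* 2 bits: 0=8-bit, 1=dflt, 2=dflt/2, 3=dflt*2 */
    uint8_t regidx;   /* 7 bits */
    uint8_t isvec;    /* 1 bit: vector=1, scalar=0 */
    uint8_t packed;   /* 1 bit: Packed SIMD=1 */
} sv_reg_csr_fields;

/* Assumed layout: [4:0] regkey, [6:5] elwidth, [13:7] regidx,
 * [14] isvec, [15] packed. */
static uint16_t sv_reg_csr_pack(sv_reg_csr_fields f)
{
    assert(f.regkey < 32 && f.elwidth < 4 && f.regidx < 128);
    assert(f.isvec < 2 && f.packed < 2);
    return (uint16_t)(f.regkey | (f.elwidth << 5) | (f.regidx << 7) |
                      (f.isvec << 14) | (f.packed << 15));
}

static sv_reg_csr_fields sv_reg_csr_unpack(uint16_t u)
{
    sv_reg_csr_fields f;
    f.regkey  = (uint8_t)(u & 0x1f);
    f.elwidth = (uint8_t)((u >> 5) & 0x3);
    f.regidx  = (uint8_t)((u >> 7) & 0x7f);
    f.isvec   = (uint8_t)((u >> 14) & 0x1);
    f.packed  = (uint8_t)((u >> 15) & 0x1);
    return f;
}
```

 This produces the same 16-bit value on any compiler, at the cost of
being a bit more verbose than the union.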

> >
> > that's a frickin large register file, slightly freaking me out :)
> >
> At least you won't have to worry about area if we can use 1Rx1W register
> files as they can be very compact.

 awesome.