[libre-riscv-dev] register requirements of SimpleV

Tue Oct 9 09:58:55 BST 2018

On Mon, Oct 8, 2018 at 10:52 AM Jacob Lifshay <programmerjake at gmail.com> wrote:

> Before the implementation of SimpleV gets too far along, I think it would
> be useful to increase the max number of registers from 64 to 128 or 256 as
> fragment shaders require running at least 4 pixels simultaneously for
> calculating screen-space derivatives (for mipmapping, among other things)
> and it's very common to use 4-component vectors for things like colors
> (rgba) and positions (using homogeneous coordinates) so each 4-component
> vector is replicated 4 times (for each pixel) meaning that each vector will
> need 16 elements, and with vectors that big, we will quickly run out of
> registers.

 yowser, ok.  i was vaguely planning to extend to 256 as a future
option, however by keeping things within XLEN bits, it means that
predication fits into a single (scalar) register as a bitfield.
otherwise it would be necessary to treat the predication as a
contiguous vector, extending across multiple registers as well.
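
 to illustrate the single-register predicate idea, here is a minimal
sketch in C, assuming RV64 (so 64 mask bits) and one bit per element.
the names pred_enabled and pred_add are made up for this example and
are not part of any SV spec.

```c
#include <stdbool.h>
#include <stdint.h>

#define XLEN 64u /* assumption: RV64, one predicate register = 64 mask bits */

/* true if element idx is enabled by the mask held in one scalar register.
 * elements at or beyond XLEN have no mask bit, so they report false. */
static bool pred_enabled(uint64_t pred, unsigned idx)
{
    return idx < XLEN && ((pred >> idx) & 1u) != 0;
}

/* predicated element-wise add: masked-out elements of dst are left alone */
static void pred_add(int32_t *dst, const int32_t *a, const int32_t *b,
                     unsigned vl, uint64_t pred)
{
    for (unsigned i = 0; i < vl && i < XLEN; i++) {
        if (pred_enabled(pred, i))
            dst[i] = a[i] + b[i];
    }
}
```

 once VL goes past XLEN the mask no longer fits in one register, which
is the point at which predication would have to become a vector too.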

 btw for RV32, that means that the "reach" of VL is only 32 (or, 31
really, as you don't vectorise x0).  RV64 has twice the available
"actual" bytes of register file to play with - 256 - and RV128 has
twice that (512 bytes).
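
 the arithmetic behind those figures, as a small sketch: it assumes the
standard 32-entry register file and one predicate bit per element in a
single XLEN-bit register.  function names are hypothetical.

```c
#define NUM_REGS 32u /* standard RISC-V register count per file */

/* total bytes in one standard register file for a given XLEN */
static unsigned regfile_bytes(unsigned xlen_bits)
{
    return NUM_REGS * (xlen_bits / 8u);
}

/* assumption: the predicate is a bitfield in one XLEN-bit scalar register,
 * so the number of elements VL can usefully cover is XLEN */
static unsigned vl_reach(unsigned xlen_bits)
{
    return xlen_bits;
}
```

 regfile_bytes(32), regfile_bytes(64) and regfile_bytes(128) give 128,
256 and 512 respectively.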

 so... i know you were looking to use RV32 as a base: it may be more
optimal to look at RV64?  (something to evaluate?)

 also, each register CSR CAM entry currently fits into exactly 16
bits, and there's 16 of those (and 16 predication target CAM
entries... might reduce that: thinking about it) so the total number
of bytes to save on context-switch is 68 (only 17 32-bit words) which
is not as heavy as it could be, given that the standard RV64
register files are 256 bytes (each) and for SV as it stands they would
be 512 bytes (each).

so, first thing to check: are you aware of the "packed SIMD" field,
and what it does?  it basically says that registers are to be treated
like any other SIMD/MMX/SSE register, and is where the element width
field really comes into play.  when "packed SIMD" is enabled,
individual predication bits are sent to a *block* of elements, rather
than one per element.

do you think that might cover the scenarios envisaged?  4 pixels @
8R,8G,8B,8A, specifying packed SIMD mode, 2 of those would fit into 1
RV64 register, so 16 elements would fit into 8 RV64 registers... it's
still quite a lot, yet not as mad as it might otherwise be.

if however float16 is needed, and doing 4 of those at a time is
*really* needed, that's 16 registers for a single vector: keeping it
clean and only using the top 32 registers that's half the
(sensibly-allocated) register file just for one operand.

if we can work out the actual number of register file *bytes* needed -
the actual size of the register file RAM - rather than the number of
*registers* we have much more information to assess whether extending
to 128 or 256 is necessary.
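
 a rough way to do that byte-counting, as a sketch only: the helper
names are invented here, and "packed" simply means elements sit
back-to-back in the register file, whereas "unpacked" means one element
per register.

```c
/* bytes of register file RAM used by one vector operand when its
 * elements are stored back-to-back */
static unsigned vec_bytes_packed(unsigned n_elems, unsigned elem_bits)
{
    return (n_elems * elem_bits + 7u) / 8u;
}

/* number of XLEN-bit registers that occupies, rounded up */
static unsigned vec_regs_packed(unsigned n_elems, unsigned elem_bits,
                                unsigned xlen_bits)
{
    return (n_elems * elem_bits + xlen_bits - 1u) / xlen_bits;
}

/* one element per register, whatever the element width */
static unsigned vec_regs_unpacked(unsigned n_elems)
{
    return n_elems;
}
```

 with these, 16 32-bit RGBA pixels on RV64 come to
vec_regs_packed(16, 32, 64) = 8 registers (64 bytes), and 16 float16
elements come to 16 registers unpacked.  packing the float16 case would
give vec_regs_packed(16, 16, 64) = 4, though that is my own arithmetic
rather than a figure from the discussion.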

also, this is straying into near-supercomputer territory.  i have a
specific target (from an investor / customer) of reaching MALI400 /
GC800 level performance, which is around 5-6 GFLOPS, 100
MPixels/sec (enough to do 1280x720 at 30fps), and around 30
MTriangles/sec, in under 2.0 watts for the entire SoC.  it's very
modest, and i would be concerned that extending to 256 registers would
easily blow away the die size and thus the power budget.

Broadcom VideoCore IV is now very well-documented, it has only 2
32-entry register files (one int, one fp).  it only has "virtual"
parallelism on 16-wide FP numbers, so the actual ALU takes 4 at a
time, and a hardware-macro-loop puts 4 batches of those at it.
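
 the "virtual" 16-wide idea can be sketched like this.  it is only an
illustration of the batching described above, not VideoCore IV's actual
microarchitecture, and alu4_fadd / virt16_fadd are hypothetical names.

```c
#define VIRT_WIDTH 16 /* vector width the programmer sees */
#define ALU_LANES  4  /* lanes the physical ALU processes per pass */

/* one pass through the physical 4-lane ALU */
static void alu4_fadd(float *dst, const float *a, const float *b)
{
    for (int lane = 0; lane < ALU_LANES; lane++)
        dst[lane] = a[lane] + b[lane];
}

/* the macro-loop: feed the 16-wide operation to the ALU as 4 batches of 4 */
static void virt16_fadd(float *dst, const float *a, const float *b)
{
    for (int batch = 0; batch < VIRT_WIDTH / ALU_LANES; batch++) {
        int off = batch * ALU_LANES;
        alu4_fadd(dst + off, a + off, b + off);
    }
}
```

 the register file and the instruction stream see one 16-wide
operation, while the ALU itself only needs to be 4 lanes wide.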

i also found some reverse-engineered notes on MALI ("midgard"):
https://github.com/cwabbott0/mali-isa-docs/blob/master/Midgard.md

it says that they have 32 128-bit registers which can be overloaded to
break into 4 32-bit or 8 16-bit.  which tends to suggest that yes,
going to 7-bit (128) might be a good idea.
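
 the reasoning, spelled out as a sketch: 32 registers of 128 bits each
hold 32 x 4 = 128 individually addressable 32-bit elements, which needs
a 7-bit index.  the union below is just an illustration of the
"overloaded" views, with invented names, and says nothing about how
midgard implements it.

```c
#include <stdint.h>

#define MIDGARD_REGS     32u  /* from the reverse-engineered notes */
#define MIDGARD_REG_BITS 128u

/* one 128-bit register viewed as 4 x 32-bit or 8 x 16-bit elements */
typedef union {
    uint32_t w32[4];
    uint16_t h16[8];
} reg128_t;

/* how many elements of a given width fit in one 128-bit register */
static unsigned elems_per_reg(unsigned elem_bits)
{
    return MIDGARD_REG_BITS / elem_bits;
}

/* total addressable elements of that width across the whole file */
static unsigned total_elems(unsigned elem_bits)
{
    return MIDGARD_REGS * elems_per_reg(elem_bits);
}
```

 total_elems(32) is 128, i.e. the equivalent of 128 32-bit registers.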

*thinks*.... darn-it.  ok, i may be able to save 1 bit by separating
out the FP and INT CSRs into separate CAMs (the "type" field goes away
and regidx can take the spare bit):
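#include <stdint.h>

// all fields are uint16_t so that the struct packs into exactly 16 bits.
// bit-field layout is implementation-defined: this assumes gcc/clang.
union sv_reg_csr_entry {
    struct {
        uint16_t regkey : 5; // 5 bits
        uint16_t elwidth: 2; // 0=8-bit, 1=dflt, 2=dflt/2, 3=dflt*2
        uint16_t type   : 1; // 1=INT, 0=FP
        uint16_t regidx : 6; // yes 6 bits
        uint16_t isvec  : 1; // vector=1, scalar=0
        uint16_t packed : 1; // Packed SIMD=1
    } b;
    uint16_t u;              // whole entry as one 16-bit value
};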

becomes:
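// "type" is dropped (INT and FP get separate CAMs); regidx grows to 7 bits.
// still 16 bits in total, assuming gcc/clang bit-field packing.
union sv_reg_csr_entry {
    struct {
        uint16_t regkey : 5; // 5 bits
        uint16_t elwidth: 2; // 0=8-bit, 1=dflt, 2=dflt/2, 3=dflt*2
        uint16_t regidx : 7; // yes 7 bits
        uint16_t isvec  : 1; // vector=1, scalar=0
        uint16_t packed : 1; // Packed SIMD=1
    } b;
    uint16_t u;              // whole entry as one 16-bit value
};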
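
 since bit-field ordering is up to the compiler, a portable alternative
is to pack the 16-bit entry with explicit shifts and masks.  this is a
sketch only: the bit positions (regkey in the low bits, packed at the
top) are an assumption that mirrors the field order of the second
union, and the sv_entry_* names are hypothetical.

```c
#include <stdint.h>

/* assumed layout, lowest bit first:
 *   regkey[4:0] elwidth[6:5] regidx[13:7] isvec[14] packed[15] */
static uint16_t sv_entry_pack(unsigned regkey, unsigned elwidth,
                              unsigned regidx, unsigned isvec,
                              unsigned packed)
{
    return (uint16_t)((regkey & 0x1fu)
                    | ((elwidth & 0x3u) << 5)
                    | ((regidx & 0x7fu) << 7)
                    | ((isvec & 0x1u) << 14)
                    | ((packed & 0x1u) << 15));
}

/* field extractors for the same assumed layout */
static unsigned sv_entry_regkey(uint16_t e)  { return e & 0x1fu; }
static unsigned sv_entry_elwidth(uint16_t e) { return (e >> 5) & 0x3u; }
static unsigned sv_entry_regidx(uint16_t e)  { return (e >> 7) & 0x7fu; }
static unsigned sv_entry_isvec(uint16_t e)   { return (e >> 14) & 0x1u; }
static unsigned sv_entry_packed(uint16_t e)  { return (e >> 15) & 0x1u; }
```

 with 7 bits, sv_entry_regidx can return anything from 0 to 127, and
the five fields still add up to exactly 16 bits.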

that's a frickin large register file, slightly freaking me out :)

l.