[libre-riscv-dev] register requirements of SimpleV

Tue Oct 9 11:20:05 BST 2018

On Tue, Oct 9, 2018 at 1:59 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Mon, Oct 8, 2018 at 10:52 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > Before the implementation of SimpleV gets too far along, I think it would
> > be useful to increase the max number of registers from 64 to 128 or 256 as
> > fragment shaders require running at least 4 pixels simultaneously for
> > calculating screen-space derivatives (for mipmapping, among other things)
> > and it's very common to use 4-component vectors for things like colors
> > (rgba) and positions (using homogeneous coordinates) so each 4-component
> > vector is replicated 4 times (for each pixel) meaning that each vector will
> > need 16 elements, and with vectors that big, we will quickly run out of
> > registers.
>
>  yowser, ok.  i was vaguely planning to extend to 256 as a future
> option, however by keeping things within XLEN bits, it means that
> predication fits into a single (scalar) register as a bitfield.
> otherwise it would be necessary to treat the predication as a
> contiguous vector, extending across multiple registers as well.
>
32/64 bits per predication register is sufficient if it is 1 bit per
"group" where a group might be f32x4 or u8x1. If SV could be modified to
operate in element groups where the vector length register is in units of
groups and the group size could be set per-input-register, that would
eliminate the predication register problem and fit more closely with the
proposed IR form for variable-length vectors in LLVM, allowing easier code
generation. Note that each of the 4 pixels doesn't have to be part of the
same vector, but they do have to be in registers otherwise you will end up
spilling for the very common case of needing derivatives for
level-of-detail calculations when using mip-mapped textures (probably the
most common texturing mode). It would be handy to have a group be able to
go up to 16 elements, but I think we can get away with 4 elements. If you
want to save power, we should support non-power-of-2 group sizes, so that
f32x3 vectors do not have to be stored as f32x4 with the fourth element
ignored.
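
A rough sketch of what I mean by one predicate bit per group, written as a
plain C loop rather than anything SV-specific. The names vl_groups and
group_size are made up for illustration and are not existing SV fields; the
predicate is assumed to be a single 64-bit scalar register.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: the vector length counts groups, and each predicate
 * bit enables or disables a whole group (e.g. one f32x4 or one f32x3). */
static void grouped_add_f32(float *dst, const float *a, const float *b,
                            size_t vl_groups, size_t group_size,
                            uint64_t pred)
{
    /* Only 64 predicate bits exist, so at most 64 groups are processed. */
    for (size_t g = 0; g < vl_groups && g < 64; g++) {
        if (!((pred >> g) & 1u))
            continue;                 /* whole group is masked off */
        for (size_t e = 0; e < group_size; e++) {
            size_t i = g * group_size + e;
            dst[i] = a[i] + b[i];
        }
    }
}
```

With group_size = 4 a 64-bit predicate covers 256 f32 elements, and
group_size = 3 handles f32x3 without a padding element.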

>
>  btw for RV32, that means that the "reach" of VL is only 32 (or, 31
> really, as you don't vectorise x0).  RV64 has twice the available
> "actual" bytes of register file to play with - 256 - and RV128 has
> twice that (512 bytes).
>
>  so... i know you were looking to use RV32 as a base: it may be more
> optimal to look at RV64?  (something to evaluate?)
>
I was thinking of going with RV64 ever since you pointed out we could use
the GPU as the main CPU, RV64GC basically being required for common
software compatibility.

>
>  also, each register CSR CAM entry currently fits into exactly 16
> bits, and there's 16 of those (and 16 predication target CAM
> entries... might reduce that: thinking about it) so the total number
> of bytes to save on context-switch is 68 (only 17 32-bit words) which
> is not all that heavy as it could be, given that the standard RV64
> register files are 256 bytes (each) and for SV as it stands they would
> be 512 bytes (each).
>
Assuming we only need to switch the upper registers when switching to a new
user-space process (we can simply not allow the kernel to use the registers
that can't be accessed without SV), switching 128*8B with RAM access
bandwidth at 300MB/s gives us about 6.8us total (including writing the old
state and reading the new state), which seems reasonable as the rendering threads
can be pinned to particular cores and we can eliminate that entirely from
threads that have the high registers in
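
The 6.8us figure above can be reproduced with the small helper below. It is
only a back-of-the-envelope model: it assumes MB means 10^6 bytes and ignores
latency, caches and everything else. The function name is made up.

```c
/* Time in microseconds to write the old upper-register state out and read
 * the new state in, at a given sustained RAM bandwidth (bytes per second). */
static double sv_switch_time_us(unsigned nregs, unsigned bytes_per_reg,
                                double bandwidth_bytes_per_s)
{
    double bytes_one_way = (double)nregs * (double)bytes_per_reg;
    double bytes_total = 2.0 * bytes_one_way;   /* save old + restore new */
    return bytes_total / bandwidth_bytes_per_s * 1e6;
}
```

sv_switch_time_us(128, 8, 300e6) is 2048 bytes at 300MB/s, so a little over
6.8us.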

>
> so, first thing to check: are you aware of the "packed SIMD" field,
> and what it does?  it basically says that registers are to be treated
> like any other SIMD/MMX/SSE register, and is where the element width
> field really comes into play.  when "packed SIMD" is enabled,
> individual predication bits are sent to a *block* of elements, rather
> than one per element.
>
I am aware of it, however I had temporarily forgotten. My asking for 256
registers meant 256x32bit. Going to 64-bit means 128x64.
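
To put numbers on that in bytes rather than registers, here is a small
sketch. The helper names are made up, and it assumes elements are packed
tightly into registers with no padding.

```c
/* Total size of a register file in bytes. */
static unsigned regfile_bytes(unsigned nregs, unsigned reg_bits)
{
    return nregs * (reg_bits / 8u);
}

/* Registers needed to hold one shader operand for a block of pixels,
 * assuming tight packing (rounds up to a whole register). */
static unsigned regs_for_operand(unsigned pixels, unsigned components,
                                 unsigned elem_bits, unsigned reg_bits)
{
    unsigned bits = pixels * components * elem_bits;
    return (bits + reg_bits - 1u) / reg_bits;
}
```

Both 256x32bit and 128x64bit come to 1024 bytes, and one f32x4 operand for 4
pixels (16 elements) takes 8 of the 64-bit registers, or 16 of the 32-bit
ones.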

>
> do you think that might cover the scenarios envisaged?  4 pixels @
> 8R,8G,8B,8A, specifying packed SIMD mode, 2 of those would fit into 1
> RV64 register, so 16 elements would fit into 8 RV64 registers... it's
> still quite a lot, yet not as mad as it might otherwise be.
>
4 pixels @ 2x f32x4 is definitely necessary for even a bare minimum shader
as 1 of the f32x4 is used for screen position, depth, and perspective
correction, and the other f32x4 is used for color output. In almost all
cases the fragment shaders will have floating point outputs from the user's
shader program that are then converted as part of the fixed-functionality
(that we will probably implement in software) to RGBA8888 before being
written to the framebuffer.
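
As an illustration of the software conversion step, a minimal sketch of
packing one f32x4 colour into RGBA8888. The byte order (R in the lowest byte)
and the rounding are my assumptions, not anything decided; there is no sRGB
conversion or blending here.

```c
#include <stdint.h>

/* Clamp to [0, 1] and scale to 0..255 with round-to-nearest. */
static uint8_t unorm8(float v)
{
    if (!(v > 0.0f))
        return 0;                    /* negative, zero and NaN map to 0 */
    if (v >= 1.0f)
        return 255;
    return (uint8_t)(v * 255.0f + 0.5f);
}

/* Pack {r, g, b, a} floats into one 32-bit pixel, R in bits 0..7. */
static uint32_t pack_rgba8888(const float rgba[4])
{
    return (uint32_t)unorm8(rgba[0])
         | (uint32_t)unorm8(rgba[1]) << 8
         | (uint32_t)unorm8(rgba[2]) << 16
         | (uint32_t)unorm8(rgba[3]) << 24;
}
```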

>
> if however float16 is needed, and doing 4 of those at a time is
> *really* needed, that's 16 registers for a single vector: keeping it
> clean and only using the top 32 registers that's half the
> (sensibly-allocated) register file just for one operand.
>
> if we can work out the actual number of register file *bytes* needed -
> the actual size of the register file RAM - rather than the number of
> *registers* we have much more information to assess whether extending
> to 128 or 256 is necessary.
>

>
> also, this is straying into near-supercomputer territory.  i have a
> specific target (from an investor / customer) of reaching MALI400 /
> GC800 level performance, which is around 5-6 GFLOPS/sec, 100
> MPixels/sec (enough to do 1280x720 at 30fps), and around 30
> MTriangles/sec, in under 2.0 watts for the entire SoC.  it's very
> modest, and i would be concerned that extending to 256 registers would
> easily blow away the die size and thus the power budget.
>
I don't know about power, however I have done some research and a 4Kbyte
(or 16Kbyte, I can't recall which) SRAM (what I was thinking of for a tile
buffer) takes in the ballpark of 1000 um^2 in 28nm.
Using a 4xFMA with a banked register file where the bank is selected by the
lower order register number means we could probably get away with 1Rx1W
SRAM as the backing memory for the register file, similarly to Hwacha. I
would suggest 8 banks allowing us to do more in parallel since we could run
other units in parallel with a 4xFMA. 8 banks would also allow us to clock
gate the SRAM banks that are not in use for the current clock cycle
allowing us to save more power. Note that the 4xFMA could be 4 separately
allocated FMA units, it doesn't have to be SIMD style. If we have enough hw
parallelism, we can under-volt and under-clock the GPU cores allowing for a
more efficient GPU. If we are using the GPU cores as CPU cores as well, I
think it would be important to be able to use a faster clock speed when not
using the extended registers (similar to how Intel processors use a lower
clock rate when AVX512 is in use) so that scalar code is not slowed down
too much.
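
A small sketch of the bank selection described above, assuming 8 banks
selected by the low 3 bits of the register number. The names are
hypothetical. Note that this simple check also flags the same register
appearing twice, which real hardware could handle by sharing one read.

```c
#include <stdint.h>

#define SV_NUM_BANKS 8u

/* Bank is chosen by the low-order bits of the register number. */
static unsigned sv_bank_of(unsigned regnum)
{
    return regnum % SV_NUM_BANKS;
}

/* Returns 1 if any two of the n register numbers land in the same bank,
 * i.e. they could not all be accessed in one cycle with 1Rx1W banks. */
static int sv_has_bank_conflict(const unsigned *regs, unsigned n)
{
    uint8_t used = 0;                /* one bit per bank */
    for (unsigned i = 0; i < n; i++) {
        uint8_t bit = (uint8_t)(1u << sv_bank_of(regs[i]));
        if (used & bit)
            return 1;
        used |= bit;
    }
    return 0;
}
```

Consecutive register numbers always fall in different banks, which suits
vectors laid out across consecutive registers.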

>
> Broadcom VideoCore IV is now very well-documented, it has only 2
> 32-entry register files (one int, one fp).  it only has "virtual"
> parallelism on 16-wide FP numbers, so the actual ALU takes 4 at a
> time, and a hardware-macro-loop puts 4 batches of those at it.
>
> i also found some reverse-engineered notes on MALI ("midgard"):
> https://github.com/cwabbott0/mali-isa-docs/blob/master/Midgard.md
>
> it says that they have 32 128-bit registers which can be overloaded to
> break into 4 32-bit or 8 16-bit.  which tends to suggest that yes,
> going to 7-bit (128) might be a good idea.
>
> *thinks*.... darn-it.  ok, i may be able to save 1 byte by separating
> out the FP and INT CSRs into separate CAMs:
> union sv_reg_csr_entry {
>     struct {
>         uint64_t     regkey : 5; // 5 bits
>         unsigned int elwidth: 2; // 0=8-bit, 1=dflt, 2=dflt/2 3=dflt*2
>         unsigned int type   : 1; // 1=INT, 0=FP
>         uint64_t     regidx : 6; // yes 6 bits
>         unsigned int isvec  : 1; // vector=1, scalar=0
>         unsigned int packed : 1; // Packed SIMD=1
>     } b;
>     unsigned short u;
> };
>
> becomes:
> union sv_reg_csr_entry {
>     struct {
>         uint64_t     regkey : 5; // 5 bits
>         unsigned int elwidth: 2; // 0=8-bit, 1=dflt, 2=dflt/2 3=dflt*2
>         uint64_t     regidx : 7; // yes 7 bits
>         unsigned int isvec  : 1; // vector=1, scalar=0
>         unsigned int packed : 1; // Packed SIMD=1
>     } b;
>     unsigned short u;
> };
>
Note that bitfields in a C struct may not be the best way to represent that
as neither the C standard nor the C++ standard specifies how they are laid
out.
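
One alternative is explicit shifts and masks on a uint16_t, sketched below
for the 7-bit regidx version. The bit positions (regkey in the lowest bits,
then elwidth, regidx, isvec, packed) are my assumption based on the order of
the fields in the struct; the actual CSR layout would need to be specified
separately.

```c
#include <stdint.h>

/* Assumed layout: [4:0] regkey, [6:5] elwidth, [13:7] regidx,
 * [14] isvec, [15] packed. */
static uint16_t sv_csr_pack(unsigned regkey, unsigned elwidth,
                            unsigned regidx, unsigned isvec,
                            unsigned packed)
{
    return (uint16_t)((regkey & 0x1fu)
                    | (elwidth & 0x3u) << 5
                    | (regidx & 0x7fu) << 7
                    | (isvec & 0x1u) << 14
                    | (packed & 0x1u) << 15);
}

/* Field accessors: shift down, then mask to the field width. */
static unsigned sv_csr_regkey(uint16_t e) { return e & 0x1fu; }
static unsigned sv_csr_regidx(uint16_t e) { return (e >> 7) & 0x7fu; }
static unsigned sv_csr_isvec(uint16_t e)  { return (e >> 14) & 0x1u; }
```

This gives the same 16-bit entry on every compiler, at the cost of being a
bit more verbose than the union.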

>
> that's a frickin large register file, slightly freaking me out :)
>
At least you won't have to worry about area if we can use 1Rx1W register
files as they can be very compact.

Jacob