[libre-riscv-dev] register requirements of SimpleV

Wed Oct 10 05:55:51 BST 2018

On Tue, Oct 9, 2018 at 4:45 AM lkcl <lkcl at libre-riscv.org> wrote:

> > Note that each of the 4 pixels doesn't have to be part of the
> > same vector, but they do have to be in registers otherwise you will end
> > up spilling for the very common case of needing derivatives for
> > level-of-detail calculations when using mip-mapped textures (probably the
> > most common texturing mode). It would be handy to have a group be able to
> > go up to 16 elements, but I think we can get away with 4 elements.
>
>  could you let me know if you think that will fit into either the
> packed/non-packed SIMD modes?
>
I think some of the groups could be represented by packed/non-packed SIMD
modes, however not all of them can be. In order to better explain what I
meant, and since e-mail doesn't really support tables, I wrote a Markdown
document describing my envisioned grouping system:
https://git.libre-riscv.org/?p=kazan.git;a=blob;f=docs/SimpleV+Grouping+Proposal.md;hb=HEAD

One critical point is that VL is the number of element groups, not the
total number of elements. That makes it easier to use, as VL can be set
once or a few times rather than needing to be changed every time we need a
different vector.
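
As a rough illustration of that point (a sketch only; the function name and
the example group sizes are made up here, not taken from the proposal
document), counting VL in groups means one VL setting covers vectors with
different group sizes:

```python
def total_elements(vl, group_size):
    """Number of scalar elements touched when VL counts element groups."""
    return vl * group_size

# Hypothetical example: VL = 4 (one group per pixel), set once.
vl = 4
for name, group_size in [("f32x4 position", 4), ("f32x3 normal", 3), ("f32 depth", 1)]:
    print(name, "->", total_elements(vl, group_size), "elements")
```

If VL counted individual elements instead, the same three operations would
need VL set to 16, then 12, then 4.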

I'm open to changing the semantics for unused portions of registers as what
I proposed may not be the best solution. I chose those particular semantics
as that matches RV's behavior for scalar values (sign-extending integers
and 1-extending floating-point).
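
To make those semantics concrete, here is a minimal sketch for a 32-bit
element held in a 64-bit register. The helper names are hypothetical; the
behaviour is the scalar RV64 one (integers sign-extended, narrower floats
stored with the upper bits all set to 1):

```python
import struct

def sign_extend_i32(value):
    """Return the 64-bit register image of a sign-extended 32-bit integer."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:
        value |= 0xFFFFFFFF00000000
    return value

def one_extend_f32(x):
    """Return the 64-bit register image of an f32 with the upper 32 bits all 1s."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return 0xFFFFFFFF00000000 | bits

print(hex(sign_extend_i32(-1)))
print(hex(one_extend_f32(1.0)))
```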

>
> > If you
> > want to save power, we should support non-power-of-2 group sizes,
> > preventing f32x3 vectors from needing to use f32x4 and ignoring the
> > fourth element.
>
>  yyyyeah i think that's where a multi-issue architecture has an
> advantage over something that's SIMD-converted-to-SV.  the CSR table
> for SV is extremely compact, i'm very reluctant to extend the
> element-width size beyond 2 bits (default, default/2, default*2,
> 8-bit), and i believe - correct me if i'm wrong here - you could just
> set VL=3 (or 12 to get 4 of them?), set non-packed-SIMD mode, set
> elwidth=default/2 (on RV64) and do the add.  RV32 would just do
> elwidth=default, still setting VL=3 (or 12 to cover 4 of them?) and
> non-packed-SIMD mode.
>
> if the application has a for-loop that means 4 of those f32x3 ops can
> be done together, in a SIMD microarchitecture it's kept 100% occupied.
>
> getting the data into and out of f32x3 is ok as well (even f16x3 would
> be ok) because that's just a straight LOAD/STORE (or redirected C.LWSP
> where x2 - the stack pointer - were targeted at a different
> register).  C.LWSP has been overloaded to be a Vector "Unit Stride",
> so you set the elwidth to 16 (or 32), and boom, VL elements get pushed
> into contiguous parts of the target register(s).
>
> > >  btw for RV32, that means that the "reach" of VL is only 32 (or, 31
> > > really, as you don't vectorise x0).  RV64 has twice the available
> > > "actual" bytes of register file to play with - 256 - and RV128 has
> > > twice that (512 bytes).
> > >
> > >  so... i know you were looking to use RV32 as a base: it may be more
> > > optimal to look at RV64?  (something to evaluate?)
> > >
> > I was thinking of going with RV64 ever since you pointed out we could use
> > the GPU as the main CPU, RV64GC basically being required for common
> > software compatibility.
>
>  yehyeh.  whilst there is LLVM for RV32, there's really no proper
> linux kernel support for it.  Andes is working on it, apparently.
>
> > >
> > >  also, each register CSR CAM entry currently fits into exactly 16
> > > bits, and there's 16 of those (and 16 predication target CAM
> > > entries... might reduce that: thinking about it) so the total number
> > > of bytes to save on context-switch is 68 (only 17 32-bit words) which
> > > is not all that heavy as it could be, given that the standard RV64
> > > register files are 256 bytes (each) and for SV as it stands they would
> > > be 512 bytes (each).
> > >
> > Assuming we only need to switch the upper registers when switching to a
> > new user-space process as we can just have the kernel not be allowed to
> > use the registers that can't be accessed without SV, switching 128*8B
> > with ram access bandwidth at 300MB/s gives us about 6.8us total
> > (including write old state and read new state), which seems reasonable
> > as the rendering threads can be pinned to particular cores and we can
> > eliminate that entirely from threads that have the high registers in
>
>  i'm reluctant to make any special-case (non-uniform) logic on the
> register file, the lesson from x86 is pretty clear there :)   it's
> already extremely hairy with C, restricting registers to x8-x15, and
> the complete lack of any explicit mention of the fact that C.LDSP uses
> *exactly* the same opcode as C.FLWSP, one being active in RV32 mode
> and the other being active in RV64 mode, i mean... for goodness sake!
> spike-sv is getting really, really hairy, already.
>
>  what i _am_ planning to do is to propose a convention where at the
> start (or even during) a function call, one register is reserved as a
> "i am using registers x5-7, a0, a7, t0 and t1" register.  in this way,
> if ever a context-switch is required, and that "by convention"
> register is used as a predicate, *only* the actual registers that are
> in use at any given time will need to be saved.
>
> that of course assumes that it's ok to dedicate one reg to that
> purpose.  also, i was happy with VL=XLEN (-1)=RegfileSZ because it
> means a one-for-one map during that context-switch trick.  with the SV
> regfile size now being 128, that slightly goes out the window....
> unless elwidth is set to default*2, in which case it will work fine...
> but you can only do 2-at-a-time.  which is... okay.  ish.
>
> or, oh!  you're talking about just setting a software-only convention,
> right?  that the linux kernel not be made "aware" of the existence of
> the register file extension?  yeah... that would work.  horrible, but
> it would work :)
>
A better solution would be to make all of the upper registers callee-saved
in the kernel ABI, so if the kernel needs to use them, they are saved and
restored around where they are used.
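
For reference, the 6.8us context-switch figure quoted above can be
reproduced as follows. This is only a back-of-the-envelope sketch: the
function and parameter names are made up, and it assumes 1 MB = 10^6 bytes
and that the old state is written and the new state read at the same
bandwidth.

```python
def switch_time_us(num_regs=128, reg_bytes=8, bandwidth_bytes_per_s=300e6):
    """Time to save one set of upper registers and load another, in microseconds."""
    bytes_moved = 2 * num_regs * reg_bytes  # write old state + read new state
    return bytes_moved / bandwidth_bytes_per_s * 1e6

print(round(switch_time_us(), 1), "us")
```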

>
> > >
> > > so, first thing to check: are you aware of the "packed SIMD" field,
> > > and what it does?  it basically says that registers are to be treated
> > > like any other SIMD/MMX/SSE register, and is where the element width
> > > field really comes into play.  when "packed SIMD" is enabled,
> > > individual predication bits are sent to a *block* of elements, rather
> > > than one per element.
> > >
> > I am aware of it, however I had temporarily forgotten. My asking for 256
> > registers meant 256x32bit. Going to 64-bit means 128x64.
>
>  ok so 7-bit and assuming RV64, it's going to be enough?
>
I think it will be enough.
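
The arithmetic behind that, as a small sketch (the names are illustrative
only): 256 registers of 32 bits hold the same number of bits as 128
registers of 64 bits, and 128 registers need a 7-bit register number.

```python
def registers_for(total_bits, reg_width_bits):
    """How many registers of the given width hold total_bits of state."""
    return total_bits // reg_width_bits

total_bits = 256 * 32                           # the originally requested 256 x 32-bit
num_regs_rv64 = registers_for(total_bits, 64)   # same storage as 64-bit registers
index_bits = (num_regs_rv64 - 1).bit_length()   # bits needed to number them

print(num_regs_rv64, "registers,", index_bits, "bit register number")
```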

>
> > > do you think that might cover the scenarios envisaged?  4 pixels @
> > > 8R,8G,8B,8A, specifying packed SIMD mode, 2 of those would fit into 1
> > > RV64 register, so 16 elements would fit into 8 RV64 registers... it's
> > > still quite a lot, yet not as mad as it might otherwise be.
> > >
> > 4 pixels @ 2x f32x4 is definitely necessary for even a bare minimum
> > shader as 1 of the f32x4 is used for screen position, depth, and
> > perspective correction, and the other f32x4 is used for color output.
>
>  ... and you need to do those all at the same time, in registers,
> because the processing on them is critically inter-dependent, which in
> turn means if they're not in registers all at once you're hosed as
> you'd have to drop some to memory, then back, then back again.
>
Yup.

>
>  okaaay i think i get it now.
>
> > In almost all
> > cases the fragment shaders will have floating point outputs from the
> > user's shader program that are then converted as part of the
> > fixed-functionality (that we will probably implement in software) to
> > RGBA8888 before being written to the framebuffer.
>
>  FCVT with support for src_width=32(16?), dest_width=8, should do the
> trick, i believe?  so an FCVT with a vector of float16 as src, and
> uint8 as dest, set packedSIMD and blam.
>
>  would that do it?
>
That should work, though note that I think the source is 4x32-bit fp (that
may be required by the Vulkan spec, I can't recall; I was planning on using
fp32 for simplicity's sake anyway).
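
A software model of that conversion step might look like the sketch below.
It is an assumption-laden illustration, not the planned implementation: the
function names are invented, R is placed in the lowest byte, values are
clamped to [0, 1] and rounded to nearest, and NaN is mapped to 0.

```python
def f32x4_to_rgba8888(color):
    """Pack an (r, g, b, a) tuple of floats into one 32-bit RGBA8888 word."""
    packed = 0
    for i, c in enumerate(color):
        if c != c:                      # NaN: treat as 0 (assumption)
            c = 0.0
        c = min(max(c, 0.0), 1.0)       # clamp to the representable range
        byte = int(c * 255.0 + 0.5)     # scale and round to nearest
        packed |= byte << (8 * i)       # R in bits 0-7, A in bits 24-31
    return packed

def pack_two_pixels(p0, p1):
    """Two RGBA8888 pixels side by side in one 64-bit register image."""
    return f32x4_to_rgba8888(p0) | (f32x4_to_rgba8888(p1) << 32)

print(hex(f32x4_to_rgba8888((1.0, 0.0, 0.0, 1.0))))
print(hex(pack_two_pixels((1.0, 0.0, 0.0, 1.0), (0.0, 0.0, 1.0, 1.0))))
```

The second helper reflects the point made earlier in the thread that two
8R,8G,8B,8A pixels fit into one RV64 register.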

>  ok, cool.  last time i asked on isa-dev, allen baum recommended the
> trick of setting a specific memory address for the tile buffer, which,
> if accessed, would put the data into that, *not* put it into main
> memory.  i believe it's *in between* the L1 and L2 cache, it could
> even just be right before the L1 cache (but crucially, after the
> Virtual Memory TLB phase, so that the VM can control which processes
> are allowed access to the scratch RAM).
>
That's kinda what I was planning, though I was thinking of using it as part
of the L1 cache when not rendering. I think it might be useful to have the
tile-cache be thread-local memory, though I have not thought that through.

>
>  i *believe* that the shakti group have actually implemented this,
> already: we can look at their source code (and also ask them).
>
> > Using a 4xFMA with a banked register file where the bank is selected by
> > the lower order register number means we could probably get away with
> > 1Rx1W SRAM as the backing memory for the register file, similarly to
> > Hwacha.
>
>  okaaay.... sooo... we make an assumption that the top higher "banks"
> are pretty much always going to be "vectorised", such that, actually,
> they genuinely don't need to be 6R-4W (or whatever).
>
Yeah pretty much, though I had meant the bank number comes from the
least-significant bits of the 7-bit register number.
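
As a sketch of what I mean (assuming 4 banks to go with the 4xFMA; the
bank count and the helper names are illustrative, not decided):

```python
NUM_BANKS = 4  # assumption: one bank per FMA lane

def bank_of(reg):
    """Bank index taken from the least-significant bits of the register number."""
    return reg & (NUM_BANKS - 1)

def has_bank_conflict(regs):
    """True if two registers accessed in the same cycle fall in the same bank."""
    banks = [bank_of(r) for r in regs]
    return len(set(banks)) != len(banks)

# Consecutive registers of a vector land in different banks.
print([bank_of(r) for r in range(32, 36)])
print(has_bank_conflict([32, 33, 34, 35]))
print(has_bank_conflict([32, 36]))
```

With this mapping a vector occupying consecutive registers spreads across
all the banks, so each bank only sees one access per cycle; only accesses
that are a multiple of NUM_BANKS apart collide.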

>
>  if someone really does do that, you just take a pipeline stall and be
> done with it?
>
Yes.

Jacob