[Libre-soc-isa] [Bug 213] SimpleV Standard writeup needed

Mon Oct 19 23:36:46 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=213

--- Comment #72 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #70)
> (In reply to Jacob Lifshay from comment #64)
> 
> > By contrast, using 8-bit lanes for masks means we'd have to add extra logic
> > to handle VL > 8 and we'd have to handle scaling the result (an extra shift
> > instruction), and we'd have to handle making sure that lanes have all bits
> > set before inverting them. If we instead decide to have an on lane generate
> > 0xFF instead of 0x01, then popcount is likewise messed up.
> > 
> > All of the above mess is solved efficiently by just having 1 bit per lane.
> 
> ok, the problem is that it's not that simple (never is).  there is no
> concept of "lanes" in SV.  or there is: they're the ALU widths (which will
> be either 64-SIMD or we did discuss doing 32-SIMD and splitting the regfile
> into HI-32 and LO-32, so that 64-bit operations need a pair of 32-wide ALUs
> to collaborate)
> 
> these ALU widths are completely divorced from architectural (ISA, SV)
> element widths, and consequently no amount of choice of bit-width for
> predicate lanes - whether it be 8-bit, 16-bit, is going to cut it.
> 
> the reason is because:
> 
> * when you request elwidth=8bit operations, you need *8* predicate bits
>   to be allocated (routed) to a given 64-bit SIMD ALU

All you need is a bitwise right shifter to send the next 8 bits from the vector
mask to the ALU; the ALU can have the few muxes needed to pick out which of
those 8 bits it needs. I'd estimate somewhere around 30 muxes max, assuming
it's translated to byte-level enables.
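
As a rough software model of that shifter-plus-muxes idea (a sketch only, not
a description of any actual Libre-SOC hardware), the following assumes a
64-bit SIMD ALU, a mask held as a plain integer with 1 bit per element, and a
hypothetical helper name byte_enables:

```python
def byte_enables(mask: int, first_element: int, elwidth: int) -> int:
    """Model: turn 1-bit-per-element predicate bits into 8 byte enables.

    mask          -- vector mask, bit i is the predicate for element i
    first_element -- index of the first element handled by this ALU pass
    elwidth       -- element width in bits: 8, 16, 32 or 64
    """
    if elwidth not in (8, 16, 32, 64):
        raise ValueError("elwidth must be 8, 16, 32 or 64")
    elements = 64 // elwidth         # predicate bits used: 8, 4, 2 or 1
    bytes_per_element = elwidth // 8
    # The "bitwise right shifter": bring the next 8 mask bits down to bit 0.
    chunk = (mask >> first_element) & 0xFF
    enables = 0
    # The "few muxes": fan each used predicate bit out to its bytes.
    for i in range(elements):
        if (chunk >> i) & 1:
            element_bytes = (1 << bytes_per_element) - 1
            enables |= element_bytes << (i * bytes_per_element)
    return enables


if __name__ == "__main__":
    # elwidth=16: elements 0, 1 and 3 active -> bytes 0-3 and 6-7 enabled.
    print(hex(byte_enables(0b1011, 0, 16)))
```

The same 8-bit chunk feeds every element width; only the number of bits
consumed from it (8, 4, 2 or 1) changes.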

> * when you request elwidth=16bit, that's 4 predicate bits
> * elwidth=32 bit that's 2 predicate bits
> * elwidth=64 bit is only 1
> 
> the routing and DM allocation on that - the subdivision of the 8-bit masks
> concept - is going to be a pig.

DM allocation should be pretty simple -- only needed at decode time, just
(assuming integer mask registers are split into 16 subregisters) four 16-bit
decoders with their outputs ORed together, one decoder each for 8, 16, 32, and
64-bit elements.
> 
> 
> > Vectorized CRs still have a bunch of the above mess, because they aren't 1
> > bit per lane.
> 
> again you're conflating the (false/inapplicable) concept of "lanes" as being
> an architectural concept in SV elements, where it can't actually be applied.
> i know it works in Cray-style Vector ISAs, but it doesn't work here.

I'm using the word "lane" to mean the same thing as "element" since it's
shorter to type; I'm ignoring subvectors for now. A lane/element is the thing
that VL counts. There is a single conceptual bool per lane/element in a vector
mask. A lane/element can be 8/16/32/64 bits (or up to 256 bits with subvectors
-- 4x 64-bit).
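
To illustrate why 1 bit per lane/element keeps popcount and inversion simple
compared with an encoding where an "on" lane is 0xFF, here is a small sketch.
The helper names (vl_mask, mask_popcount, mask_invert, pack_byte_lanes) are
made up for this example, and masks are modelled as plain Python integers:

```python
def vl_mask(vl: int) -> int:
    """All-ones mask covering elements 0..vl-1."""
    return (1 << vl) - 1


def mask_popcount(mask: int, vl: int) -> int:
    """Number of active elements with 1 bit per element."""
    return bin(mask & vl_mask(vl)).count("1")


def mask_invert(mask: int, vl: int) -> int:
    """Invert the predicate for elements 0..vl-1 only."""
    return ~mask & vl_mask(vl)


def pack_byte_lanes(mask: int, vl: int) -> int:
    """Alternative encoding for comparison: 0xFF per active element."""
    packed = 0
    for i in range(vl):
        if (mask >> i) & 1:
            packed |= 0xFF << (8 * i)
    return packed


if __name__ == "__main__":
    mask, vl = 0b1011, 4
    print(mask_popcount(mask, vl))          # active elements, directly
    print(bin(mask_invert(mask, vl)))       # inverted predicate
    raw = bin(pack_byte_lanes(mask, vl)).count("1")
    print(raw, raw >> 3)                    # byte lanes need an extra shift
```

With the byte-per-lane encoding the raw popcount is 8 times too large, so the
extra shift is needed before the result is usable as an element count.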
> 
> the only thing that's really going to work is to have *element* based
> predicates.  Cray-style architectures (including RVV) do this by allocating
> an entire element of a vector as a predicate (ignoring all but the *one*
> LSB).

IIRC, the Cray-1 uses a mask register with 1 bit per lane/element -- similar to
what I'm advocating for.

> 
> our equivalent is "registers".  actual scalar registers.
> 
> in other words: to solve the problem that you highlighted (overlaps) we
> *need* each predicate to be in *independent* scalar registers.

No, we just need each predicate to be in separate microarchitectural
registers -- we *don't* need separate ISA-level registers. That's why I'm
advocating for having about 2 ISA-level integer registers each split into many
separate microarchitectural registers. All other integer registers are still
1:1 with microarchitectural registers (ignoring splitting into 8/16/32/64-bit
element-sized pieces).
> 
> and it turns out that PowerISA has something that we happen to already have
> planned to allocate DM space for them, even though they're only 4 bit wide:
> CRs.

If we decide to go with the design I'm advocating for, we would have 1x 32-bit
CR split into 8x 4-bit subregisters (standard scalar), 128x 64-bit FP
registers, 126x 64-bit integer registers not optimized for masks, and 2x 64-bit
integer registers optimized for masking by each being split into 8 or 16
subregisters which are interleaved to form the whole 64-bit register, as
described in comment 53.
> 
> so Vectorised CRs _are_ a bit of a mess, but they're a mess because unusual
> bitmanip ops don't exist for them (only AND/OR/NAND/XOR etc.) and that can
> be solved by just vectorising mfcr, and running int scalar bitmanip ops. 
> which we can macro-op fuse if we really want to (later).

mfcr has the wrong semantics, since we want each element to be 1 bit, but mfcr
has each element being 4 bits. We would need to add a new instruction, or, we
can bypass that whole mess by using integer registers as I suggested.
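
A sketch of the mismatch: the function below models the extra "gather one bit
per CR field" step that would be needed after an mfcr-style read before the
result could be used as a 1-bit-per-element mask. The name cr_to_mask is
hypothetical, and the layout is an assumption of this example: a 32-bit CR
image with CR0 in the most significant nibble and LT, GT, EQ, SO ordered from
MSB to LSB within each field.

```python
def cr_to_mask(cr: int, bit: int = 2, fields: int = 8) -> int:
    """Gather one bit from each 4-bit CR field into a dense mask.

    cr     -- 32-bit CR image, CR0 in the top nibble (assumed layout)
    bit    -- which bit of each field to take: 0=LT, 1=GT, 2=EQ, 3=SO
    fields -- number of CR fields (elements) to gather, at most 8
    Element f (CRf) ends up in bit f of the result.
    """
    mask = 0
    for f in range(fields):
        nibble = (cr >> (28 - 4 * f)) & 0xF
        if (nibble >> (3 - bit)) & 1:
            mask |= 1 << f
    return mask


if __name__ == "__main__":
    cr = 0xA0000002                  # CR0 = LT|EQ, CR7 = EQ
    print(bin(cr).count("1"))        # raw popcount counts the LT bit too
    print(bin(cr_to_mask(cr)))       # dense mask of the EQ bits only
```

A popcount on the raw CR image counts every set bit in every field, so it does
not give the number of active elements until this gather step has been done.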
> 
> 
> > Also, they have a ISA-level limiting effect on large VLs
> > because of quickly running out of the 64 CRs when you need multiple masks
> > (common in non-trivial shaders).
> 
> i think we can solve that one by doing 128 CRs.  that gives a total of
> 128x4=512 bits worth of predicate mask space.  and intregs can be used as
> "spill" if we absolutely have to.

It's much cleaner at the ISA level to just use integer registers instead of
trying to force CRs to work. Also, for most practical purposes, because of how
CRs are written by almost all instructions, you'd only have 128 bits of
effective mask space -- exactly as much as I'm proposing for the integer
register version. Additionally, we wouldn't have to deal with the extra
ISA-level mess created by more CRs, since those integer registers appear just
like normal integer registers. We also wouldn't need to decode extra
instructions just to be able to use popcount on a mask, saving icache space as
well as time, since fewer instructions are needed to do the same thing.
