[libre-riscv-dev] [Bug 139] Add LD.X and ST.X? Strided

Wed Oct 9 11:43:35 BST 2019

http://bugs.libre-riscv.org/show_bug.cgi?id=139

--- Comment #61 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #59)
> (In reply to Luke Kenneth Casson Leighton from comment #56)
> > SUBVL-predication mode idea
> > ----------------------
> > 
> > one possible solution here is to (somehow) jam in a mode which says
> > "hey, you know we said SUBVL doesn't have predication, and that the
> > predicate bit applies to *all* of SUBVL? well, um, for this operation,
> > it does".
>
> That would require calculating the expanded predicate a lot, since most
> performance-critical instructions will be predicated.

predicate bits go directly into the "shadow" side of the element-issue
Dependency Matrices.  there's no computation involved... oh wait, you
mean when setting them up?  yes if it's in a register... hmmm ok scratch
that idea.
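
for reference, one reading of the "expanded predicate" mentioned above: under
the default rule a single predicate bit covers every sub-element of a SUBVL
group, so a per-sub-element mode would need each bit replicated SUBVL times
whenever the predicate comes from a register. a minimal sketch of that
expansion follows. the function name and the bit ordering (bit i governs
group i, lowest sub-element first) are illustrative assumptions, not taken
from any spec.

```python
def expand_predicate(pred: int, vl: int, subvl: int) -> int:
    """Replicate each of the vl predicate bits subvl times.

    Assumed layout: bit i of pred governs the sub-elements at bit
    positions i*subvl .. i*subvl + subvl - 1 of the result.
    """
    expanded = 0
    group = (1 << subvl) - 1  # subvl consecutive one-bits
    for i in range(vl):
        if (pred >> i) & 1:
            expanded |= group << (i * subvl)
    return expanded


# VL=3, SUBVL=2: elements 0 and 2 are enabled, element 1 is masked out.
mask = expand_predicate(0b101, vl=3, subvl=2)
```

doing this once is cheap, but it would have to be repeated for every distinct
predicate value, which is the cost being objected to.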

> 
> Not sure if the extra instructions are worth it.

agreed.

> Also, if we ever want to make a high-performance GPU, VL being limited to 16
> would be a major drawback (AMDGPU uses 32 (only on Navi) or 64, I think
> NVIDIA uses 32 though could be wrong)

yehyeh.  ok scratch that one.

> > 8-bit-swizzle idea
> > -------------
> > 
> > going back to 8-bit on swizzle rather than 12-bit, the remaining 4 bits
> > can be used as a predicate SUBVL immediate DESTMASK.
> > 
> > that solves the issue of whether it's run-time safe (in the immediate
> > case).
> > 
> > also, with DESTSUBVL being redundant (directly equivalent *to* DESTMASK)
> > that gives two bits back [useable for other opcodes]
> 
> actually only 1, since SUBVL <= 3 have SUBVL in what would be the upper bits
> of the immediate
> 
> > 
> > i'd really like to know why other GPUs only have 8-bit swizzle,
> > rather than having constants.  is setting from constants that common
> > that they really *need* special treatment?
> 
> setting from constants is quite common, though a large part of the reason
> why I picked 3 bits per element is to support [f]swizzle2.

that doesn't make any immediate/logical sense to me.
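
for reference, the 8-bit-swizzle-plus-DESTMASK idea quoted above can be
sketched as follows: 2 bits per lane select one of four source elements, and
the 4 bits left over from the 12-bit field act as a per-lane write mask. the
field layout (swizzle in the low 8 bits, mask in the top 4, lane 0 in the
lowest bits) and both function names are assumptions made for illustration
only. the thread does not fix an encoding.

```python
def split_imm12(imm12):
    """Split a 12-bit immediate into (swizzle8, destmask). Layout is assumed."""
    return imm12 & 0xFF, (imm12 >> 8) & 0xF


def apply_swizzle(src, swizzle8, destmask, dest):
    """Return a copy of dest with the masked lanes replaced by swizzled src.

    swizzle8: four 2-bit selectors, lane 0 in bits 1:0 (assumed order).
    destmask: four bits, bit i set means lane i of dest is written.
    """
    out = list(dest)
    for lane in range(4):
        if (destmask >> lane) & 1:
            sel = (swizzle8 >> (2 * lane)) & 0b11
            out[lane] = src[sel]
    return out


# "wzyx" swizzle (0x1B) with only lanes 0 and 2 written (mask 0b0101).
swz, mask = split_imm12(0x51B)
result = apply_swizzle(["x", "y", "z", "w"], swz, mask, ["a", "b", "c", "d"])
```

with only 2 bits per lane there is no spare selector value for constants,
which is the trade-off against the 3-bits-per-element scheme.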

> > 
> > 
> > SVP for 1 swizzle, opcode for the other idea
> > ---------------------------------------
> > 
> > we definitely do not have room to fit 2 swizzles (16 bit) and it seems
> > from Midgard that they're certainly needed.
> 
> I'm not so sure 2 swizzles per instruction are needed... seems excessive. I'm
> guessing Midgard just did that because they had the room.

i'd like to work out why before we make any firm decisions.

> > VBLOCK covers the multi swizzle scenario fine, but SVP does not.
> > if one swizzle immediate can be jammed into the SVP64 prefix,
> > the other can be in the operation.  if the SVP64 swizzle prefix is not
> > used, the "rules" state that the same swizzle (in the opcode)
> > applies to *both* operands.
> 
> I still think swizzles should be separate operations and macro-op fusion can
> be used if merging them with an ALU op is needed.
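
for reference, the fallback rule quoted above (no SVP64 swizzle prefix means
the opcode's swizzle applies to both operands) can be sketched as below. which
operand the prefix swizzle would bind to is not stated in the thread, so the
src1/src2 assignment here is an assumption, as is the function name.

```python
from typing import Optional, Tuple


def resolve_swizzles(opcode_swizzle: int,
                     prefix_swizzle: Optional[int] = None) -> Tuple[int, int]:
    """Return (src1_swizzle, src2_swizzle).

    Assumption: when a prefix swizzle is present it goes to src1 and the
    opcode swizzle to src2. The thread does not say which way round.
    """
    if prefix_swizzle is None:
        # No prefix: the opcode swizzle is shared by both operands.
        return opcode_swizzle, opcode_swizzle
    return prefix_swizzle, opcode_swizzle


shared = resolve_swizzles(0x1B)          # both operands use 0x1B
split = resolve_swizzles(0x1B, 0xE4)     # prefix for src1, opcode for src2
```

note that "prefix absent" is modelled as None rather than 0, since an all-zero
swizzle is itself a valid selector pattern.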

VBLOCK has room reserved for swizzles, so they can be applied to
operations.  the entire principle on which VBLOCK is based is to use
VBLOCK to "compactify" SVP encodings which would otherwise take up
significantly more space.

thus - one very important thing - concepts added to SVP *must* be
mappable (without loss of functionality) *to* VBLOCK in an easily
translatable fashion.  no exceptions.

here however we're looking at adding OP32 *instructions*, so it is new
territory.

thinking about it a little: i'd really like it to be possible to replace
(translate) these [new] opcodes with a VBLOCK applied to a straight C.MV.
if that's even practical.  and if it actually saves space.

usually it does (for groups of more than 3 SVP instructions).

argh this is complicated! no wonder nobody in software/hardware libre
has designed a GPU before! :)
