[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Tue Jan 29 09:55:49 GMT 2019

On Tue, Jan 29, 2019, 01:36 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Tuesday, January 29, 2019, Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > On Mon, Jan 28, 2019, 20:34 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > wrote:
> >
> > > On Sun, Jan 27, 2019 at 11:36 PM Jacob Lifshay <programmerjake at gmail.com>
> > > wrote:
> > >
> > > > Note for Luke: this has old stuff, so don't skip over
> > >
> > meant to say new.
> >
> >
> :)
>
>
> > >  ok.
> > >
> > > > A table of the scalar/vector encodings:
> > > > 1 register, integer:
> > > > 1-bit field
> > > > v: 0
> > > > s: 1
> > >
> > >  if we have implicit scalar-vector from the register numbering-prefix,
> > > a separate field isn't needed.
> > >
> > what I had meant is that we would have a scalar/vector indicator and the
> > base 5-bit register number field and when the indicator is vector then we
> > would convert the 5-bit field abcde to deabc00 and when the indicator is
> > scalar it would be converted to 00abcde.
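
The field expansion described above can be sketched in Python roughly as
follows. This is only an illustration, not anything from the SV spec: the
function name `expand_reg` is hypothetical, and it assumes that `a` is the
most significant bit of the 5-bit field and that the result is a 7-bit
register number.

```python
def expand_reg(field5: int, is_vector: bool) -> int:
    """Expand a 5-bit register field abcde into a 7-bit register number.

    Assumption (not stated in the thread): 'a' is the most significant bit.
      vector: abcde -> deabc00
      scalar: abcde -> 00abcde
    """
    if not 0 <= field5 < 32:
        raise ValueError("register field must fit in 5 bits")
    if not is_vector:
        return field5              # 00abcde: the field is used unchanged
    de = field5 & 0b11             # low two bits: d, e
    abc = field5 >> 2              # high three bits: a, b, c
    return (de << 5) | (abc << 2)  # deabc00


# Example: the field 0b10110 has a=1, b=0, c=1, d=1, e=0.
vector_reg = expand_reg(0b10110, True)   # deabc00 with these bits
scalar_reg = expand_reg(0b10110, False)  # 00abcde with these bits
```

Under these assumptions every vector register number comes out as a
multiple of 4, and the 32 possible field values map to 32 distinct
registers.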
>
>
> >
> Oh ok.
>
>
> > This is equivalent to the scalar/vector-mod-4 scheme we previously
> > discussed except that we only need 2 bits of scalar/vector for
> > 3-register integer ops, saving the bit needed for elwidth. we don't need
> > to worry about 4-register integer ops since there aren't any and fp ops
> > have an elwidth field in the underlying op so we don't need one in the
> > prefix, leaving enough bits in the prefix for 4-register fp ops (fmadd).
> >
> >
> 32-bit FP ops on a mod-4 boundary of 64-bit registers will cover EIGHT
> elements, not 4. The top 4 will be totally inaccessible except when VL is
> in the range 5 to 8.
>
> Is this an issue? Is it ok to waste parts of the regfile due to them being
> inaccessible?
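
The arithmetic behind the "EIGHT not 4" remark can be checked with a small
Python sketch. The helper names and the default of 4 registers per group
are assumptions made for illustration only; the 64-bit register width and
32-bit element width are taken from the paragraph above.

```python
def elements_in_group(num_regs: int = 4, reg_bits: int = 64,
                      elwidth: int = 32) -> int:
    """Number of elwidth-bit elements that fit in a group of registers."""
    return (num_regs * reg_bits) // elwidth


def unused_elements(vl: int, num_regs: int = 4, reg_bits: int = 64,
                    elwidth: int = 32) -> int:
    """Element slots in the group that a vector op of length vl never touches."""
    capacity = elements_in_group(num_regs, reg_bits, elwidth)
    return max(capacity - vl, 0)


capacity_fp32 = elements_in_group()             # 4 regs * 64 bits / 32 bits
capacity_fp64 = elements_in_group(elwidth=64)   # same group, 64-bit elements
wasted_at_vl4 = unused_elements(4)              # top slots unused when VL = 4
```

With VL = 4 half of the group goes unused, and only VL values of 5 to 8
reach the upper slots.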
>
I don't think it will be an issue since VL * vl-mul will usually be bigger.
It's the same as the top portion of the registers in the V extension being
wasted when VL is small.

One nice part of vl-mul is that a single instruction can write the entire
register file for context switch.

>
> >
> > >  reduce operations i decided in the original SV to not include, as it
> > > creates dependencies that i felt would be better expressed as straight
> > > loops.  instead, the for-loop for the "hardware-macro-unrolling" would
> > > simply terminate after the first element operation successfully
> > > completed, taking predication into account in that.
> > >
> > >  so VEXTRACT and VINSERT just become accidentally-implemented
> > > side-effects of the loop termination.
> >
> > I really think we should add reduce operations because they are really
> > handy in matrix multiplication, which is used in both neural nets and 3D
> > graphics.
>
>
> Ok there is a better way.  Using reduce jams up the ALUs, which have to
> wait for the accumulator.
>
Ok, yeah, I'm convinced. Let's leave reduce out and then we can change the
scalar result to be the first enabled lane or zeroed if no lanes are
enabled.
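
A minimal Python sketch of that rule, assuming bit i of an integer
predicate enables lane i (the function name `scalar_result` and the
predicate encoding are illustrative assumptions, not part of any spec):

```python
from typing import Sequence


def scalar_result(lanes: Sequence[int], predicate: int) -> int:
    """Return the value of the first enabled lane, or 0 if none is enabled.

    Assumption: bit i of 'predicate' enables lane i.
    """
    for i, value in enumerate(lanes):
        if (predicate >> i) & 1:
            return value   # stop after the first enabled element
    return 0               # no lanes enabled: result is zeroed


lanes = [10, 20, 30, 40]
first = scalar_result(lanes, 0b0110)   # lanes 1 and 2 enabled
none = scalar_result(lanes, 0b0000)    # nothing enabled
```

With a predicate that has a single bit set, this picks out exactly one
element, which is how an extract falls out of the loop termination
described earlier in the thread.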

>
> There are 3 loops in a mmult. One of them may be a vector op (i think it is
> always the inner loop that gets vectorised). If the inner one targets the
> same xy point in the dest matrix you REQUIRE a reduce op, and jam up the
> ALUs.
>
> If however you swap the order of the 3 loops then you do not need a reduce.
> You get an accumulation over time (a *row* of the dest is calculated
> partially, in parallel) and the element ops may be issued in parallel
> inside the OoO engine, without jamming.
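
The two loop orders can be compared in plain Python. This is only a
scalar sketch with hypothetical function names; it shows which destination
element each inner-loop step touches and makes no claim about how the
hardware would issue the operations.

```python
from typing import List

Matrix = List[List[int]]


def matmul_reduce_order(a: Matrix, b: Matrix) -> Matrix:
    """Inner loop over k: every step accumulates into the same c[i][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):          # serial dependency on c[i][j]
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_row_order(a: Matrix, b: Matrix) -> Matrix:
    """Inner loop over j: each step updates a different element of row i."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):          # independent updates across the row
                c[i][j] += aik * b[k][j]
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
same = matmul_reduce_order(a, b) == matmul_row_order(a, b)
```

Both orders compute the same product. In the second, the inner loop's
updates do not depend on each other, so a row of the destination is
accumulated over successive values of k.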
>
>
>
>
> >
> > a reducing version of fmadd is basically a column or row of a vl-mul by
> > VL matrix multiply operation with one of the input matrices transposed
> > (aka a vector of dot-products).
> >
> > We don't need to specify a fixed order for reduction for SV's spec, it
> > just needs to be deterministic, depending only on the specific operation,
> > VL, and vl-mul. This allows us to operate on 4 elements at a time for
> > most of the reduction. Order is irrelevant anyway for integer reductions.
> >
> > vextract and vinsert ops are the scalar version of the strided register
> > to register move (basically strided ld/st except on in-register vectors
> > instead of in-memory vectors) operations that I recommended adding
> > earlier, with the added benefit of not needing to build a predicate to
> > use it.
> >
> >
> The predicate is basically 1<<pos; an r-to-r mv requires some form of
> setup as well.
>
Not necessarily: I think we should add both versions that take their
arguments from registers and versions that take an immediate. The immediate
field will allow better scoreboard dependency tracking, and the immediate
form is by far the most common kind.

Jacob