[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Fri Jan 25 21:39:57 GMT 2019

On Fri, Jan 25, 2019, 05:34 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Friday, January 25, 2019, Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > On Fri, Jan 25, 2019 at 2:24 AM Luke Kenneth Casson Leighton <
> > lkcl at lkcl.net>
> > wrote:
> >
> > > hiya jacob, ok so i had a couple of days to think.
> > >
> > > my main concern about modifying encoding of RV instructions is that SV
> > > then becomes a dead-end as far as wider adoption is concerned, due to
> > > the violation of the rule that "no instructions shall need
> > > modification to be made parallel".  people considering adopting SV on
> > > custom (or future) extensions may not have the same operands or types
> > > of operands / modifiers, and the lack of clarity and simplicity makes
> > > them stay away.
> > >
> > Yeah, I was starting to get a little concerned about that, the prefix
> > proposal doesn't exactly have a consistent pattern as it is. I think I went
> > a little overboard with the different prefix formats.
> >
> >
> :) if you hadn't, there would be nothing to compare against / evaluate.
>
>
> > >
> > > also: if we create what is effectively a new encoding, we might as
> > > well stop entirely on SV, and implement RVV plus some custom
> > > RVV-xBitManip extensions.  it would be a lot less work, particularly
> > > software-wise.
> > >
> > Yeah, leave that for a backup plan if SV turns out to not work or we run
> > out of time or something.
> >
> >
> Agreed.
>
>
> > >
> > If we switch the 2-bit scalar/vector to 1-bit scalar/vector-mod-4, we will
> > have more bits left over.
>
>
> True. Concern / issue: mod4 means 32bit elements are actually 8 per 64bit
> group.
>
> Not sure what to do about that.
>
>
> > Assuming we do implement scalar/vector-mod-N we should use moving the MSB
> > to the LSB to save muxes, like the rest of RISC-V, rather than shifting:
>
>
> Huh, how about that. Always wondered why the encoding was so weird.
>
>
> >         // sends abcde to deabc00
> >         ((reg_num as u7 & 0x3) << 5) | (reg_num as u7 & 0x1C)
>
>
> Who knew :)
>
>
> > We may want to define a slightly different transformation for C
> > instructions, to allow the 3-bit register fields to be the most useful.
>
>
> Same trick, yes agreed.
>
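The register-number transformation quoted above can be sketched as runnable
Rust. This is only an illustration: the function name `expand_reg` is made up
here, and `u8` stands in for the `u7` of the pseudocode, which is not a real
Rust type.

```rust
/// Hypothetical helper: map a 5-bit register field `abcde` to the 7-bit
/// register number `deabc00` by moving the two low bits to the top
/// instead of shifting the whole field.
fn expand_reg(reg_num: u8) -> u8 {
    // Only 5 bits are meaningful in the instruction's register field.
    let reg_num = reg_num & 0x1F;
    // `de` goes to bits 6:5, `abc` stays in bits 4:2, bits 1:0 are zero.
    ((reg_num & 0x3) << 5) | (reg_num & 0x1C)
}
```

With this mapping `expand_reg(0b00001)` is 32 and `expand_reg(0b00100)` is 4,
so every result is a multiple of 4 and bits 4:2 of the field pass straight
through without a mux.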
>
> >
> > Also, I think that we need to not have separate source and dest elwidths
> > except on mv/conv since we can then dedicate a clock cycle to the type
> > conversions rather than trying to pack it in every operation.
>
>
> >
> Ah yes very good point. That would mean elwidth stays same for src and
> dest, not so wasteful on routing side.
>
> Hmmm.... hmm... however if routing is needed for mv/conv, it could be used
> for 2src ops too. Except... 2src ops means double the routing bandwidth
> (and width conversions) or a clock cycle penalty.
>
> Complex.
>
> Prefer the suggestion you made. Single src conversion to single dest.
>
> That means 4 prefix bits freed for ALU ops. FP ops have 16/32/64/128 as
> part of RV encoding already (except C). INT ops are weird.
>
> I think save at least 1 bit for doing something for C ops and also 32 bit
> int. That way it may be possible to at least get C ops to do 32 bit FP
> elements (they can't right now except by setting RV32 Mode)
>
> Really, I prefer 2 for 32bit int ops and all C ops, that way it's always
> possible to specify 8/16/32/default.
>
32-bit int ops can have a single bit that switches from 32/default
(OP-32/OP) to 8/16.
We can have a different prefix encoding for C ops to have the required 2
bits, or, since they should expand 1:1 to 32-bit ops, we can just have
combinations for the most common prefixes, requiring full instructions for
uncommon cases.

> > > that would leave 2 bits spare which could be used for more
> > > operation-specific uses such as LD/ST behaviour.
> > >
> >
> > > what do you think?
> > >
> > Yeah, sounds good. If we don't have enough for LD/ST, we can always add
> > custom instructions (not by abusing the prefix system).
>
>
> Crucial strategic op missing is MVX:
> regs[rd]= regs[regs[rs1]]
>
we could modify the definition slightly:
for i in 0..VL {
    let offset = regs[rs1 + i];
    // we could also limit on out-of-range
    assert!(offset < VL); // trap on fail
    regs[rd + i] = regs[rs2 + offset];
}

The dependency matrix would have the instruction depend on everything from
rs2 to rs2 + VL, and we let the execution unit figure it out. For
simplicity, we could extend the dependencies to a power of 2 or something.
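
As a rough software model of the modified definition (a sketch only, not a
statement of how the hardware would do it), the loop can be written as
runnable Rust. Assumptions made here for illustration: the register file is a
flat slice of 64-bit elements, the names `mvx` and `MvxTrap` are invented, and
the trap is modelled as an `Err` return. Elements written before the trap stay
written, which the definition above does not settle either way.

```rust
/// Hypothetical stand-in for the trap taken on an out-of-range index.
#[derive(Debug, PartialEq)]
struct MvxTrap {
    /// Element number at which the out-of-range index was found.
    element: usize,
}

/// Indexed move over `vl` elements, following the loop above literally:
/// regs[rd + i] = regs[rs2 + regs[rs1 + i]].
fn mvx(
    regs: &mut [u64],
    rd: usize,
    rs1: usize,
    rs2: usize,
    vl: usize,
) -> Result<(), MvxTrap> {
    for i in 0..vl {
        let offset = regs[rs1 + i] as usize;
        // The index must select one of the vl source elements.
        if offset >= vl {
            return Err(MvxTrap { element: i });
        }
        regs[rd + i] = regs[rs2 + offset];
    }
    Ok(())
}
```

For example, with indices [3, 2, 1, 0] at rs1 and data [10, 20, 30, 40] at
rs2, a call with VL = 4 writes [40, 30, 20, 10] at rd. Every source element
from rs2 to rs2 + VL - 1 is a potential input, which is why the whole range
has to go into the dependency matrix.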

>
> However this is a pig to implement in hw, when it becomes parallel, even
> more so. I did however come up with a schroedinger scheme for predication,
> the predicated ops are allocated to ALUs, which depend on a special
> predication FU and hold a write hazard.
>
> When the predicate is free to be read by the special PrFU, it sends either
> "die" or releases the write hazard line.
>
> I think same thing can be done for MVX. Split into 2 phases (2 FUs), one
> which reads the regfile, &s with 0x7f (whatever), then passes that through
> to 2nd phase to look up in regfile.
>
> Only thing is, damn, it messes up the dependencies. You can't proceed
> further with instruction issue (not to an OoO engine) until all of those
> 2nd phase regfile lookups are known.
>
mvx is a last-resort instruction. We definitely need it, because we can
implement it in HW to be up to several times faster than the fallback
(a bunch of st/ld or a bunch of scalar mv), using much less instruction
issue bandwidth and energy.

We should add some constrained swizzle instructions for the more
pipeline-friendly cases. One that will be important is:
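for i in 0..VL {
    let i = i * 4; // start of this group of 4 elements
    // read the whole group before writing any of it
    let mut s1 = [0; 4];
    for j in 0..4 {
        s1[j] = regs[rs1 + i + j];
    }
    // each 2-bit field of imm selects the source lane for dest lane j
    for j in 0..4 {
        regs[rd + i + j] = s1[(imm >> (j * 2)) & 0x3];
    }
}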
Another is matrix transpose for (2-4)x(2-4) matrices, which we can implement
much like a strided ld/st, except operating on registers.

Note that all of the above operations should be operating on elements, not
registers.
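
A runnable version of the 4-element swizzle above, again only as a sketch
under assumptions: the register file is modelled as a flat slice of 64-bit
elements, `swizzle4` is an invented name, and the `groups` parameter is the
number of 4-element groups to process (whether that count should be VL or
VL / 4 is left open here).

```rust
/// Swizzle each group of 4 elements using an 8-bit immediate that holds
/// one 2-bit source-lane selector per destination lane.
fn swizzle4(regs: &mut [u64], rd: usize, rs1: usize, imm: u8, groups: usize) {
    for g in 0..groups {
        let base = g * 4;
        // Read the whole group before writing it, as the pseudocode does.
        let mut s1 = [0u64; 4];
        for j in 0..4 {
            s1[j] = regs[rs1 + base + j];
        }
        for j in 0..4 {
            // Bits (2j+1):(2j) of imm pick the source lane for lane j.
            let sel = ((imm >> (j * 2)) & 0x3) as usize;
            regs[rd + base + j] = s1[sel];
        }
    }
}
```

With imm = 0x1B (selectors 3, 2, 1, 0 for lanes 0 to 3) each group is
reversed, so [1, 2, 3, 4] becomes [4, 3, 2, 1]. With imm = 0x00 lane 0 is
broadcast to all four lanes.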

>
> Reason: only when all the 1st phase regfile lookups are known do you know
> which hazards need to be created in the Dependency Matrices.
>
> It would be much easier to have REMAP/SHAPE, as that does not involve
> creating a 2 phase decode that blocks even the instruction decode phase.
>
Same reasoning as mv/conv applies here: you don't need that nearly often
enough to dedicate the extra hw/encoding bits to allow that for every
operation.

>
> If only 1 reg in the proposed new op contained the map, something similar
> to xbitmanip butterfly or  REMAP permutations, at least just like for
> predication the instruction decode phase would be held up waiting for only
> 1 reg read, not VL reg reads.
>

Jacob

>