[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Jacob Lifshay programmerjake at gmail.com
Sun Jan 27 23:36:36 GMT 2019


Note for Luke: this has old stuff, so don't skip over it.

On Sun, Jan 27, 2019, 08:11 Aleksandar Kostovic <alexandar.kostovic at gmail.com> wrote:

> >
> > you forgot rs2. some ops (shift, mulhsu, sub, div, mod, etc.) are not
> > commutative, so it'd be nice to have both scalar<<vector and
> > vector<<scalar.
>
>
> So you are suggesting that we should build custom opcodes that take, say,
> a range of scalars and turn it into a vector, and opcodes that take vectors
> and break them down into scalars?
>
Not quite. The ops never take or produce more than 1 scalar per
argument when the VL multiplier is 1. So, an instruction like this:
sub.h.vsv rd, rs1, rs2, len=VL*1, pred=!rp // subtract half-word vector scalar vector
does this operation:
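#include <cstddef>
#include <cstdint>
union Reg // in little endian
{
    // element counts per 64-bit register; must be static to be legal members
    static constexpr size_t u64_count = 1;
    static constexpr size_t u32_count = 2;
    static constexpr size_t u16_count = 4;
    static constexpr size_t u8_count = 8;
    uint64_t u64[u64_count];
    uint32_t u32[u32_count];
    uint16_t u16[u16_count];
    uint8_t u8[u8_count];
};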
Reg regs[128];

auto pred = ~regs[rp].u64[0];
for(uint64_t i = 0; i < VL; i++)
{
    if(pred & (1ULL << i))
    {
        auto full = i / Reg::u16_count;
        auto part = i % Reg::u16_count;
        regs[rd + full].u16[part] = regs[rs1].u16[0] - regs[rs2 + full].u16[part];
    }
}
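
A minimal runnable sketch of the sub.h.vsv operation above, for anyone who wants to experiment with it. The RegFile type and the sub_h_vsv function name are hypothetical, not part of the proposal. Instead of a union it stores the registers as raw bytes and uses memcpy, which sidesteps reading an inactive union member; it assumes the host byte order is an acceptable stand-in for the little-endian layout and that VL is at most 64.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical register-file model: 128 x 64-bit registers stored as raw bytes.
// Element `elem` of type T, counted from register `reg`, lives at byte offset
// reg * 8 + elem * sizeof(T), so large element indices spill into reg + 1, etc.
// This is the same mapping as the full/part split in the pseudocode.
struct RegFile
{
    static constexpr std::size_t reg_count = 128;
    static constexpr std::size_t reg_bytes = 8;
    std::array<std::uint8_t, reg_count * reg_bytes> bytes{};

    template <typename T>
    T get(std::size_t reg, std::size_t elem) const
    {
        std::size_t offset = reg * reg_bytes + elem * sizeof(T);
        assert(offset + sizeof(T) <= bytes.size());
        T value;
        std::memcpy(&value, bytes.data() + offset, sizeof(T));
        return value;
    }

    template <typename T>
    void set(std::size_t reg, std::size_t elem, T value)
    {
        std::size_t offset = reg * reg_bytes + elem * sizeof(T);
        assert(offset + sizeof(T) <= bytes.size());
        std::memcpy(bytes.data() + offset, &value, sizeof(T));
    }
};

// sub.h.vsv rd, rs1, rs2, len=VL*1, pred=!rp
// rd and rs2 are half-word vectors, rs1 is a half-word scalar.
void sub_h_vsv(RegFile &rf, std::size_t rd, std::size_t rs1, std::size_t rs2,
               std::size_t rp, std::uint64_t VL)
{
    assert(VL <= 64); // the predicate register only has 64 bits
    std::uint64_t pred = ~rf.get<std::uint64_t>(rp, 0); // pred=!rp inverts the mask
    for (std::uint64_t i = 0; i < VL; i++)
    {
        if (pred & (1ULL << i))
        {
            std::uint16_t a = rf.get<std::uint16_t>(rs1, 0); // scalar operand
            std::uint16_t b = rf.get<std::uint16_t>(rs2, i); // vector operand
            rf.set<std::uint16_t>(rd, i, static_cast<std::uint16_t>(a - b));
        }
    }
}
```

With VL=6 the destination spans two registers: elements 0 to 3 land in rd and elements 4 and 5 land in rd + 1. Elements whose bit is set in rp are left untouched because the predicate is inverted.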

When the VL multiplier is more than 1:
sll.b.vvs rd, rs1, rs2, len=VL*N, pred=rp // shift left logical byte vector vector scalar

It treats the scalar argument as a vector of length N.
It does this operation:

auto pred = regs[rp].u64[0];
for(uint64_t i = 0; i < VL * N; i++)
{
    if(pred & (1ULL << (i / N)))
    {
        auto full = i / Reg::u8_count;
        auto part = i % Reg::u8_count;
        auto si = i % N;
        auto s_full = si / Reg::u8_count;
        auto s_part = si % Reg::u8_count;
        regs[rd + full].u8[part] = regs[rs1 + full].u8[part] << regs[rs2 + s_full].u8[s_part];
    }
}
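
To make the index arithmetic in the loop above easier to check by hand, here is a small sketch that computes, for one element index i, which predicate bit controls it and where the vector and "scalar" operands are read from. The ElemIndex struct and index_for function are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the indices the loop above derives for element i.
struct ElemIndex
{
    std::uint64_t pred_bit; // predicate bit controlling this element (i / N)
    std::uint64_t full;     // register offset from rd / rs1
    std::uint64_t part;     // element slot inside that register
    std::uint64_t s_full;   // register offset from rs2 (the length-N "scalar")
    std::uint64_t s_part;   // element slot inside that register
};

inline ElemIndex index_for(std::uint64_t i, std::uint64_t N,
                           std::uint64_t elems_per_reg)
{
    assert(N > 0 && elems_per_reg > 0);
    std::uint64_t si = i % N; // position inside the length-N scalar argument
    return ElemIndex{i / N, i / elems_per_reg, i % elems_per_reg,
                     si / elems_per_reg, si % elems_per_reg};
}
```

For example, with N=3 and byte elements (8 per register), element 10 is controlled by predicate bit 3, is written to byte 2 of rd + 1, and takes its shift amount from byte 1 of rs2. Each predicate bit therefore covers a group of N consecutive elements.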

We haven't yet defined what happens with a scalar rd and vector inputs. I'm
thinking that we will define it to do a vector reduction. For fmadd and
similar, the reduction is over the add, essentially making a dot-product
instruction.
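
Since this behaviour is not defined yet, the following is only one possible reading of "reduction over the add" for fmadd with a scalar rd: start from the addend and accumulate the product of each enabled element pair. The function name fmadd_reduce, the flat std::vector operands, and the in-order accumulation are all assumptions made for the sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical semantics: rd = addend + sum over enabled i of a[i] * b[i].
// Accumulation order is element 0 upwards; real hardware might pick another.
float fmadd_reduce(const std::vector<float> &a, const std::vector<float> &b,
                   float addend, std::uint64_t pred)
{
    assert(a.size() == b.size());
    assert(a.size() <= 64); // one predicate bit per element
    float acc = addend;
    for (std::size_t i = 0; i < a.size(); i++)
    {
        if (pred & (1ULL << i))
        {
            acc = std::fma(a[i], b[i], acc); // fused multiply-add per element
        }
    }
    return acc;
}
```

With all predicate bits set and an addend of 0 this is a plain dot product; masked-off elements simply contribute nothing to the sum.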

Note that if we disallow vector reduction for 3 register integer
instructions, we can encode the useful combinations of vector/scalar in 2
bits:
vvv: 00
vvs: 01
vsv: 10
sss: 11

vss can be done by either changing the later uses of rd to take a scalar or
using a vector splat.

Vector reductions can then be encoded using 2 register instructions by
changing the 2-bit vector/scalar field to:
vv: 00
vs: 01
sv: 10
ss: 11

So:
addi.w.sv rd, rs1, imm, len=VL*N, pred=rp
Does:
auto pred = regs[rp].u64[0];
for(uint64_t si = 0; si < N; si++)
{
    auto s_full = si / Reg::u32_count;
    auto s_part = si % Reg::u32_count;
    regs[rd + s_full].u32[s_part] = imm;
}
for(uint64_t i = 0; i < VL * N; i++)
{
    if(pred & (1ULL << (i / N)))
    {
        auto full = i / Reg::u32_count;
        auto part = i % Reg::u32_count;
        auto si = i % N;
        auto s_full = si / Reg::u32_count;
        auto s_part = si % Reg::u32_count;
        regs[rd + s_full].u32[s_part] += regs[rs1 + full].u32[part];
    }
}
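
The same reduction on a simplified model, as a sketch that can be compiled and tested. It assumes flat std::vector storage instead of the register file (so rd and rs1 cannot overlap) and VL of at most 64; the function name addi_w_sv is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reduce VL groups of N 32-bit elements into N accumulators, each of which
// starts at imm, mirroring the two loops of the pseudocode above.
std::vector<std::uint32_t> addi_w_sv(const std::vector<std::uint32_t> &src,
                                     std::uint32_t imm, std::uint64_t VL,
                                     std::uint64_t N, std::uint64_t pred)
{
    assert(N > 0);
    assert(VL <= 64); // one predicate bit per group of N elements
    assert(src.size() >= VL * N);
    std::vector<std::uint32_t> acc(N, imm); // first loop: rd[si] = imm
    for (std::uint64_t i = 0; i < VL * N; i++)
    {
        if (pred & (1ULL << (i / N)))
        {
            acc[i % N] += src[i]; // second loop: rd[si] += rs1[i]
        }
    }
    return acc;
}
```

For src = {1, 2, 3, 4, 5, 6}, VL=3, N=2, imm=10 and all predicate bits set, the result is {19, 22}: the even-indexed and odd-indexed elements are summed separately, each on top of the immediate.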

A table of the scalar/vector encodings:
1 register, integer:
1-bit field
v: 0
s: 1

2 registers, integer:
2-bit field
vv: 00
vs: 01
sv: 10
ss: 11

3 registers, integer:
2-bit field
vvv: 00
vvs: 01
vsv: 10
sss: 11

1 register, float:
1-bit field
v: 0
s: 1

2 registers, float:
2-bit field
vv: 00
vs: 01
sv: 10
ss: 11

3 registers, float:
3-bit field
vvv: 000
vvs: 001
vsv: 010
vss: 011
svv: 100
svs: 101
ssv: 110
sss: 111

4 registers, float:
4-bit field
vvvv: 0000
vvvs: 0001
vvsv: 0010
vvss: 0011
vsvv: 0100
vsvs: 0101
vssv: 0110
vsss: 0111
svvv: 1000
svvs: 1001
svsv: 1010
svss: 1011
ssvv: 1100
ssvs: 1101
sssv: 1110
ssss: 1111
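
The tables above can be summarised by a small decoder, sketched here under the assumption that the field is read most-significant bit first (rd first), with 0 meaning vector and 1 meaning scalar. The function name decode_sv_field is hypothetical. The 3-register integer case is the only one that does not follow the one-bit-per-register pattern, so it gets its own lookup table.

```cpp
#include <cassert>
#include <string>

// Returns the scalar/vector pattern (e.g. "vsv") for a given field value.
// num_regs: number of register operands (1 to 4).
// is_float: selects the float tables; integer has no 4-register table,
//           so an empty string is returned for that combination.
std::string decode_sv_field(unsigned num_regs, bool is_float, unsigned field)
{
    assert(num_regs >= 1 && num_regs <= 4);
    if (num_regs == 3 && !is_float)
    {
        // 2-bit field: reductions disallowed, so only four combinations.
        static const char *const table[4] = {"vvv", "vvs", "vsv", "sss"};
        assert(field < 4);
        return table[field];
    }
    if (num_regs == 4 && !is_float)
    {
        return ""; // not listed in the tables above
    }
    assert(field < (1u << num_regs));
    std::string pattern;
    for (unsigned bit = num_regs; bit-- > 0;)
    {
        pattern += ((field >> bit) & 1u) ? 's' : 'v'; // 1 = scalar, 0 = vector
    }
    return pattern;
}
```

Note how field value 3 decodes to "sss" for a 3-register integer op but to "vss" for a 3-register float op, which is the difference between the 2-bit and 3-bit encodings.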

> Sorry if this has been asked before, but I am trying to grasp all the
> things right now.
>
No problem.

>
> On Sun, Jan 27, 2019 at 4:46 PM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
>
> > On Sun, Jan 27, 2019, 06:20 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> > wrote:
> >
> > > On Sunday, January 27, 2019, Jacob Lifshay <programmerjake at gmail.com>
> > > wrote:
> > >
> > > >
> > > > For my previous prefix proposal, I just assumed that the encodings used in
> > > > OP-32 would be a subset of the encodings in OP, which seemed pretty
> > > > rational as there's still plenty of unused space.
> > > >
> > > >
> > > Yehyeh. And you can see why the decisions were made. AND is left out
> > > because obviously, AND of 64 bit is no different from AND of 32 bit, just
> > > drop the top 32 bits.
> > >
> > > However, other ops that cannot be emulated by chopping the top 32 bits
> > > have their own OP-32 encodings.
> > >
> > > The assumption has been designed around non-vector engines, basically. It
> > > would be madness to add special 16 bit scalar ops to a 64 bit system,
> > > total waste of opcode space.
> > >
> > > Vector and SIMD is a completely different story. Now performance matters.
> > > Doing 8 bit ops using 64 bit ALUs is utterly wasteful.
> > >
> > > Unfortunately, adding custom 8/16 bit ops is not a viable option - not if
> > > SV is to be a general parallel processing abstraction layer, that is.
> > > Using one custom 32 bit opcode for a hybrid Prefixed-Compressed extension
> > > is about as far as we can push it (or reuse the RVV space).
> > >
> > > The brownfield encoding space may end up being used for future extensions:
> > > xBitManip being the most likely candidate.
> > >
> > From what I saw last time I looked at the public info, they ended up using
> > some more of funct7 similar to how the M extension does rather than using
> > something that is different between OP and OP-32.
> >
> > >
> > > So where are we. 12 bits.
> > >
> > > * 5 bits vlpr5
> > > * 1 for rd (or 2? issue with 32bit ops)
> > > * 1 for rs (or 2?)
> > >
> > you forgot rs2. some ops (shift, mulhsu, sub, div, mod, etc.) are not
> > commutative, so it'd be nice to have both scalar<<vector and
> > vector<<scalar.
> >
> > > * 2 for elwidth, on arith ops?
> > >
> > fp arith ops have width builtin, so only needed on int ops
> >
> > > * LD/ST contains 8/16/32/64 in op already. Use 2 bits for stride etc mode?
> > > * 1 spare
> > >
> >
> > Jacob
> >
> > >
> > _______________________________________________
> > libre-riscv-dev mailing list
> > libre-riscv-dev at lists.libre-riscv.org
> > http://lists.libre-riscv.org/mailman/listinfo/libre-riscv-dev
> >

