[libre-riscv-dev] SVprefix v0.2

Mon Feb 18 08:37:12 GMT 2019

On Mon, Feb 18, 2019 at 12:10 AM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> Ok so twin predication, which is an extremely powerful way to express all
> of gather, scatter, splat, insert, extract, is missing.
>
> Twin predication in the original SV is present on all 2op instructions,
> including LD, ST, MV, FCVT, FCLASS, FSGN, and the int to fp conversions.
>
> Without twin predication that entire suite of instructions is missing some
> very powerful benefits, which have to be substituted by explicit
> instructions.
>
> Since it is impossible to consider adding all of those explicit
> instructions, only a handful could be considered, and the general idea is
> to avoid adding any at all.
>
> Twin predication can be added by giving an alternative meaning to the
> predicate bits for 2-op instructions:
>
>   00: no predication
>   01: single predication on src using pr1
>   10: single predication on dest using pr2
>   11: twin predication using pr1 and pr2
>
> More bits, if available, can invert some of the pr1/pr2 combinations etc.
> However, that is not as high a priority as src/dest predication.
>
> That in turn also removes the need for the gather/scatter state in LD/ST,
>
We would still need the Vstart CSR for loads/stores that fault after
executing part of the instruction.
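
To illustrate what I mean, here is a rough C model of a restartable vector
load. It is only a sketch: VecState, vec_load and the page_missing array are
names I made up for the example (page_missing stands in for the MMU), and the
behaviour assumed is the usual one where Vstart records the first element
that has not yet been executed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model state, not the actual SVprefix CSR layout. */
typedef struct {
    size_t vstart; /* index of the first element still to be executed */
} VecState;

/*
 * Executes elements [vstart, vl) of a unit-stride vector load.
 * page_missing[i] is an assumption of this sketch: true means the access
 * for element i would trap.
 * Returns true if the instruction completed, false if it faulted.
 */
bool vec_load(VecState *st, uint32_t *dest, const uint32_t *src,
              size_t vl, const bool *page_missing)
{
    assert(st->vstart <= vl);
    for (size_t i = st->vstart; i < vl; i++) {
        if (page_missing[i]) {
            /* Remember where to resume; elements before i are already done. */
            st->vstart = i;
            return false;
        }
        dest[i] = src[i];
    }
    st->vstart = 0; /* finished: the next instruction starts from element 0 */
    return true;
}
```

After the trap handler maps the page, re-executing the same instruction
resumes at the faulting element rather than repeating the earlier ones.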

> and no need for a gather scatter instruction.
>
Gather/scatter is more powerful than twin predication. The following code
can't be expressed in a single instruction using twin predication, but can
using register gather:
float dest[4];
float src[4];
int indexes[4] = {0, 3, 2, 2};
for(int i = 0; i < 4; i++)
{
    dest[i] = src[indexes[i]];
}

Same with the following:
struct Node
{
    Node *next;
    Data data;
};
Node *src[N];
Node *dest[N];
for(int i = 0; i < N; i++)
{
    dest[i] = src[i]->next;
}

Also, gather/scatter is more widely understood, and compilers already
support optimizing for it.
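
To make the difference concrete, here is a rough C model of the two
operations. The twin-predicated move is my reading of the original SV
semantics (source and destination each skip their own masked-off elements,
and both element indexes only ever move forward); treat it as an assumption
rather than a spec. The function and parameter names are made up for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Register gather: any source element can land in any destination slot,
 * in any order, and the same source element can be used more than once. */
void reg_gather(float *dest, const float *src, const int *indexes, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        assert(indexes[i] >= 0 && (size_t)indexes[i] < vl);
        dest[i] = src[indexes[i]];
    }
}

/* Twin-predicated move (assumed semantics): i walks the enabled source
 * elements, j walks the enabled destination elements, both in ascending
 * order. Masked-off destination elements keep their previous value. */
void twin_pred_mv(float *dest, const float *src,
                  const bool *psrc, const bool *pdest, size_t vl)
{
    size_t i = 0, j = 0;
    while (i < vl && j < vl) {
        if (!psrc[i]) { i++; continue; }  /* skip masked-off source */
        if (!pdest[j]) { j++; continue; } /* skip masked-off destination */
        dest[j++] = src[i++];
    }
}
```

In this model i never goes backwards and each source element is consumed at
most once, so an index pattern like {0, 3, 2, 2} (out of order, with a
repeat) is not reachable with any choice of the two predicates, while
reg_gather handles it directly.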

>
>
>
> Also there is no bit for specifying zeroing or non zeroing mode in
> predication.  If there is no room, non zeroing (skipping elements) would, I
> feel, be preferable as it allows interleave of predicate with inverted
> predicate masks, to give 100% ALU utilisation on OoO.
>
> Which, if decided, would need to be documented.
>
OK, let's go with leaving masked-off elements at their previous value. It
will just be a pain later if we want a higher-performance version, because
the previous contents of rd have to be muxed into the output, which means
we would need more register read ports.
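
A small C model of the two modes, just to show where the extra read comes
from. This is a sketch under my own naming (add_nonzeroing, add_zeroing), not
a description of any actual datapath.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Non-zeroing: a masked-off element keeps its old value, so the old
 * contents of rd are an input to every element (the extra read). */
void add_nonzeroing(int32_t *rd, const int32_t *rs1, const int32_t *rs2,
                    const bool *pred, size_t vl)
{
    assert(rd != NULL && pred != NULL);
    for (size_t i = 0; i < vl; i++) {
        int32_t old = rd[i]; /* third operand read alongside rs1, rs2 */
        rd[i] = pred[i] ? rs1[i] + rs2[i] : old;
    }
}

/* Zeroing: a masked-off element is overwritten with 0, so rd is
 * write-only and only rs1 and rs2 need to be read. */
void add_zeroing(int32_t *rd, const int32_t *rs1, const int32_t *rs2,
                 const bool *pred, size_t vl)
{
    assert(rd != NULL && pred != NULL);
    for (size_t i = 0; i < vl; i++)
        rd[i] = pred[i] ? rs1[i] + rs2[i] : 0;
}
```

The non-zeroing form is also what makes the interleaving trick work: a second
operation under the inverted predicate fills in exactly the elements the
first one left alone.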

>
> Twin predication is I realise very odd. I have never encountered a design
> that has it. I do not know why. I do know that its implementation may
> require a serial algorithm for certain combinations (or that getting full
> parallelism may be tricky).
>
> Certainly, the OoO instruction issue will need to be stalled when a twin
> predication op is encountered, as it is not only impossible to know how
> many instructions will need to be issued, you have no idea which registers
> will be needed either.
>
> I know we discussed a potential way round that (reserving a *range* of
> registers), it may still apply, here, just that there are two ranges to
> reserve not one.
>
> L.