[libre-riscv-dev] Instruction sorta-prefixes for easier high-register access

Thu Jan 31 06:08:45 GMT 2019

On Wed, Jan 30, 2019, 20:34 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> ok so i thought about the vlpN concept, and if the register-prefixing
> encodes scalar/vector already, then reserving '0b00 for "scalar" is
> redundant.  it would therefore be better to split out VL from
> predicate specs.
>
I disagree: having a combined field allows using the otherwise-reserved
predicate of "never" to encode other, less common VL-multipliers.

predicate:
000: x0 (never)
001: pr1
010: pr2
011: pr3
100: ~x0 (always)
101: ~pr1
110: ~pr2
111: ~pr3
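
As a rough illustration of the table above, the 3-bit field can be read as
an invert bit plus a 2-bit register index. This is only a sketch: the names
pred_spec, decode_pred, pred_is_never and pred_is_always are made up for
this example, and how the "never" slot would actually be reused for
VL-multipliers is not decided here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded form of the 3-bit predicate field (hypothetical layout,
 * read directly off the table above). */
typedef struct {
    bool invert;  /* top bit: 1 selects the inverted (~) form */
    uint8_t reg;  /* low two bits: 0 = x0, 1..3 = pr1..pr3 */
} pred_spec;

static pred_spec decode_pred(uint8_t field)
{
    assert(field < 8); /* only 3 bits are defined */
    pred_spec p;
    p.invert = (field & 0x4) != 0;
    p.reg = (uint8_t)(field & 0x3);
    return p;
}

/* 000 (x0) masks off every element, so a normal instruction never
 * needs it; that is the slot that could carry other encodings. */
static bool pred_is_never(uint8_t field)
{
    pred_spec p = decode_pred(field);
    return !p.invert && p.reg == 0;
}

/* 100 (~x0) enables every element, i.e. an unpredicated operation. */
static bool pred_is_always(uint8_t field)
{
    pred_spec p = decode_pred(field);
    return p.invert && p.reg == 0;
}
```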

>
> there is however a small problem with VL multipliers: they break the
> Vectorisation Loop paradigm, turning it effectively into a SIMD-like
> one instead.

Not really, see following examples.

>
> i am slightly concerned that the templates for VL-based loops would
> need to be much more complex (less uniform), as the multipliers now
> need to be taken into account within the loop, on a per-instruction
> basis instead of a per-loop basis.
>
VL multipliers are basically embedding the short-length (1 to 4) SIMD
vectors used in Vulkan shaders into a VL-based vectorization loop.

With standard VL-based vectorization, the loop:
float a[], b[], c[], d[];
for(int i = 0; i < 1000; i++)
{
    a[i] = b[i] + c[i] * d[i];
}
vectorization produces:
for(int i = 0; i < 1000;)
{
    VL = setvl(1000 - i);
    vecVL v = loadVL(&b[i]) + loadVL(&c[i]) * loadVL(&d[i]);
    storeVL(&a[i], v);
    i += VL;
}
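
For anyone who wants to run the loop above, here is a minimal scalar
emulation in plain C. MAXVL is an assumed hardware limit picked for the
example, and setvl here is just a min() standing in for the real
instruction; the inner loop stands in for one set of vector instructions.

```c
#include <assert.h>
#include <stddef.h>

#define MAXVL 8 /* hypothetical hardware maximum vector length */

/* Emulates setvl: grants at most MAXVL elements per iteration. */
static size_t setvl(size_t remaining)
{
    return remaining < MAXVL ? remaining : MAXVL;
}

/* Computes a[i] = b[i] + c[i] * d[i], VL elements at a time. */
static void fma_stripmined(float *a, const float *b, const float *c,
                           const float *d, size_t n)
{
    for (size_t i = 0; i < n;) {
        size_t vl = setvl(n - i);
        assert(vl > 0 && vl <= MAXVL);
        /* one "vector instruction group" covering vl elements */
        for (size_t e = 0; e < vl; e++)
            a[i + e] = b[i + e] + c[i + e] * d[i + e];
        i += vl;
    }
}
```

The loop body is identical for every iteration, including the last partial
one: only the value returned by setvl changes.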

With vl-multipliers, we can similarly vectorize the loop:
struct VertexIn
{
    vec3 position;
    vec3 normal;
    vec4 color; // rgba
};
struct VertexOut
{
    vec4 position; // xyzw
    vec4 color;
};
VertexIn vertexes_in[];
VertexOut vertexes_out[];
vec3 light_dir;
float ambient, diffuse;
for(int i = 0; i < 1000; i++)
{
    // calculate vertex colors using
    // lambert's cos model and fixed ambient brightness
    vec3 n = vertexes_in[i].normal;
    vec3 l = light_dir;
    float dot = n.x * l.x + n.y * l.y + n.z * l.z;
    float brightness = max(dot, 0.0) * diffuse + ambient;
    vec4 c = vertexes_in[i].color;
    c.rgb *= brightness;
    vertexes_out[i].color = c;
    // orthographic projection
    vertexes_out[i].position = vec4(vertexes_in[i].position, 1.0);
}

vectorization produces:
for(int i = 0; i < 1000;)
{
    VL = setvl(1000 - i);
    vec3xVL n = load3xVL_strided(&vertexes_in[i].normal, sizeof(VertexIn));
    vec3 l = light_dir;
    vecVL dot = n.x * l.x + n.y * l.y + n.z * l.z;
    vecVL brightness = max(dot, 0.0) * diffuse + ambient;
    vec4xVL c = load4xVL_strided(&vertexes_in[i].color, sizeof(VertexIn));
    vec3xVL c_rgb = c.rgb;
    c_rgb *= brightness;
    c.rgb = c_rgb;
    store4xVL_strided(&vertexes_out[i].color, c, sizeof(VertexOut));
    vec4xVL p = 1.0;
    p.xyz = load3xVL_strided(&vertexes_in[i].position, sizeof(VertexIn));
    store4xVL_strided(&vertexes_out[i].position, p, sizeof(VertexOut));
    i += VL;
}
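
The same loop can be emulated in plain C to check that the strided
multi-component loads and stores behave as intended. This is a sketch
under assumptions: MAXVL, load_strided, store_strided and shade are names
invented for the example, and the "register" layout (mult floats per
element, elements packed back to back) is just one possible choice, not
something the proposal fixes.

```c
#include <assert.h>
#include <stddef.h>

#define MAXVL 4 /* hypothetical hardware limit on VL */

typedef struct { float position[3], normal[3], color[4]; } VertexIn;
typedef struct { float position[4], color[4]; } VertexOut;

static size_t setvl(size_t remaining)
{
    return remaining < MAXVL ? remaining : MAXVL;
}

/* Emulates loadNxVL_strided: copies mult floats for each of vl
 * elements, stepping the source by stride bytes per element. */
static void load_strided(float *dst, const void *src, size_t mult,
                         size_t stride, size_t vl)
{
    for (size_t e = 0; e < vl; e++) {
        const float *p = (const float *)((const char *)src + e * stride);
        for (size_t k = 0; k < mult; k++)
            dst[e * mult + k] = p[k];
    }
}

/* Emulates storeNxVL_strided: the inverse of load_strided. */
static void store_strided(void *dst, const float *src, size_t mult,
                          size_t stride, size_t vl)
{
    for (size_t e = 0; e < vl; e++) {
        float *p = (float *)((char *)dst + e * stride);
        for (size_t k = 0; k < mult; k++)
            p[k] = src[e * mult + k];
    }
}

static void shade(VertexOut *out, const VertexIn *in, size_t count,
                  const float l[3], float ambient, float diffuse)
{
    for (size_t i = 0; i < count;) {
        size_t vl = setvl(count - i);
        assert(vl > 0 && vl <= MAXVL);
        float n[3 * MAXVL], pos[3 * MAXVL], c[4 * MAXVL], p[4 * MAXVL];
        load_strided(n, in[i].normal, 3, sizeof(VertexIn), vl);
        load_strided(c, in[i].color, 4, sizeof(VertexIn), vl);
        load_strided(pos, in[i].position, 3, sizeof(VertexIn), vl);
        for (size_t e = 0; e < vl; e++) {
            float dot = n[3 * e] * l[0] + n[3 * e + 1] * l[1]
                      + n[3 * e + 2] * l[2];
            float brightness = (dot > 0.0f ? dot : 0.0f) * diffuse + ambient;
            for (size_t k = 0; k < 3; k++) {
                c[4 * e + k] *= brightness; /* c.rgb *= brightness */
                p[4 * e + k] = pos[3 * e + k]; /* p.xyz = position */
            }
            p[4 * e + 3] = 1.0f; /* w component */
        }
        store_strided(out[i].color, c, 4, sizeof(VertexOut), vl);
        store_strided(out[i].position, p, 4, sizeof(VertexOut), vl);
        i += vl;
    }
}
```

The multiplier only shows up as the mult argument of each load and store;
the loop control (setvl, i += VL) is the same as in the scalar-element case.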

The vl-multipliers that Vulkan needs are 1, 2, 3, and 4.
The vl-multipliers that OpenCL needs are 1, 2, 3, 4, 8, and 16, though we
can probably get away with just 1 to 4 and use multiple vectors for 8 and
16.
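
To make "use multiple vectors for 8 and 16" concrete, here is a small
sketch of how a compiler might split a wide multiplier into chunks that
fit a 1-to-4 encoding. MAX_MULT and split_multiplier are hypothetical
names for this example only.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MULT 4 /* largest multiplier assumed to be encodable */

/* Splits a requested multiplier into chunks of at most MAX_MULT.
 * Writes the chunk sizes to parts[] (capacity cap) and returns how
 * many chunks, i.e. how many separate vectors, are needed. */
static size_t split_multiplier(unsigned mult, unsigned parts[], size_t cap)
{
    size_t count = 0;
    while (mult > 0) {
        unsigned chunk = mult < MAX_MULT ? mult : MAX_MULT;
        assert(count < cap);
        parts[count++] = chunk;
        mult -= chunk;
    }
    return count;
}
```

With this split, an OpenCL float8 becomes two 4xVL vectors and a float16
becomes four, while 1 to 4 map to a single vector each.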

So, hopefully, the examples I gave help show how vl-multipliers are
probably the most straightforward way to vectorize graphics code with
variable length vectors.

Jacob