[libre-riscv-dev] [isa-dev] 3D Open Graphics Alliance

Mon Aug 12 07:31:30 BST 2019

On Sun, Aug 11, 2019 at 8:17 PM lkcl <luke.leighton at gmail.com> wrote:
> > So I agree with you that we should look at SPIR-V and the Vulkan ISA seriously.
> > Now that ISA is very complex and many of the instructions may possibly be
> > reconstructed from simpler ones.
>
> yes.  as the slides from SIGGRAPH2019 show, the number of opcodes needed is enormous.  Texturisation, mipmaps, Z-Buffers, texture buffers, normalisation, dot product, LERP/SLERP, swizzle, alpha blending, interpolation, vectors and matrices, and that's just the *basics*!

note: LERP is short for linear interpolation (SLERP is the spherical variant).
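
For reference, a minimal sketch of lerp in Python. The function names are my own, chosen for illustration, and are not taken from any spec:

```python
def lerp(a, b, t):
    """Linearly interpolate between a and b: t=0 gives a, t=1 gives b."""
    return a + (b - a) * t


def lerp_vec(va, vb, t):
    """Component-wise lerp of two equal-length vectors."""
    return [lerp(a, b, t) for a, b in zip(va, vb)]
```

For example, lerp(0.0, 10.0, 0.25) evaluates to 2.5.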

>
> further optimisations will need to be added over time, as well.
>
> the pressure on the OP32 space is therefore enormous, given that "embedded" low-power requirements cannot be met by moving to the 48 and 64 bit opcode space.
>
> > We need to thus perhaps look at a "minimized" subset of the Vulkan ISA
> > instructions that truly define the atomic operations from which the full
> > ISA can be constructed. So the instruction decode hardware can implement
> > this "higher-level ISA" - perhaps in microcode - from the "atomic ISA"
> > at runtime while hardware support is only provided for the "atomic ISA".
>
> yes.  a microcode engine is something that may be useful in other implementations as well (for other purposes) so should, i feel, be a separate proposal.

I personally think that having LLVM inline the corresponding
implementations of the more complex operations is the better way to
go. Most Vulkan shaders are compiled at run-time, so LLVM can do
feature detection to determine which operations are implemented in
the hardware and which ones need a software implementation.

This greatly reduces the pressure on the opcode space.
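
A toy sketch of that idea, assuming a made-up operation list and made-up expansions. None of these names are LLVM APIs or real opcodes; the point is only that the same shader lowers differently depending on the detected feature set:

```python
# Hypothetical software expansions of complex operations into simpler ones.
SOFTWARE_EXPANSIONS = {
    "lerp": ["sub", "mul", "add"],          # a + (b - a) * t
    "normalize": ["dot", "rsqrt", "mul"],   # v * rsqrt(dot(v, v))
    "dot": ["mul", "add"],                  # multiply, then reduce
}


def lower(ops, hw_features):
    """Lower a list of operations for a target with the given feature set.

    Operations the hardware implements become a single "hw." instruction.
    Others are expanded inline, recursively. Anything with no expansion is
    assumed to be a base instruction that every target has.
    """
    out = []
    for op in ops:
        if op in hw_features:
            out.append("hw." + op)
        elif op in SOFTWARE_EXPANSIONS:
            out.extend(lower(SOFTWARE_EXPANSIONS[op], hw_features))
        else:
            out.append(op)
    return out
```

With hw_features={"dot"}, lowering ["normalize"] gives ["hw.dot", "rsqrt", "mul"]; with an empty feature set the dot product is expanded too.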

> > From the SIGGRAPH BOF it was clear there are competing interests. Some people
> > wanted explicit texture mapping instructions while others wanted HPC-type
> > threaded vector extensions.
>
> interesting.

Having texture mapping instructions in HW is a really good idea for
traditional 3D. Vulkan allows the texture mode to be dynamically
selected, so implementing it in software would require the texture
operations to use dynamic dispatch, maybe with static checks for
recently used modes to reduce the latency. See VkSampler's docs.
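
A rough sketch of what the software path could look like, using a 1D texture with nearest filtering to keep it short. The mode names loosely mirror Vulkan's REPEAT and CLAMP_TO_EDGE address modes, but the functions and the table are my own illustration, not anything from an existing driver:

```python
import math


def wrap_repeat(i, n):
    return i % n


def wrap_clamp_to_edge(i, n):
    return max(0, min(n - 1, i))


# Dispatch table indexed by the sampler's address mode, chosen at run-time.
ADDRESS_MODES = {
    "repeat": wrap_repeat,
    "clamp_to_edge": wrap_clamp_to_edge,
}


def sample_nearest(texels, u, address_mode):
    """Fetch the texel nearest to normalized coordinate u."""
    n = len(texels)
    i = math.floor(u * n)
    # Static check for a mode we expect to be common: no table lookup,
    # and a compiler could inline this branch.
    if address_mode == "repeat":
        return texels[i % n]
    # Everything else goes through dynamic dispatch.
    return texels[ADDRESS_MODES[address_mode](i, n)]
```

A real sampler also has filter modes, mipmap modes, border colors and so on, so the number of combinations to dispatch over grows quickly.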

> > Although each of these can be accommodated we need to adjudicate the location
> > in the process pipeline where they belong - atomic ISA, higher-level ISA or
> > higher-level graphics library.
>
> OpenCL compliance is pretty straightforward to achieve.  it could be done by any standard supercomputer Vector Compute Engine.
>
> a good Vector ISA does **NOT** automatically make a successful GPU (cf: MIAOW, Nyuzi, Larrabee).
>
> 3D Graphics is ridiculously complex and comprehensive, and therefore requires careful step-by-step planning to meet the extremely demanding and heavily optimised de-facto industry-standard expectations met by modern GPUs, today (fixed-functions are out: shader engines are in).

Fixed function operations in the form of custom opcodes or separate
accelerators are still used for several operations that are rather
slow to do with standard Vector or SIMD instructions: Triangle
Rasterization (get list of pixels/fragments/samples in a triangle --
one of the slower parts for SW to implement because of all the finicky
special-cases) and some ray-tracing operations (libre-riscv probably
won't implement those and will just rely on SW, if we implement
ray-tracing extensions at all for our first design).
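
To give an idea of what SW rasterization involves, here is a minimal edge-function sketch. It is an illustration under simplifying assumptions, not our implementation: it samples once per pixel at the pixel center, and it deliberately ignores the top-left fill rule, so pixels exactly on an edge shared by two triangles would be produced twice. Handling that kind of case is where the finicky parts come in.

```python
import math


def edge(ax, ay, bx, by, px, py):
    """Signed area term: which side of edge a->b the point p lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(v0, v1, v2, width, height):
    """Return the (x, y) pixels whose centers lie inside triangle v0 v1 v2."""
    area = edge(*v0, *v1, *v2)
    if area == 0:
        return []  # degenerate triangle covers nothing
    xs = (v0[0], v1[0], v2[0])
    ys = (v0[1], v1[1], v2[1])
    # Clip the bounding box to the render target.
    min_x = max(0, math.floor(min(xs)))
    max_x = min(width - 1, math.ceil(max(xs)))
    min_y = max(0, math.floor(min(ys)))
    max_y = min(height - 1, math.ceil(max(ys)))
    pixels = []
    for y in range(min_y, max_y + 1):
        for x in range(min_x, max_x + 1):
            px, py = x + 0.5, y + 0.5
            w0 = edge(*v1, *v2, px, py)
            w1 = edge(*v2, *v0, px, py)
            w2 = edge(*v0, *v1, px, py)
            if area < 0:  # accept either winding order
                w0, w1, w2 = -w0, -w1, -w2
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                pixels.append((x, y))
    return pixels
```

For the triangle (0,0), (4,0), (0,4) on a 4x4 target this returns the 10 pixels with x + y <= 3.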

> we took the strategy with the Libre RISC-V SoC to do a Vulkan SPIR-V to LLVM-IR compiler for very good reasons.
>
> firstly: that our first milestone - operational compliance on *x86* LLVM - removes the possibility of hardware implementation dependence or bugs, and gets us to a "known good" position to move to the next phase.  like MesaGL, it also makes a good Reference Implementation.
>
> secondly: to begin an *iterative* process of adding in hardware acceleration, associated SPIR-V opcode support and associated LLVM-IR compiler support for the same, one opcode or opcode group at a time.
>
> by evaluating the performance increase at each phase, depending on Alliance Member customer requirements, we can move forward quickly and in a useful and quantitatively measurable fashion, keeping the Khronos Group's Conformance in mind at *every* step.
>
> some of the required opcodes are going to be blindingly obvious (the transcendentals and trigonometrics), others are going to be harder to implement, requiring significant research to track down more than anything.  Vulkan's "Texture" features are liberally sprinkled throughout the spec, for example, and the data structures used in the binary-formatted texture data need to be tracked down.

data formats: https://www.khronos.org/registry/DataFormat/specs/1.1/dataformat.1.1.pdf

> other opcodes will be critically and specifically dependent on the existence of Vector and in some cases Matrix support.  Swizzle is particularly challenging as in its full form it requires a whopping 32 bits of immediate data, in order to cover a 3-arg operand if used with 4-long vectors.

There are 340 (4^4 + 4^3 + 4^2 + 4^1) possible swizzle operations from
a 4-element vector to 1, 2, 3, or 4-element vectors. That should
easily fit in 10 bits with a straightforward encoding. If we want to
swizzle in 4 possible constants (or a second input vector) as well as
the 4 elements of the input vector, there are 4680 (8^4 + 8^3 + 8^2 +
8^1) possible swizzle operations, which I can fit in 14 bits of
immediate with a not particularly dense encoding.
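
The counts are easy to double-check. A small sketch (the helper names are mine) that also shows how many bits a fully dense encoding would need, for comparison with the 10 and 14 bits above:

```python
def swizzle_count(sources, max_len=4):
    """Number of swizzles picking 1..max_len outputs from `sources` inputs."""
    return sum(sources ** n for n in range(1, max_len + 1))


def dense_bits(count):
    """Minimum bits needed to number `count` distinct values."""
    return (count - 1).bit_length()


print(swizzle_count(4), dense_bits(swizzle_count(4)))
print(swizzle_count(8), dense_bits(swizzle_count(8)))
```

So a perfectly dense numbering would save only one bit in each case (9 and 13 bits), at the cost of a much less convenient decoder.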

Encoding for 4 input elements and 4 constants (or second input vector):
2 bits: element-count of output vector
3 bits: output element 0
3 bits: output element 1
3 bits: output element 2
3 bits: output element 3

Encoding for only 4 input elements:
2 bits: element-count of output vector
2 bits: output element 0
2 bits: output element 1
2 bits: output element 2
2 bits: output element 3
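
A minimal sketch of both encodings in Python. The bit order (element count in the low 2 bits, stored as count minus 1, then output element 0 upward) is my assumption for illustration; the field lists above don't fix an order. Pass field_bits=2 for the 4-input form and field_bits=3 for the 8-source form:

```python
def encode_swizzle(selectors, field_bits):
    """Pack 1 to 4 source selectors into an immediate."""
    count = len(selectors)
    if not 1 <= count <= 4:
        raise ValueError("output vector must have 1 to 4 elements")
    imm = count - 1  # 2 bits: element-count of output vector
    for i, sel in enumerate(selectors):
        if not 0 <= sel < (1 << field_bits):
            raise ValueError("selector out of range")
        imm |= sel << (2 + i * field_bits)
    return imm


def decode_swizzle(imm, field_bits):
    """Recover the list of selectors from an immediate."""
    count = (imm & 0b11) + 1
    mask = (1 << field_bits) - 1
    return [(imm >> (2 + i * field_bits)) & mask for i in range(count)]


def apply_swizzle(imm, field_bits, vec, extra=()):
    """Build the output vector. Selectors 0-3 index vec, 4-7 index extra
    (the constants or second input vector)."""
    sources = list(vec) + list(extra)
    return [sources[s] for s in decode_swizzle(imm, field_bits)]
```

For example, encode_swizzle([3, 2, 1, 0], 2) reverses a 4-element vector, and the largest possible immediates are 1023 (10 bits) and 16383 (14 bits). Unused selector fields of shorter output vectors are simply left as zero, which is part of why this encoding is not dense.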

So, you overestimated the number of immediate bits needed by quite a lot.

Jacob