[libre-riscv-dev] [isa-dev] 3D Open Graphics Alliance

Mon Aug 12 18:12:34 BST 2019

On Monday, August 12, 2019 at 1:31:44 AM UTC-5, Jacob Lifshay wrote:
>
> On Sun, Aug 11, 2019 at 8:17 PM lkcl <luke.l... at gmail.com> 
> wrote: 
> > > So I agree with you that we should look at SPIR-V and the Vulkan ISA 
> seriously. 
> > > Now that ISA is very complex and many of the instructions may possibly 
> be 
> > > reconstructed from simpler ones. 
> > 
> > yes.  as the slides from SIGGRAPH2019 show, the number of opcodes needed 
> is enormous.  Texturisation, mipmaps, Z-Buffers, texture buffers, 
> normalisation, dotproduct, LERP/SLERP, swizzle, alpha blending, 
> interpolation, vectors and matrices, and that's just the *basics*! 
>
> note: lerp is another term for linear interpolation 
>
> > 
> > further optimisations will need to be added over time, as well. 
> > 
> > the pressure on the OP32 space is therefore enormous, given that 
> "embedded" low-power requirements cannot be met by moving to the 48 and 64 
> bit opcode space. 
> > 
> > > We need to thus perhaps look at a "minimized" subset of the Vulkan ISA 
> > > instructions that truly define the atomic operations from which the 
> full 
> > > ISA can be constructed. So the instruction decode hardware can 
> implement 
> > > this "higher-level ISA" - perhaps in microcode - from the "atomic ISA" 
> > > at runtime while hardware support is only provided for the "atomic 
> ISA". 
> > 
> > yes.  a microcode engine is something that may be useful in other 
> implementations as well (for other purposes) so should, i feel, be a separate 
> proposal. 
>
> I personally think that having LLVM inline the corresponding 
> implementations of the more complex operations is the better way to 
> go. For most Vulkan shaders, they are compiled at run-time, so LLVM 
> can do feature detection to determine which operations are implemented 
> in the hardware and which ones need a software implementation. 
>
> This reduces the pressure on the opcode space by a lot. 
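
As a rough illustration of the lowering idea above (not LLVM's actual API): a toy pass that keeps an operation when the target reports it as native and otherwise inlines a software expansion. The operation names, the SOFTWARE_EXPANSIONS table and the lower() helper are all hypothetical, chosen only for this sketch.

```python
# Toy model of "use the hardware op if present, otherwise inline a
# software expansion". All names here are hypothetical; this is not LLVM.

# Software expansions in terms of simpler operations.
# lerp(a, b, t) = a + t * (b - a)  ->  fsub, fmul, fadd
SOFTWARE_EXPANSIONS = {
    "lerp": ["fsub", "fmul", "fadd"],
    "fma": ["fmul", "fadd"],
}


def lower(op, hw_ops):
    """Return the list of native ops that implements `op` on a target
    whose natively supported operations are given by `hw_ops`."""
    if op in hw_ops:
        return [op]  # implemented in hardware: emit it directly
    if op not in SOFTWARE_EXPANSIONS:
        raise ValueError(f"no hardware or software implementation for {op!r}")
    lowered = []
    for sub_op in SOFTWARE_EXPANSIONS[op]:
        lowered.extend(lower(sub_op, hw_ops))  # expansions may nest
    return lowered


basic_core = {"fadd", "fsub", "fmul"}
gpu_core = basic_core | {"lerp"}

on_basic = lower("lerp", basic_core)  # expanded inline
on_gpu = lower("lerp", gpu_core)      # single native opcode
```

In this model only the operations that turn out to be worth the silicon need an opcode at all; everything else costs code size rather than opcode space.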
>
> > > From the SIGGRAPH BOF it was clear there are competing interests. Some 
> people 
> > > wanted explicit texture mapping instructions while others wanted HPC 
> type 
> > >  threaded vector extensions. 
> > 
> > interesting. 
>
> Having texture mapping instructions in HW is a really good idea for 
> traditional 3D: since Vulkan allows the texture mode to be dynamically 
> selected, implementing it in software would require the 
> texture operations to use dynamic dispatch, perhaps with static checks for 
> recently used modes to reduce the latency. See VkSampler's docs. 
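
To make the dynamic-dispatch cost concrete, here is a heavily simplified 1D sketch of a software texture fetch whose filter and address mode are only known at run time. The string mode names are stand-ins for the VkFilter and VkSamplerAddressMode values, and wrap() and sample_1d() are hypothetical helpers; real samplers also handle mipmaps, 2D/3D coordinates, border colours and more.

```python
import math


def wrap(index, size, address_mode):
    """Map a texel index into [0, size) according to the address mode."""
    if address_mode == "repeat":
        return index % size
    if address_mode == "clamp_to_edge":
        return min(max(index, 0), size - 1)
    raise ValueError(f"unsupported address mode {address_mode!r}")


def sample_1d(texels, u, filter_mode, address_mode):
    """Sample a 1D texture at normalised coordinate u.

    Every fetch branches on state that is only known at run time,
    which is the per-sample overhead the text refers to.
    """
    size = len(texels)
    if filter_mode == "nearest":
        return texels[wrap(math.floor(u * size), size, address_mode)]
    if filter_mode == "linear":
        x = u * size - 0.5  # texel centres sit at i + 0.5
        i0 = math.floor(x)
        t = x - i0
        a = texels[wrap(i0, size, address_mode)]
        b = texels[wrap(i0 + 1, size, address_mode)]
        return a + t * (b - a)  # lerp between the two neighbours
    raise ValueError(f"unsupported filter mode {filter_mode!r}")


texels = [0.0, 10.0, 20.0, 30.0]
nearest = sample_1d(texels, 0.3, "nearest", "clamp_to_edge")
linear_mid = sample_1d(texels, 0.5, "linear", "clamp_to_edge")
edge_clamp = sample_1d(texels, 0.0, "linear", "clamp_to_edge")
edge_repeat = sample_1d(texels, 0.0, "linear", "repeat")
```

The same coordinate gives different results at the edge depending on the address mode, so the mode checks cannot simply be hoisted out unless the sampler state is known when the shader is compiled.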
>
> > > Although each of these can be accommodated we need to adjudicate the 
> location 
> > > in the process pipeline where they belong - atomic ISA, higher-level 
> ISA or 
> > > higher-level graphics library. 
> > 
> > OpenCL compliance is pretty straightforward to achieve.  it could be 
> done by any standard supercomputer Vector Compute Engine. 
> > 
> > a good Vector ISA does **NOT** automatically make a successful GPU (cf: 
> MIAOW, Nyuzi, Larrabee). 
> > 
> > 3D Graphics is ridiculously complex and comprehensive, and therefore 
> requires careful step-by-step planning to meet the extremely demanding and 
> heavily optimised de-facto industry-standard expectations met by modern 
> GPUs, today (fixed-functions are out: shader engines are in). 
>
> Fixed function operations in the form of custom opcodes or separate 
> accelerators are still used for several operations that are rather 
> slow to do with standard Vector or SIMD instructions:

When I did a 3D GPU, there was a separate HW unit for each of:
a) vertex to thread assignment
b) rasterize primitive
c) interpolate rasterized point
d) Texture load
and some higher layer HW that was used to time the various activities 
through the 100,000 clock pipeline.
If you include Tessellation and Geometry, both of which can generate a 
volcano of new primitives, there
are significant performance gains (more than 2×) to be had by doing the 
above in HW function units.

> Triangle 
> Rasterization (get list of pixels/fragments/samples in a triangle -- 
> one of the slower parts for SW to implement because of all the finicky 
> special-cases) and some ray-tracing operations (libre-riscv probably 
> won't implement that and just rely on SW if we implement ray-tracing 
> extensions at all for our first design) 
>
> > we took the strategy with the Libre RISC-V SoC to do a Vulkan SPIR-V to 
> LLVM-IR compiler for very good reasons. 
> > 
> > firstly: that our first milestone - operational compliance on *x86* LLVM 
> - removes the possibility of hardware implementation dependence or bugs, 
> and gets us to a "known good" position to move to the next phase.  like 
> MesaGL, it also makes a good Reference Implementation. 
> > 
> > secondly: to begin an *iterative* process of adding in hardware 
> acceleration, associated SPIR-V opcode support and associated LLVM-IR 
> compiler support for the same, one opcode or opcode group at a time. 
> > 
> > by evaluating the performance increase at each phase, depending on 
> Alliance Member customer requirements, we can move forward quickly and in a 
> useful and quantitatively-measurable fashion, meeting (keeping) the 
> Khronos Group's Conformance in mind at *every* step. 
> > 
> > some of the required opcodes are going to be blindingly obvious (the 
> transcendentals and trigonometrics), others are going to be harder to 
> implement and, more than anything, will require significant research to 
> track down.  Vulkan's "Texture" features are liberally sprinkled throughout 
> the spec, for example, and the data structures used in the binary-formatted 
> texture data need to be tracked down. 
>
> data formats: 
> https://www.khronos.org/registry/DataFormat/specs/1.1/dataformat.1.1.pdf 
>
> > other opcodes will be critically and specifically dependent on the 
> existence of Vector and in some cases Matrix support.  Swizzle is 
> particularly challenging as in its full form it requires a whopping 32 bits 
> of immediate data, in order to cover a 3-arg operand if used with 4-long 
> vectors. 
>
> There are 340 (4^4 + 4^3 + 4^2 + 4^1) possible swizzle operations from 
> a 4-element vector to 1, 2, 3, or 4-element vectors. That should 
> easily fit in 10 bits with a straight-forward encoding. If we want to 
> swizzle in 4 possible constants (or a second input vector) as well as 
> the 4 elements of the input vector, there are 4680 possible (8^4 + 8^3 
> + 8^2 + 8^1) swizzle operations, which I can fit in 14 bits of 
> immediate with a not particularly dense encoding. 
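
The counts quoted above are easy to check. A small sketch (swizzle_count is a hypothetical helper name) that also works out how many bits a fully dense enumeration would need, for comparison with the 10-bit and 14-bit field layouts:

```python
def swizzle_count(sources, max_len=4):
    """Number of swizzles producing 1..max_len output elements, where each
    output element is picked independently from `sources` inputs."""
    return sum(sources ** n for n in range(1, max_len + 1))


count4 = swizzle_count(4)  # 4 input elements only
count8 = swizzle_count(8)  # 4 input elements plus 4 constants / second vector

# Minimum number of bits if every swizzle were simply numbered 0..count-1.
dense_bits4 = (count4 - 1).bit_length()
dense_bits8 = (count8 - 1).bit_length()

print(count4, dense_bits4)  # 340 9
print(count8, dense_bits8)  # 4680 13
```

So a dense numbering would need 9 and 13 bits; the field-per-element layouts spend one extra bit each in exchange for trivial decoding.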
>
> Encoding for 4 input elements and 4 constants (or second input vector): 
> 2 bits: element-count of output vector 
> 3 bits: output element 0 
> 3 bits: output element 1 
> 3 bits: output element 2 
> 3 bits: output element 3 
>
> Encoding for only 4 input elements: 
> 2 bits: element-count of output vector 
> 2 bits: output element 0 
> 2 bits: output element 1 
> 2 bits: output element 2 
> 2 bits: output element 3 
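
A minimal sketch of how the two layouts above could be packed and unpacked. The field order (count in the low bits, then element 0 upwards) and storing the element count as count - 1 are assumptions of this sketch, since the layout above only gives field widths; encode_swizzle, decode_swizzle and apply_swizzle are hypothetical names.

```python
def encode_swizzle(selectors, sel_bits=2):
    """Pack 1..4 selectors into an immediate.

    sel_bits=2 gives the 10-bit form (4 input elements),
    sel_bits=3 gives the 14-bit form (4 inputs + 4 constants/second vector).
    Assumed layout: bits [1:0] hold count - 1, selector i follows at
    bit offset 2 + i * sel_bits.
    """
    if not 1 <= len(selectors) <= 4:
        raise ValueError("output vector must have 1 to 4 elements")
    word = len(selectors) - 1
    for i, sel in enumerate(selectors):
        if not 0 <= sel < (1 << sel_bits):
            raise ValueError(f"selector {sel} does not fit in {sel_bits} bits")
        word |= sel << (2 + i * sel_bits)
    return word


def decode_swizzle(word, sel_bits=2):
    """Inverse of encode_swizzle: return the list of selectors."""
    count = (word & 0b11) + 1
    mask = (1 << sel_bits) - 1
    return [(word >> (2 + i * sel_bits)) & mask for i in range(count)]


def apply_swizzle(word, vec, extra=(), sel_bits=2):
    """Apply an encoded swizzle to `vec`; `extra` supplies selectors 4..7."""
    sources = list(vec) + list(extra)
    return [sources[sel] for sel in decode_swizzle(word, sel_bits)]


# Reverse a 4-vector with the 10-bit form (w, z, y, x).
reverse = encode_swizzle([3, 2, 1, 0])
reversed_vec = apply_swizzle(reverse, [10, 20, 30, 40])

# 14-bit form: take x from the vector and splice in constants 4 and 7.
mixed = encode_swizzle([0, 4, 7], sel_bits=3)
mixed_vec = apply_swizzle(mixed, [10, 20, 30, 40],
                          extra=[0.0, 1.0, -1.0, 0.5], sel_bits=3)
```

With this layout, unused selector fields of shorter output vectors are simply left as zero, which is part of why the encoding is not dense.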
>
> So, you overestimated the number of immediate bits needed by quite a lot. 
>
> Jacob 
>