[libre-riscv-dev] [isa-dev] Re: FP transcendentals (trigonometry, root/exp/log) proposal

Sat Sep 14 02:06:47 BST 2019

On Saturday, September 14, 2019 at 4:56:07 AM UTC+8, Jacob Lifshay wrote:
> Some notes:
> 
> 
> I think it may be worthwhile to have separate Ztrans extension names to indicate the levels of accuracy that are implemented, allowing a low-precision implementation of all instructions outside of F and D while having full-precision implementations of F and D for code compatibility.

Hmm, not sure. My concern is that this would give an NxM table of extension names. There are around 8 Ztrans extensions, times four accuracy levels (so far; I just found that OpenCL is different from Vulkan, so that's 5), which would be 40 potential extension names, rather than N+M, which would be 12-13.
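
To make the NxM versus N+M count concrete, here is a rough sketch. The names are hypothetical placeholders: this thread only gives the counts (around 8 Ztrans sub-extensions, 5 accuracy profiles once OpenCL is counted separately), not an agreed naming scheme.

```python
from itertools import product

# Hypothetical placeholder names, for counting only.
ztrans = [f"Ztrans{i}" for i in range(8)]      # N = 8 sub-extensions
profiles = [f"acc{j}" for j in range(5)]       # M = 5 accuracy profiles

# One extension name per (sub-extension, accuracy) pair: N x M names.
combined = [f"{z}_{p}" for z, p in product(ztrans, profiles)]

# Sub-extensions and accuracy profiles named independently: N + M names.
separate = ztrans + profiles

n_combined = len(combined)
n_separate = len(separate)
print(n_combined, n_separate)
```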

Are any of the permutations "extremely unlikely to be implemented" or "forbiddable"?

> 
> Note that Vulkan requires full IEEE754 precision for all F/D instructions except for fdiv and fsqrt.

Started investigating, found this
https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_Env.html#relative-error-as-ulps

which is not Vulkan, it's OpenCL, which is *different* from Vulkan (sigh) :)

Also this:
https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html#spirvenv-precision-operation

> 
> fdiv and fsqrt are easy enough to implement in full precision using the iterative shift-add/shift-sub algorithms, which take up similar space to a few adders and shift registers and can be shared with the integer divider, so I think it may be better to just require full-precision mode for F/D. There can be a separate slow iterative div/sqrt unit if faster low-precision fdiv/fsqrt are wanted in the main ALUs. The iterative div/sqrt HW would take up much less space than even a multiplier (unless multiplication is also iterative, in which case it can also share HW with the div/sqrt unit).

OK, so for a hybrid design, where compliance with both IEEE754 and Vulkan or OpenCL is required, you are suggesting a pipelined (fast, large-area) OpenCL/Vulkan ALU with reduced accuracy, plus a blocking Finite State Machine unit for IEEE754 which eventually produces the correctly rounded answer?
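
For reference, a minimal software sketch of the shift-subtract style of algorithm such a blocking unit could step through, one result bit per iteration. This is an integer square root only, and the function name and `bits` parameter are assumptions for illustration; a real FP unit would also need exponent handling, guard bits and rounding, none of which is shown.

```python
import math

def isqrt_shift_sub(n: int, bits: int = 32) -> int:
    """Floor square root via restoring shift-subtract, MSB first.

    Each loop iteration uses only shifts, a compare and a subtract,
    which is the work one FSM state would do per clock.
    """
    if bits % 2 != 0:
        raise ValueError("bits must be even")
    if n < 0 or n >= (1 << bits):
        raise ValueError("n out of range for the given width")
    rem = 0    # running remainder
    root = 0   # result bits accumulated so far
    for i in range(bits // 2):
        # Bring down the next two bits of the operand.
        shift = bits - 2 * (i + 1)
        rem = (rem << 2) | ((n >> shift) & 3)
        # Trial subtrahend is (2*root)*2 + 1, i.e. (root << 2) | 1.
        trial = (root << 2) | 1
        root <<= 1
        if rem >= trial:
            rem -= trial
            root |= 1
    return root

# Cross-check against the standard library on a small range.
mismatches = [n for n in range(5000) if isqrt_shift_sub(n) != math.isqrt(n)]
print(len(mismatches))
```

The loop count is fixed by the operand width, so the latency is known in advance even though the unit is blocking.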

The reasoning being (recalling some discussions we had a few months back) that for "good" 3D you absolutely cannot have blocking computations which do not complete in a guaranteed timeframe.

Whereas for standard UNIX workloads that is extremely unlikely to matter.

An augmentation of this idea would be to use Newton-Raphson (NR) or another iterative algorithm as a microcoded final phase, starting from the output of the less accurate pipeline.
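
A rough sketch of that refinement idea, using reciprocal square root as the example. Everything here is an assumption for illustration: `rsqrt_low_precision` is only a stand-in for the fast pipeline (it truncates the exact result to a few mantissa bits), and plain NR like this improves accuracy but does not by itself guarantee IEEE754 correct rounding.

```python
import math

def rsqrt_low_precision(x: float, keep_bits: int = 8) -> float:
    """Hypothetical stand-in for the fast pipeline: 1/sqrt(x) with the
    mantissa truncated to keep_bits bits."""
    exact = 1.0 / math.sqrt(x)
    mantissa, exponent = math.frexp(exact)      # mantissa in [0.5, 1)
    scale = 1 << keep_bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)

def refine_rsqrt(x: float, y: float, steps: int = 2) -> float:
    """Newton-Raphson for f(y) = 1/y**2 - x: y <- y * (1.5 - 0.5*x*y*y).

    Each step roughly squares the relative error of the estimate.
    """
    for _ in range(steps):
        y = y * (1.5 - 0.5 * x * y * y)
    return y

x = 2.0
exact = 1.0 / math.sqrt(x)
seed = rsqrt_low_precision(x)
refined = refine_rsqrt(x, seed)

seed_err = abs(seed - exact) / exact
refined_err = abs(refined - exact) / exact
print(seed_err, refined_err)
```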

> 
> I would expect there to be a fast HW multiplier even on micropower GPUs, because a large proportion of the operations need multiplication, so you could get an overall speedup of several hundred percent over an iterative multiplier.

Indeed.

> 
> 
> 
> 
> OpenCL's accuracy requirements are similar to Vulkan's -- full precision for neg/abs/add/sub/mul/muladd and reduced requirements for everything else.

Looks like a table is needed on the fpacc page, with OpenCL added as its own fpacc table entry.

L.

> 
> Jacob