[libre-riscv-dev] [isa-dev] Re: FP transcendentals (trigonometry, root/exp/log) proposal
luke.leighton at gmail.com
Sat Sep 14 02:06:47 BST 2019
On Saturday, September 14, 2019 at 4:56:07 AM UTC+8, Jacob Lifshay wrote:
> Some notes:
> I think it may be worthwhile to have separate Ztrans extension names to indicate the levels of accuracy that are implemented, allowing a low-precision implementation of all instructions outside of F and D while having full-precision implementations of F and D for code compatibility.
Hum, hum, don't know. My concern: that would be an NxM table of extension names. There are around 8 Ztrans ectensions, times four (so far, just found that OpenCL is different from Vulkan so that's 5) which would be 40 potential different extension names, rather than N+M which would be 12-13.
Are any of the permutations "extremely unlikely to be implemented" or "forbiddable"?
> Note that Vulkan requires full ieee754 precision for all F/D instructions except for fdiv and fsqrt.
Started investigating, found this
which is not Vulkan it's OpenCL, which is *different* from vulkan (sigh) :)
> fdiv and fsqrt are easy enough to implement in full precision using the iterative shift-add/shift-sub algorithms that take up similar space to a few adders and shift registers and can be shared with the integer divider that I think it may be better to just require full precision mode for F/D - there can be a separate slow iterative div/sqrt unit if faster low-precision fdiv/fsqrt are wanted in the main ALUs. the iterative div/sqrt HW would take up much less space than even a multiplier (unless multiplication is also iterative, in which case it can also share HW with the div/sqrt unit).
Ok so for a hybrid design, where compliance with both IEEE754 and Vulkan or OpenCL is required, you are suggesting to do a pipelined (fast, large area) OpenCL/Vulkan ALU, with reduced accuracy, and for IEEE754 have a blocking Finite State Machine unit which eventually produces the correctly rounded answer?
The logical reasoning being (recalling some discussions we had a few months back), that for "good" 3D you absolutely cannot have blocking computations which do not complete in a guaranteed timeframe.
Whereas for standard UNIX workloads that is extremely unlikely to matter.
An augmentation of this idea would be to use NR or other iterative algorithm as a microcode final phase based on the output from the less accurate pipeline.
> I would expect there to be a fast HW multiplier even on micropower gpus because a large proportion of the operations need multiplication so you could get a overall several hundred percent speedup over an iterative multiplier.
> OpenCL's accuracy requirements are similar to Vulkan's -- full precision for neg/abs/add/sub/mul/muladd and reduced requirements for everything else.
Looks like a table is needed on the fpacc page. And OpenCL added as its own fpacc table entry.
More information about the libre-riscv-dev