[libre-riscv-dev] [isa-dev] Re: FP transcendentals (trigonometry, root/exp/log) proposal
mitchalsup at aol.com
Sat Sep 14 02:46:19 BST 2019
From: lkcl <luke.leighton at gmail.com>
To: RISC-V ISA Dev <isa-dev at groups.riscv.org>
Cc: mitchalsup <mitchalsup at aol.com>; allen.baum <allen.baum at esperantotech.com>; luke.leighton <luke.leighton at gmail.com>; libre-riscv-dev <libre-riscv-dev at lists.libre-riscv.org>
Sent: Fri, Sep 13, 2019 8:06 pm
Subject: Re: [isa-dev] Re: FP transcendentals (trigonometry, root/exp/log) proposal
On Saturday, September 14, 2019 at 4:56:07 AM UTC+8, Jacob Lifshay wrote:
> Some notes:
> I think it may be worthwhile to have separate Ztrans extension names to indicate the levels of accuracy that are implemented, allowing a low-precision implementation of all instructions outside of F and D while having full-precision implementations of F and D for code compatibility.
Hum, hum, don't know. My concern: that would be an NxM table of extension names. There are around 8 Ztrans extensions, times four accuracy levels (so far; just found that OpenCL is different from Vulkan, so make that 5), which would be 40 potential different extension names, rather than N+M, which would be 12-13.
Are any of the permutations "extremely unlikely to be implemented" or "forbiddable"?
> Note that Vulkan requires full ieee754 precision for all F/D instructions except for fdiv and fsqrt.
Started investigating, found this,
which is not Vulkan: it's OpenCL, which is *different* from Vulkan (sigh) :)
> fdiv and fsqrt are easy enough to implement in full precision using the iterative shift-add/shift-sub algorithms, which take up similar space to a few adders and shift registers and can be shared with the integer divider. So I think it may be better to just require full-precision mode for F/D - there can be a separate slow iterative div/sqrt unit if faster low-precision fdiv/fsqrt are wanted in the main ALUs. The iterative div/sqrt HW would take up much less space than even a multiplier (unless multiplication is also iterative, in which case it can also share HW with the div/sqrt unit).
Ok so for a hybrid design, where compliance with both IEEE754 and Vulkan or OpenCL is required, you are suggesting to do a pipelined (fast, large area) OpenCL/Vulkan ALU, with reduced accuracy, and for IEEE754 have a blocking Finite State Machine unit which eventually produces the correctly rounded answer?
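As a rough illustration of the shift-subtract style of divider mentioned above, here is a minimal sketch (my own illustrative Python, not from the proposal) of a restoring digit-recurrence integer divide: one quotient bit per iteration, and the hardware amounts to roughly an adder plus shift registers, which is why it can be shared with the integer divider:

```python
# Hedged sketch: restoring (shift-subtract) division. Produces one quotient
# bit per iteration, MSB first. In hardware this is an adder, two shift
# registers, and a comparison - nothing like the area of a multiplier array.
def restoring_divide(dividend: int, divisor: int, bits: int = 32):
    assert divisor != 0 and dividend >= 0
    remainder = 0
    quotient = 0
    for i in range(bits - 1, -1, -1):
        # Shift in the next dividend bit.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        if remainder >= divisor:   # trial subtraction succeeds: emit a 1 bit
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

The same recurrence structure (with a slightly different trial step) gives square root, which is why div and sqrt are usually folded into one unit.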
In MY OPINION:: Once you bite off on the fact that you are going to be doing lots and lots of FMACs, you will find that the overhead to do correct rounding AND support deNormals is close to ZERO. This FMAC unit will include FADD, FSUB, FMUL, FMAC.
The above FMAC can be made to support FDIV, FSQRT, RSQRT at what I consider relatively low cost (basically 2 internal (fraction width + 4-bit) flip-flops, the multiplier array goes from n×n to (n+4)×(n+4), and some constant tables. Here the multiplier array adder dominates the table size.) This MAY not be the most power-efficient version of FDIV,... but it provides the infrastructure for high-performance transcendentals at low cost.
The above FMAC for FDIV, FSQRT can also be made to perform transcendentals for the cost of a feedback loop near the "wide adder" and some more tables. This "about" doubles the table size (from above) and adds another small adder for the feedback loop.
When the building block is an FMAC, there is NO REASON to leave off correct rounding, nor is there any reason to leave out deNorms. The total overhead to support both is in the 1% range--once you have bitten off on the FMAC unit being the base calculation unit for floating point.
The second paragraph above covers basically everything that IEEE requires and what Vulkan requires. The third paragraph builds capability at reasonably low cost.
The logical reasoning being (recalling some discussions we had a few months back), that for "good" 3D you absolutely cannot have blocking computations which do not complete in a guaranteed timeframe.
Whereas for standard UNIX workloads that is extremely unlikely to matter.
An augmentation of this idea would be to use Newton-Raphson (NR) or another iterative algorithm as a microcode final phase, based on the output from the less-accurate pipeline.
> I would expect there to be a fast HW multiplier even on micropower gpus because a large proportion of the operations need multiplication, so you could get an overall several-hundred-percent speedup over an iterative multiplier.
> OpenCL's accuracy requirements are similar to Vulkan's -- full precision for neg/abs/add/sub/mul/muladd and reduced requirements for everything else.
Looks like a table is needed on the fpacc page, with OpenCL added as its own fpacc table entry.