[libre-riscv-dev] [isa-dev] Re: FP transcendentals (trigonometry, root/exp/log) proposal

lkcl luke.leighton at gmail.com
Wed Sep 11 03:28:58 BST 2019

On Wednesday, September 11, 2019 at 1:53:26 AM UTC+1, MitchAlsup wrote:

>> some of the areas have been discussed already: yes, atanh (etc.) can be 
>> synthesised - again, annoyingly, they don't produce correctly-rounded 
>> results that way [ASINH( x ) = ln( x + SQRT(x**2+1))] so in full-accuracy 
>> or high-performance circumstances, Zfhyp is needed.
>
> At this point I should suggest that a survey be done on GPU codes 
> (separate from GPGPU codes) to see which transcendentals show up and 
> how often. My guess is that the set of {SIN, COS, TAN, ATAN, EXP2, Ln2} 
> covers way over the 90%-ile.
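To make the point about synthesised ASINH concrete, here is a minimal Python sketch. It evaluates the quoted identity naively in FP64 and compares it against the library's math.asinh, which is assumed here to be the more accurate reference. The helper names asinh_synth and rel_err are made up for illustration and are not part of any proposal.

```python
import math

def asinh_synth(x: float) -> float:
    """asinh via the identity ln(x + sqrt(x*x + 1)), evaluated naively in FP64."""
    return math.log(x + math.sqrt(x * x + 1.0))

def rel_err(approx: float, reference: float) -> float:
    """Relative error of approx against a (nonzero) reference value."""
    return abs(approx - reference) / abs(reference)

# For small x, the sum x + sqrt(x*x + 1) rounds away most of x's low bits,
# so the synthesised result drifts far from the library value.
for x in (1e-9, 1e-3, 0.5, 2.0):
    print(x, asinh_synth(x), math.asinh(x), rel_err(asinh_synth(x), math.asinh(x)))
```

The loss is worst near zero, where the argument to the logarithm is close to 1. This sketch only covers positive x.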

funnily enough, overnight (really should sleep properly, sigh) i extracted 
the ISA from MALI Midgard and Vivante (etnaviv) and the list was:
    E8 - fatan_pt2
    F0 - frcp (reciprocal)
    F2 - frsqrt (inverse square root, 1/sqrt(x))
    F3 - fsqrt (square root)
    F5 - flog2
    F6 - fsin
    F7 - fcos
    F9 - fatan_pt1

where vivante does not have FATAN/FATAN2, and that's about all.

however these are *mobile* class GPUs, the focus being on reasonable 
performance, battery life, and efficiency.  at the lower end you're lucky 
if they can do 720p @ 30fps and that's fine, because they only use around 
0.5 watts (not 30 watts, not 150 watts: *0.5* watts).

i just found the AMD R600 ISA doc, R600_Instruction_Set_Architecture.pdf
COS (appx)
SIN (appx)

so pretty much the same except no TAN or ATAN, and they have appx variants 
and IEEE754 variants on some of the ops.


>> some implementations (mitch's), they're definitely custom-optimised for 
>> 3D: 0.65-0.5 ULP or so, and targeted at FP32 only.  it'd need an entire 
>> redesign to target FP64 or even IEEE754.
>
> I know of 32-bit transcendental implementations, some based on the work 
> of Pierno, others based on the work of Matula and Briggs; both of these 
> are large table entry (128) and Quadratic, and then there is mine, which 
> is small tables and cubic. The FP32 versions can also be used to 
> calculate FP16 versions.
>
> NONE of these can deliver correctly rounded results (in the IEEE 754-2008 
> sense). The quadratic versions state they give 1 ULP; I have measured 
> them as bad as 1.6 ULP. The GPU std is better than 3 ULP in most cases.
>
> Then there is a different technology I have developed that targets FP64 
> and targets "faithfully" rounded results--in practice, I am achieving an 
> RMS error of 0.502 for the 4 important functions, and a calculation 
> delay similar to that of FDIV. If one wanted to produce IEEE 754-2008 
> results, the coefficient table should be at least 3× bigger and the 
> calculations about 3×-4× longer--all for 0.002 in accuracy.

which is fine for Numerical Computation, not fine for a 
commercially-competitive GPU.
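The gap between a stated and a measured ULP figure can be checked with a small harness. The Python sketch below is one way to do it. The names to_f32, ulp_f32, ulp_error, sin_poly and worst_case are hypothetical. FP64 math.sin is assumed to be a good enough reference for FP32. The degree-5 Taylor polynomial is a deliberately crude stand-in, not any of the implementations discussed above.

```python
import math
import struct

def to_f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def ulp_f32(x: float) -> float:
    """Spacing between adjacent FP32 values near x (normal range, x != 0)."""
    _, e = math.frexp(abs(x))       # abs(x) = m * 2**e with 0.5 <= m < 1
    return math.ldexp(1.0, e - 24)  # FP32 has a 24-bit significand

def ulp_error(approx: float, reference: float) -> float:
    """Error of approx, in FP32 ULPs of the reference value."""
    return abs(approx - reference) / ulp_f32(reference)

def sin_poly(x: float) -> float:
    """Hypothetical stand-in for a hardware unit: degree-5 Taylor series."""
    return to_f32(x - x**3 / 6.0 + x**5 / 120.0)

def worst_case(f, lo: float, hi: float, steps: int = 10000) -> float:
    """Largest ULP error of f against math.sin over sampled FP32 inputs."""
    worst = 0.0
    for i in range(1, steps + 1):
        x = to_f32(lo + (hi - lo) * i / steps)
        worst = max(worst, ulp_error(f(x), math.sin(x)))
    return worst

print("rounded FP64 sin:", worst_case(lambda x: to_f32(math.sin(x)), 0.0, math.pi / 4))
print("degree-5 Taylor: ", worst_case(sin_poly, 0.0, math.pi / 4))
```

Rounding the FP64 reference to FP32 stays within 0.5 ULP by construction, while the truncated series is far worse near the top of the interval. Sampling gives only a lower bound on the true worst case; an exhaustive sweep over all FP32 inputs would be needed to state a firm figure.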

>> whereas CORDIC, although only doing one bit at a time (yes i saw RADIX-4 
>> CORDIC implementations out there), can be adapted to run a few more 
>> iterations (perhaps even use microcode to feed back twice), to get better 
>> accuracy or even handle both FP32 and FP64 with 2x the completion time on 
>> FP64.
>
> The only thing CORDIC has going for it is smallness.

making it perfect for the ultra-low-power use-case of e.g. smartwatches, 
where the power budget for the GPU is measured in milliwatts.
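The "more iterations, more accuracy" property of CORDIC can be sketched in a few lines of Python. This is a floating-point model of radix-2 rotation-mode CORDIC, written only to illustrate the trade-off. Real hardware would use fixed-point shift-and-add with a ROM for the arctangent table. The function name cordic_sin_cos and its iterations parameter are assumptions for this example.

```python
import math

def cordic_sin_cos(theta: float, iterations: int = 24):
    """Rotation-mode CORDIC for |theta| <= pi/2; returns (sin, cos)."""
    # Table of arctan(2^-i); a hardware unit would hold this in a small ROM.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector by sqrt(1 + 2^-2i);
    # pre-scale the starting vector so the final result has unit length.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0   # rotate towards z == 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x

for n in (12, 24, 40):
    s, _ = cordic_sin_cos(0.5, iterations=n)
    print(n, abs(s - math.sin(0.5)))
```

Each extra iteration resolves roughly one more bit of the angle, which is why the same datapath could in principle serve FP32 and FP64 by running longer.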

this exercise allowed me to do a couple of things:

(1) clearly identify the different markets

* ultra-low-power - smartwatches.
* mobile/embedded GPUs and low-end desktop GPUs
* High Performance Computing (and "Gaming" GPUs).

(2) from that, notice that:

* Zftrans, Zftrigpi and Zftrignpi basically cover mobile/embedded 3D near 

* LOG1P and EXPM1 [LOG(e, 1+rs1), POW(e, rs1)-1] should have been moved to 
ZftransExt (alongside EXP10, LOG1, LN etc.)

apart from that, the prior review we (collectively) did a few weeks back 
pretty much called it.  i "guessed" that LOG1P and EXPM1 should go into 
Zftrans - should we "listen" to what commercial (custom-optimised) GPUs 
do?  don't know.
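For context on why LOG1P and EXPM1 exist as separate operations at all: near zero, the naive forms LOG(e, 1+x) and POW(e, x)-1 lose most of their significant digits. A short Python sketch, using the standard library's math.log1p and math.expm1 as stand-ins for dedicated instructions (an assumption made for illustration only):

```python
import math

x = 1e-10

# Naive forms: forming 1 + x (or exp(x), which is close to 1) rounds away
# most of the low-order bits of x before the log or subtraction happens.
naive_log = math.log(1.0 + x)
naive_exp = math.exp(x) - 1.0

# Dedicated forms keep full precision for small x.
good_log = math.log1p(x)
good_exp = math.expm1(x)

print("log(1+x):", naive_log, " log1p(x):", good_log)
print("exp(x)-1:", naive_exp, " expm1(x):", good_exp)
```

Whether that accuracy matters for 3D workloads, as opposed to numerical computation, is the open question in the paragraph above.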

also, still outstanding: should RECIP-SQRT be its own separate extension 
(with one opcode)?  or is it worthwhile dropping into the Zftrans group, 
alongside RECIP, EXP2 and LOG2?  don't know.


