[libre-riscv-dev] [isa-dev] Re: FP transcendentals (trigonometry, root/exp/log) proposal

Wed Sep 11 01:53:19 BST 2019

Mitch Alsup MitchAlsup at aol.com

-----Original Message-----
From: lkcl <luke.leighton at gmail.com>
To: RISC-V ISA Dev <isa-dev at groups.riscv.org>
Cc: luke.leighton <luke.leighton at gmail.com>; mitchalsup <mitchalsup at aol.com>; libre-riscv-dev <libre-riscv-dev at lists.libre-riscv.org>
Sent: Tue, Sep 10, 2019 1:23 pm
Subject: Re: [isa-dev] Re: FP transcendentals (trigonometry, root/exp/log) proposal

On Tuesday, September 10, 2019 at 5:43:52 PM UTC+1, Allen Baum wrote:
I think identifying which subsets are important for which platform is a first step. From there, you get to identify the cost for that platform (e.g. for low performance requirement platforms, implementations that can use cordic, and the added cost of additional ops after that may be fairly insignificant). That might also be true of Mitch's implementations as well, I don't know. Those costs may vary significantly by platform, i.e. even if the additional cost is just more ROM, that could be significant in a smaller implementation.

yehyeh, got it, agreed.  thanks for the suggestion, will see what i can do.
some of the areas have been discussed already: yes, atanh (etc.) can be synthesised - again, annoyingly, they don't produce correctly-rounded results that way [ASINH( x ) = ln( x + SQRT(x**2+1))] so in full-accuracy or high-performance circumstances, Zfhyp is needed.
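To illustrate the point about synthesised hyperbolics, here is a minimal sketch (not taken from the thread) comparing the textbook identity against a library asinh in double precision. The helper names asinh_synth and rel_error and the sample points are hypothetical, and math.asinh is only assumed to be a good reference, not a correctly-rounded one.

```python
import math

def asinh_synth(x):
    # Textbook identity: asinh(x) = ln(x + sqrt(x**2 + 1)).
    # For small x, x*x + 1 rounds to 1.0 and x + 1.0 loses most of x's bits.
    return math.log(x + math.sqrt(x * x + 1.0))

def rel_error(x):
    # Library asinh used as the reference (assumed accurate to about an ULP).
    ref = math.asinh(x)
    return abs(asinh_synth(x) - ref) / abs(ref)

# Hypothetical sample points: three small arguments and two moderate ones.
samples = [1e-9, 3e-9, 7e-10, 0.5, 2.0]
errors = {x: rel_error(x) for x in samples}

for x, e in errors.items():
    print(f"x={x:g}  relative error={e:.3e}")
```

The small arguments show the cancellation problem most clearly; at moderate arguments the identity is only off by rounding noise, which is still not the same thing as correctly rounded.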
At this point I should suggest that a survey be done on GPU codes (separate from GPGPU codes) to see which transcendentals show up and how often. My guess is that the set {SIN, COS, TAN, ATAN, EXP2, Ln2} covers well over 90% of occurrences.
some implementations (mitch's), they're definitely custom-optimised for 3D: 0.65-0.5 ULP or so, and targeted at FP32 only.  it'd need an entire redesign to target FP64 or even IEEE754.
I know of 32-bit transcendental implementations, some based on the work of Pierno, others based on the work of Matula and Briggs; both of these use large tables (128 entries) and quadratic interpolation. Then there is mine, which uses small tables and cubic interpolation. The FP32 versions can also be used to calculate FP16 versions.
NONE of these can deliver correctly rounded results (in the IEEE 754-2008 sense). The quadratic versions state they give 1 ULP; I have measured them as bad as 1.6 ULP. The GPU std is better than 3 ULP in most cases.
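As a rough illustration of what "128-entry table plus quadratic" and "measured in ULP" mean, here is a sketch for exp2 on [0, 1) with an FP32 result. It is not any of the designs named above: it assumes Taylor coefficients about each interval midpoint (real designs would use minimax fits and fixed-point datapaths), evaluates the polynomial in double precision, and uses hypothetical names throughout.

```python
import math
import struct

TABLE_BITS = 7              # 128 table entries
N = 1 << TABLE_BITS
LN2 = math.log(2.0)

def to_f32(x):
    # Round a Python float (a double) to the nearest FP32 value.
    return struct.unpack("<f", struct.pack("<f", x))[0]

# One (c0, c1, c2) triple per interval: Taylor expansion of 2**x about the
# interval midpoint. This coefficient choice is an assumption of the sketch.
TABLE = []
for i in range(N):
    mid = (i + 0.5) / N
    f = 2.0 ** mid
    TABLE.append((f, f * LN2, f * LN2 * LN2 / 2.0))

def exp2_quadratic(x):
    # Valid for 0 <= x < 1: table lookup, then a quadratic in the offset d.
    i = int(x * N)
    d = x - (i + 0.5) / N
    c0, c1, c2 = TABLE[i]
    return to_f32(c0 + d * (c1 + d * c2))

def ulp32(y):
    # Size of one FP32 ULP at the magnitude of y (normal numbers only).
    exponent = math.frexp(y)[1] - 1      # y = m * 2**exponent, 1 <= m < 2
    return 2.0 ** (exponent - 23)

def ulp_error(x):
    exact = 2.0 ** x                     # double-precision reference
    return abs(exp2_quadratic(x) - exact) / ulp32(exact)

SAMPLES = 20001
worst = max(ulp_error(k / SAMPLES) for k in range(SAMPLES))
print(f"worst observed error: {worst:.3f} ULP")
```

A sampled sweep like this only gives a lower bound on the worst case; an exhaustive FP32 sweep is what would confirm or refute a stated ULP figure.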
Then there is a different technology I have developed that targets FP64 and "faithfully" rounded results. In practice, I am achieving an RMS error of 0.502 for the 4 important functions, and a calculation delay similar to that of FDIV. If one wanted to produce IEEE 754-2008 results, the coefficient table would need to be at least 3× bigger and the calculations about 3×-4× longer, all for 0.002 in accuracy.
whereas CORDIC, although only doing one bit at a time (yes i saw RADIX-4 CORDIC implementations out there), can be adapted to run a few more iterations (perhaps even use microcode to feed back twice), to get better accuracy or even handle both FP32 and FP64 with 2x the completion time on FP64.
The only thing CORDIC has going for it is smallness.
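For reference, the "one bit per iteration" behaviour of CORDIC, and why running more iterations buys more accuracy, can be sketched as below. This is a floating-point model of rotation-mode CORDIC for sin/cos, not a hardware description: a real implementation would use fixed-point shift-and-add with a precomputed arctangent ROM. The function name, the iteration counts (24 and 53) and the test angles are illustrative assumptions.

```python
import math

def cordic_sin_cos(theta, iterations):
    # Rotation-mode CORDIC, valid for |theta| <= pi/2.
    # Each iteration resolves roughly one more bit of the angle.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]

    # Accumulated gain of the pseudo-rotations; pre-divide so the result
    # comes out with unit magnitude.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0        # rotate toward z == 0
        shift = 2.0 ** -i                    # a right shift in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * angles[i]
    return y, x                              # (sin, cos)

def max_sin_error(iterations, thetas=(0.1, 0.4, 0.7, 1.0, 1.3)):
    # Worst absolute sin error over a few hypothetical test angles.
    return max(abs(cordic_sin_cos(t, iterations)[0] - math.sin(t))
               for t in thetas)

for n in (24, 53):
    print(f"{n} iterations: max sin error {max_sin_error(n):.3e}")
```

Going from 24 to 53 iterations is roughly the "2x the completion time" trade-off mentioned above: the same loop, run about twice as long, for a wider result.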
i'll create sections for each.
l.