[libre-riscv-dev] [isa-dev] Re: FP transcendentals (trigonometry, root/exp/log) proposal

Wed Sep 11 03:28:58 BST 2019

On Wednesday, September 11, 2019 at 1:53:26 AM UTC+1, MitchAlsup wrote:

> some of the areas have been discussed already: yes, atanh (etc.) can be 
> synthesised - again, annoyingly, they don't produce correctly-rounded 
> results that way [ASINH( x ) = ln( x + SQRT(x**2+1))] so in full-accuracy 
> or high-performance circumstances, Zfhyp is needed.
>
> At this point I should suggest that a survey be done on GPU codes 
> (separate from GPGPU codes) to see
> which transcendentals show up and how often. My guess is that the set of 
> {SIN, COS, TAN, ATAN, EXP2, Ln2} covers well over the 90th percentile.
>
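
to make the ASINH point above concrete, here is a minimal python sketch (not from the proposal, just an illustration). it compares the synthesised identity against the host libm's math.asinh, which is assumed here to be an accurate enough reference. the names asinh_naive and relative_error are made up for this example. for small x the 1 swamps x in the add, so the synthesised result should lose around half of its significant digits.

```python
import math

def asinh_naive(x):
    # textbook identity: asinh(x) = ln(x + sqrt(x^2 + 1))
    return math.log(x + math.sqrt(x * x + 1.0))

def relative_error(approx, exact):
    # relative difference against the reference value
    return abs(approx - exact) / abs(exact)

if __name__ == "__main__":
    # math.asinh is used as the reference (assumption: libm is accurate here)
    for x in (1e-9, 1e-3, 1.0, 1e3):
        err = relative_error(asinh_naive(x), math.asinh(x))
        print(f"x={x:g}  relative error={err:.3e}")
```

the identity also breaks down for large negative x (x + sqrt(x^2+1) cancels to zero), which is another reason a dedicated opcode or a carefully written library routine is preferable when accuracy matters.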

funnily enough, overnight (really should sleep properly, sigh) i extracted 
the ISA from MALI Midgard and Vivante (etnaviv) and the list was:
    E8 - fatan_pt2
    F0 - frcp (reciprocal)
    F2 - frsqrt (inverse square root, 1/sqrt(x))
    F3 - fsqrt (square root)
    F5 - flog2
    F6 - fsin
    F7 - fcos
    F9 - fatan_pt1

where vivante does not have FATAN/FATAN2, and that's about all.

however these are *mobile* class GPUs, the focus being on reasonable 
performance, battery life, and efficiency.  at the lower end you're lucky 
if they can do 720p @ 30fps and that's fine, because they only use around 
0.5 watts (not 30 watts, not 150 watts: *0.5* watts).

i just found the AMD R600 ISA doc, R600_Instruction_Set_Architecture.pdf, 
which lists:
    COS (appx)
    EXP2
    LOG (IEEE754)
    RECIP
    RSQRT
    SQRT
    SIN (appx)

so pretty much the same except no TAN or ATAN, and they have appx variants 
and IEEE754 variants on some of the ops.

> some implementations (mitch's), they're definitely custom-optimised for 
> 3D: 0.65-0.5 ULP or so, and targeted at FP32 only.  it'd need an entire 
> redesign to target FP64 or even IEEE754.
>
> I know of 32-bit transcendental implementations, some based on the work of 
> Pierno, others based on the
> work of Matula and Briggs, both of these are large table entry (128) and 
> Quadratic, and then there is mine 
> which is small tables and cubic. The FP32 versions can also be used to 
> calculate FP16 versions. 
>
> NONE of these can deliver correctly rounded results (in the IEEE 754-2008 
> sense); the quadratic versions state they 
> give 1 ULP, but I have measured them as bad as 1.6 ULP. The GPU std is 
> better than 3 ULP in most cases.
>
> Then there is a different technology I have developed that targets FP64 
> and targets "faithfully" rounded 
> results--in practice, I am achieving an RMS error of 0.502 for the 4 
> important functions, and a calculation 
> delay similar to that of FDIV. If one wanted to produce IEEE 754-2008 
> results, the coefficient table 
> should be at least 3× bigger and the calculations about 3×-4× longer--all 
> for 0.002 in accuracy.
>

which is fine for Numerical Computation, not fine for a 
commercially-competitive GPU.
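
as a rough illustration of the "table plus polynomial" style of implementation described in the quote above, here is a python sketch of a 128-entry table with quadratic interpolation for log2 on [1, 2). to be clear about the assumptions: this is NOT any of the cited designs. the coefficients are plain taylor coefficients about each segment midpoint (a real design would use minimax-fitted, fixed-point coefficients), everything runs in double precision, and the names SEGMENTS, TABLE, log2_mantissa and max_abs_error are invented for the example.

```python
import math

SEGMENTS = 128            # table size mentioned in the thread
LN2 = math.log(2.0)

def build_table():
    # one (midpoint, c0, c1, c2) entry per segment of [1, 2)
    table = []
    for k in range(SEGMENTS):
        m = 1.0 + (k + 0.5) / SEGMENTS      # segment midpoint
        c0 = math.log2(m)                   # f(m)
        c1 = 1.0 / (m * LN2)                # f'(m)
        c2 = -1.0 / (2.0 * m * m * LN2)     # f''(m) / 2
        table.append((m, c0, c1, c2))
    return table

TABLE = build_table()

def log2_mantissa(x):
    # approximate log2(x) for 1 <= x < 2 with a quadratic per segment
    k = int((x - 1.0) * SEGMENTS)           # segment index from the top bits
    m, c0, c1, c2 = TABLE[k]
    d = x - m
    return c0 + d * (c1 + d * c2)           # horner evaluation

def max_abs_error(samples=100000):
    # worst absolute error over evenly spaced samples in [1, 2)
    worst = 0.0
    for i in range(samples):
        x = 1.0 + i / samples
        worst = max(worst, abs(log2_mantissa(x) - math.log2(x)))
    return worst

if __name__ == "__main__":
    print(f"max abs error: {max_abs_error():.3e}")
```

the taylor remainder bounds the error of this sketch at roughly 2.9e-8. going from "close" to "correctly rounded" is where the table size and calculation time balloon, which is the trade-off being discussed.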

>
> whereas CORDIC, although only doing one bit at a time (yes i saw RADIX-4 
> CORDIC implementations out there), can be adapted to run a few more 
> iterations (perhaps even use microcode to feed back twice), to get better 
> accuracy or even handle both FP32 and FP64 with 2x the completion time on 
> FP64.
>
> The only thing CORDIC has going for it is smallness.
>

making it perfect for the ultra-low-power use-case of e.g. smartwatches, 
where the power budget for the GPU is measured in milliwatts.
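
for anyone unfamiliar with CORDIC, here is a minimal rotation-mode sketch in python showing the "run more iterations, get more bits" property mentioned above. assumptions: it uses floating-point multiplies by 2^-i purely for readability (hardware would use fixed-point shifts and adds, with the atan constants in a small ROM), it only handles angles within roughly +/- pi/2, and the name cordic_sin_cos is made up for this example.

```python
import math

def cordic_sin_cos(theta, iterations):
    # rotation-mode CORDIC: returns (sin(theta), cos(theta)), |theta| <= pi/2
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # accumulated gain of the un-normalised micro-rotations
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    # start with the gain pre-compensated so the result has unit length
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0     # rotate towards z == 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x

if __name__ == "__main__":
    theta = 0.5
    for n in (12, 24, 48):
        s, c = cordic_sin_cos(theta, n)
        print(f"{n:2d} iterations: sin error={abs(s - math.sin(theta)):.3e}")
```

the residual angle after n iterations is at most atan(2^-(n-1)), so each extra iteration buys roughly one more bit. that is why the same datapath can in principle serve FP32 and FP64 by simply iterating for longer.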

this exercise allowed me to do a couple of things:

(1) clearly identify the different markets

* ultra-low-power - smartwatches.
* mobile/embedded GPUs and low-end desktop GPUs
* High Performance Computing (and "Gaming" GPUs).

(2) from that, notice that:

* Zftrans, Zftrigpi and Zftrignpi basically cover mobile/embedded 3D near 
perfectly

* LOG1P and EXPM1 [LOG(e, 1+rs1), POW(e, rs1)-1] should have been moved to 
ZftransExt (alongside EXP10, LOG10, LN etc.)

apart from that, the prior review we (collectively) did a few weeks back 
pretty much called it.  i "guessed" that LOG1P and EXPM1 should go into 
Zftrans - should we "listen" to what commercial (custom-optimised) GPUs 
do?  don't know.

also, still outstanding: should RECIP-SQRT be its own separate extension 
(with one opcode)?  or is it worthwhile dropping into the Zftrans group, 
alongside RECIP, EXP2 and LOG2?  don't know.

l.
