[libre-riscv-dev] [isa-dev] Re: FP transcendentals (trigonometry, root/exp/log) proposal

MitchAlsup MitchAlsup at aol.com
Thu Aug 8 02:17:37 BST 2019


An old guy at IBM (a Fellow) made a long and impassioned plea in a paper 
from the late 1970s or early 1980s arguing that whenever something is put 
"into the instruction set", the result should be as accurate as possible. 
Look it up, it's a good read.

At the time I was working for a mini-computer company where a new 
implementation was not giving binary-accurate results compared to an older 
generation. This was traced to an "enhancement" in the F32 and F64 accuracy 
of the new implementation. To a customer, binary equivalence was what 
mattered, even if the math was worse.

On the other hand, back when I started doing this (CPU design), the guys 
using floating point just wanted speed, and they were willing to put up with 
not only IBM floating point (hex normalization, and guard digit) but even 
CRAY floating point (CDC 6600, CDC 7600, CRAY 1), which was demonstrably 
WORSE in the numerics department.

In any event: to all but 5 floating point guys in the world, a rounding 
error (compared to the correctly rounded result) occurring less often than 
3% of the time and no larger than 1 ULP is as accurate as they need (caveat: 
so long as the arithmetic is repeatable). As witness, the FDIV <lack of> 
instruction in ITANIC had a 0.502 ULP accuracy (Markstein) and nobody 
complained.
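For anyone unfamiliar with the ULP vocabulary used above, here is a minimal 
Python sketch (illustrative only, not from the thread; the `ulp_error` helper 
is mine) of how error-in-ULPs is measured against the exact value:

```python
import math
from fractions import Fraction

def ulp_error(computed: float, exact: Fraction) -> Fraction:
    """Distance between a computed float and the exact value, measured
    in units in the last place (ULP) at the computed result's magnitude."""
    return abs(Fraction(computed) - exact) / Fraction(math.ulp(computed))

# A correctly rounded result is, by definition, within 0.5 ULP of the
# exact value; "faithful rounding" only guarantees strictly under 1 ULP.
err = ulp_error(1 / 3, Fraction(1, 3))
assert err <= Fraction(1, 2)  # the parser rounds 1/3 correctly
```

Under this measure, "0.502 ULP" means the result is almost always the 
correctly rounded one, and at worst barely past it.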

My gut feeling tells me that the numericalists are perfectly willing to 
accept an error of 0.51 ULP RMS on transcendental calculations.
My gut feeling tells me that the numericalists are not willing to accept an 
error of 0.75 ULP RMS on transcendental calculations.
I have no feeling at all on where to draw the line.
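As a back-of-the-envelope sketch (mine, not from the thread) of why the 
huge-argument COS example quoted further down needs Payne and Hanek 
argument reduction: the double-precision value of 2π is off by roughly 
2.4×10^-16, and over the ~10^254 periods contained in 6381956970095103×2^797 
that tiny error accumulates to vastly more than a full period, so naive 
reduction against a rounded 2π returns pure noise:

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 60
# pi to well past double precision (a fixed, well-known constant)
PI = Decimal("3.14159265358979323846264338327950288419716939937510582097494")

# The "silly argument" from the thread: 6381956970095103 * 2^797
x = Decimal(6381956970095103) * Decimal(2) ** 797

# How far the double-precision 2*pi sits from the true 2*pi:
two_pi_err = abs(Decimal(2 * math.pi) - 2 * PI)

# Periods of 2*pi contained in x, and the phase error accumulated if
# each period is reduced with the rounded 2*pi instead of the true one:
periods = x / (2 * PI)
accumulated = two_pi_err * periods

# The accumulated error dwarfs a full period, so the naively reduced
# argument carries no information about the true phase of x.
assert two_pi_err < Decimal("3e-16")
assert accumulated > 2 * PI
```

Payne-Hanek reduction avoids this by effectively using enough bits of 2/π 
to cover the argument's full exponent range.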

On Wednesday, August 7, 2019 at 7:57:38 PM UTC-5, lkcl wrote:
>
>
>
> On Thursday, August 8, 2019 at 1:29:29 AM UTC+1, MitchAlsup wrote:
>>
>>
>>
>> On Wednesday, August 7, 2019 at 6:43:21 PM UTC-5, Jacob Lifshay wrote:
>>>
>>> On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <
>>> isa... at groups.riscv.org> wrote:
>>>
>>>> Is this proposal going to <eventually> include::
>>>>
>>>> a) statement on required/delivered numeric accuracy per transcendental ?
>>>>
>>> From what I understand, they require correctly rounded results. We 
>>> should eventually state that somewhere. The requirement for correctly 
>>> rounded results is so the instructions can replace the corresponding 
>>> functions in libm (they're not just for GPUs) and for reproducibility 
>>> across implementations.
>>>
>>
>> Correctly rounded results will require significantly more difficult 
>> hardware and more cycles of execution.
>> Standard GPUs today allow 1-2 bits of ULP error for simple transcendentals 
>> and 3-4 bits for some of the harder functions.
>> Standard GPUs today are producing fully pipelined results with 5-cycle 
>> latency for F32 (with 1-4 bits of imprecision).
>> Based on my knowledge of the situation, requiring IEEE 754 correct 
>> rounding will double the area of the transcendental unit, triple the area 
>> used for coefficients, and come close to doubling the latency.
>>
>
> hmmm... i don't know what to suggest / recommend here.  there are two 
> separate use-cases: accuracy (OpenCL, numerical scenarios), and 3D GPUs, 
> where better accuracy is not essential.
>
> i would be tempted to say that it was reasonable to suggest that if you're 
> going to use FP32, expectations are lower so "what the heck".  however i 
> have absolutely *no* idea what the industry consensus is, here.
>
> i do know that you've an enormous amount of expertise and experience in 
> the 3D GPU area, Mitch.
>
> I can point you at (and have) the technology to perform most of these to 
>> the accuracy stated above in 5 cycles F32.
>>
>> I have the technology to perform LN2P1, EXP1M in 14 cycles, SIN, COS 
>> including argument reduction in 19 cycles, and POW in 34 cycles, while 
>> achieving "faithful rounding" of the result in any of the IEEE 754-2008 
>> rounding modes, using a floating point unit essentially the same size as 
>> an FMAC unit that can also do FDIV and FSQRT. SIN and COS have full Payne 
>> and Hanek argument reduction, which costs 4 cycles and allows "silly" 
>> arguments to be properly processed:: COS( 6381956970095103×2^797 ) = 
>> -4.68716592425462761112×10^-19 
>>
>
> yes please.  
>
> there will be other implementors of this Standard that will want to make a 
> different call on which direction to go.
>
> l.
>
>>


More information about the libre-riscv-dev mailing list