[libre-riscv-dev] [isa-dev] Re: FP transcendentals (trigonometry, root/exp/log) proposal

Thu Aug 8 01:57:38 BST 2019

On Thursday, August 8, 2019 at 1:29:29 AM UTC+1, MitchAlsup wrote:
>
>
>
> On Wednesday, August 7, 2019 at 6:43:21 PM UTC-5, Jacob Lifshay wrote:
>>
>> On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <
>> isa... at groups.riscv.org> wrote:
>>
>>> Is this proposal going to <eventually> include::
>>>
>>> a) statement on required/delivered numeric accuracy per transcendental ?
>>>
>> From what I understand, they require correctly rounded results. We should 
>> eventually state that somewhere. The requirement for correctly rounded 
>> results is so the instructions can replace the corresponding functions in 
>> libm (they're not just for GPUs) and for reproducibility across 
>> implementations.
>>
>
> Correctly rounded results will require a lot more difficult hardware and 
> more cycles of execution.
> Standard GPUs today use 1-2 bits ULP for simple transcendentals and 3-4 
> bits for some of the harder functions.
> Standard GPUs today are producing fully pipelined results with 5 cycle 
> latency for F32 (with 1-4 bits of imprecision)
> Based on my knowledge of the situation, requiring IEEE 754 correct 
> rounding will double the area of the transcendental unit, triple the area 
> used for coefficients, and come close to doubling the latency.
>
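
To make the accuracy terms in the quoted text concrete, here is a minimal sketch (not from the thread) of measuring an F32 result's error in ULPs. It assumes the double-precision `math.sin` is accurate enough to stand in for the exact value, and the helper names `to_f32`, `ulp_f32`, `ulp_error` and `classify` are hypothetical. Under round-to-nearest, a correctly rounded result is off by at most 0.5 ULP; a faithfully rounded result is one of the two neighbouring floats, so it is off by less than 1 ULP.

```python
import math
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def ulp_f32(x):
    """Spacing of binary32 values around x (x is assumed finite)."""
    if x == 0.0:
        return 2.0 ** -149                # smallest binary32 subnormal
    _, e = math.frexp(abs(x))             # abs(x) = m * 2**e with 0.5 <= m < 1
    return 2.0 ** max(e - 24, -149)       # 24-bit significand, clamped for subnormals

def ulp_error(approx, reference):
    """Error of approx, in binary32 ULPs measured at the reference value."""
    return abs(approx - reference) / ulp_f32(reference)

def classify(err):
    if err <= 0.5:
        return "correctly rounded (round-to-nearest)"
    if err < 1.0:
        return "faithfully rounded"
    return "off by %.2f ULP" % err

# Assumption: the double result is treated as the "exact" reference.
reference = math.sin(1.0)
nearest = to_f32(reference)               # the correctly rounded F32 result
step = ulp_f32(reference)
# The binary32 neighbour on the other side of the reference value.
other = nearest + step if nearest < reference else nearest - step

for name, value in (("nearest", nearest), ("other neighbour", other)):
    err = ulp_error(value, reference)
    print(name, err, classify(err))
```

The "1-2 bits" and "3-4 bits" figures quoted above would show up here as errors of several ULPs rather than fractions of one.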

hmmm... i don't know what to suggest / recommend here.  there are two 
separate use-cases: numerical work (OpenCL, scientific scenarios), where 
accuracy matters, and 3D GPUs, where better accuracy is not essential.

i would be tempted to say that if you're going to use FP32, expectations 
are lower anyway, so "what the heck".  however i have absolutely *no* idea 
what the industry consensus is, here.

i do know that you've an enormous amount of expertise and experience in the 
3D GPU area, Mitch.

> I can point you at (and have) the technology to perform most of these to 
> the accuracy stated above in 5 cycles F32.
>
> I have the technology to perform LN2P1, EXP1M in 14 cycles, SIN, COS 
> including argument reduction in 19 cycles, POW in 34 cycles while achieving 
> "faithful rounding" of the result in any of the IEEE 754-2008 rounding 
> modes and using a floating point unit essentially the same size as an FMAC 
> unit that can also do FDIV and FSQRT. SIN and COS have full Payne and Hanek 
> argument reduction, which costs 4 cycles and allows "silly" arguments to 
> be properly processed:: COS(6381956970095103×2^797) = 
> -4.68716592425462761112×10^-19
>
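
As an illustration of why such "silly" arguments need many bits of pi, here is a minimal sketch (not from the thread, and not Payne-Hanek itself) that reduces the quoted argument modulo pi/2 by brute force with Python big integers. The function names and the choice of 1200 fractional bits with 64 guard bits are assumptions made for this example. A hardware Payne-Hanek unit would instead multiply by only the relevant window of a stored 2/pi bit table.

```python
import math

BITS = 1200    # fractional bits of pi kept (assumed ample for a 2^850-sized input)
GUARD = 64     # extra working bits to absorb truncation in the series

def arctan_inv(n, one):
    """arctan(1/n) scaled by `one`, via the alternating Taylor series."""
    power = one // n              # (1/n)^(2j+1), scaled
    total = power
    n2 = n * n
    k = 1
    sign = -1
    while power:
        power //= n2
        k += 2
        total += sign * (power // k)
        sign = -sign
    return total

def pi_fixed(bits):
    """pi * 2**bits as an integer, using Machin's formula."""
    one = 1 << (bits + GUARD)
    pi = 16 * arctan_inv(5, one) - 4 * arctan_inv(239, one)
    return pi >> GUARD

def cos_huge(mantissa, exponent):
    """cos(mantissa * 2**exponent) for non-negative integer inputs."""
    x = mantissa << exponent                 # exact integer value of the argument
    half_pi = pi_fixed(BITS) >> 1            # (pi/2) * 2**BITS
    k, r = divmod(x << BITS, half_pi)        # x = k*(pi/2) + r, with 0 <= r < pi/2
    if 2 * r > half_pi:                      # recentre r into [-pi/4, pi/4]
        r -= half_pi
        k += 1
    r_float = r / (1 << BITS)                # small remainder, now safe as a double
    quadrant = k % 4
    # cos(k*pi/2 + r) for k mod 4 = 0, 1, 2, 3
    return (math.cos(r_float), -math.sin(r_float),
            -math.cos(r_float), math.sin(r_float))[quadrant]

# The argument quoted above; the result should be close to the quoted figure.
print(cos_huge(6381956970095103, 797))
```

The argument is close to 2^850 and lies within roughly 2^-61 of an odd multiple of pi/2, so a reduction that carries only double-precision pi loses every significant bit of the remainder.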

yes please.  

there will be other implementors of this Standard that will want to make a 
different call on which direction to go.

l.
