[libre-riscv-dev] [isa-dev] Re: FP transcendentals (trigonometry, root/exp/log) proposal

MitchAlsup MitchAlsup at aol.com
Thu Aug 8 01:29:29 BST 2019



On Wednesday, August 7, 2019 at 6:43:21 PM UTC-5, Jacob Lifshay wrote:
>
> On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <
> isa... at groups.riscv.org> wrote:
>
>> Is this proposal going to <eventually> include::
>>
>> a) statement on required/delivered numeric accuracy per transcendental ?
>>
> From what I understand, they require correctly rounded results. We should 
> eventually state that somewhere. The requirement for correctly rounded 
> results is so the instructions can replace the corresponding functions in 
> libm (they're not just for GPUs) and for reproducibility across 
> implementations.
>

Correctly rounded results will require considerably more difficult hardware 
and more cycles of execution.
Standard GPUs today allow 1-2 ULP of error for the simple transcendentals 
and 3-4 ULP for some of the harder functions, and they produce fully 
pipelined F32 results with 5-cycle latency (carrying those 1-4 bits of 
imprecision).
Based on my knowledge of the situation, requiring IEEE 754 correct rounding 
will double the area of the transcendental unit, triple the area used for 
coefficients, and come close to doubling the latency.
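
For reference, accuracy claims like "1-2 ULP" are normally checked by 
measuring the distance in representable values between the computed result 
and a correctly rounded reference. A minimal C sketch of that measurement 
(the bit trick assumes IEEE 754 binary32 and finite, non-NaN inputs):

    #include <stdint.h>
    #include <string.h>

    /* Map the binary32 sign-magnitude encoding onto a monotonic integer
       line, so subtracting two mapped values counts the representable
       floats between them. */
    static int64_t ordered(float f)
    {
        int32_t i;
        memcpy(&i, &f, sizeof i);
        return (i >= 0) ? (int64_t)i : (int64_t)INT32_MIN - i;
    }

    /* ULP distance: 0 means correctly rounded; 1 means the result is
       one of the neighbours of the correctly rounded value. */
    static int64_t ulp_dist(float got, float want)
    {
        int64_t d = ordered(got) - ordered(want);
        return d < 0 ? -d : d;
    }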

>
> b) a reserve on the OpCode space for the double precision equivalents ?
>>
> the 2 bits right below the funct5 field select from:
> 00: f32
> 01: f64
> 10: f16
> 11: f128
>
> so f64 is definitely included.
>
> see https://libre-riscv.org/rv_major_opcode_1010011/#index2h1
> see table 11.3 in Volume I: RISC-V Unprivileged ISA V20190608-Base-Ratified
>
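
As a sketch of how that fmt selection decodes (bit positions assumed from 
the standard RISC-V OP-FP layout, where fmt sits in bits 26:25 just below 
funct5):

    #include <stdint.h>

    enum fp_fmt { FMT_F32 = 0, FMT_F64 = 1, FMT_F16 = 2, FMT_F128 = 3 };

    /* fmt field: instruction bits 26:25, immediately below funct5. */
    static enum fp_fmt decode_fmt(uint32_t insn)
    {
        return (enum fp_fmt)((insn >> 25) & 0x3);
    }
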
> it would probably be a good idea to split the transcendental extensions 
> into separate f32, f64, f16, and f128 extensions, since some 
> implementations may want to implement them only for f32 while still 
> implementing the D (f64 arithmetic) extension.
>
> c) a statement on <approximate> execution time ?
>>
> that would be microarchitecture specific. since this is supposed to be an 
> inter-vendor (if that's the right term) specification, that would be up to 
> the implementers. I would assume that they are at least faster than a 
> soft-float implementation (since that's usually the whole point of 
> implementing them).
>
> For our implementation, I'd imagine something between 8 and 40 clock 
> cycles for most of the operations. sin, cos, and tan (but not sinpi and 
> friends) may require much more than that for large inputs, because range 
> reduction must accurately calculate x mod 2*pi; that is why we are thinking 
> of implementing sinpi, cospi, and tanpi instead (they only require 
> calculating x mod 2, which is much faster and simpler).
>
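
A libm-level sketch of why the pi-scaled forms are cheaper (sinpi_ref is a 
hypothetical name, not part of the proposal): fmod by 2.0 is exact in 
binary floating point because 2.0 is a power of two, so no Payne-Hanek-style 
reduction is needed.

    #include <math.h>

    /* sinpi(x) = sin(pi * x).  The reduction x mod 2 is exact, unlike
       x mod 2*pi, which needs hundreds of bits of pi for large x. */
    double sinpi_ref(double x)
    {
        double r = fmod(x, 2.0);   /* exact; r lies in (-2, 2) */
        return sin(M_PI * r);      /* remaining error: rounding of pi*r
                                      and of sin itself */
    }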

I can point you at (and have) the technology to perform most of these to 
the accuracy stated above in 5 cycles for F32.

I have the technology to perform LOGP1 and EXPM1 in 14 cycles, SIN and COS 
including argument reduction in 19 cycles, and POW in 34 cycles, while 
achieving "faithful rounding" of the result in any of the IEEE 754-2008 
rounding modes, using a floating point unit essentially the same size as an 
FMAC unit that can also do FDIV and FSQRT. SIN and COS have full Payne and 
Hanek argument reduction, which costs 4 cycles and allows "silly" arguments 
to be properly processed:: COS( 6381956970095103×2^797 ) = 
-4.68716592425462761112×10^-19
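
That quoted value can be checked with arbitrary-precision arithmetic; a 
quick sketch using GNU MPFR (assumes a 64-bit unsigned long):

    #include <mpfr.h>

    int main(void)
    {
        mpfr_t x, y;
        mpfr_inits2(256, x, y, (mpfr_ptr) 0);  /* 256-bit working precision */
        /* x = 6381956970095103 * 2^797 */
        mpfr_set_ui_2exp(x, 6381956970095103UL, 797, MPFR_RNDN);
        mpfr_cos(y, x, MPFR_RNDN);
        mpfr_printf("cos(x) = %.20Re\n", y);   /* approx -4.687e-19 */
        mpfr_clears(x, y, (mpfr_ptr) 0);
        return 0;
    }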

Faithful rounding is not IEEE 754 correct. The unit I have designed makes 
an IEEE rounding error about once every 171 calculations.
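
In terms of the ulp_dist() sketch earlier, correct rounding demands a 
distance of 0 on every input, while faithful rounding permits a distance 
of 1; a necessary-condition check (a full faithfulness test needs the two 
floats bracketing the exact value):

    #include <stdbool.h>

    /* 'want' is the correctly rounded reference result. */
    bool is_correct(float got, float want)  { return ulp_dist(got, want) == 0; }
    bool is_faithful(float got, float want) { return ulp_dist(got, want) <= 1; }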

>
> You may have more transcendentals than necessary::
>> 1) for example all of the inverse hyperbolic can be calculated to 
>> GRAPHICs numeric quality with short sequences of already existing 
>> transcendentals
>> ..... ASINH( x ) = ln( x + SQRT(x**2+1) )
>>
> That's why the hyperbolics extension is split out into a separate 
> extension. Also, a single instruction may be much faster, since it can 
> calculate it all as one operation (CORDIC will work) rather than requiring 
> several slow operations: sqrt, div, and log.
>
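
The ASINH identity quoted above, as a libm-level sketch (asinh_via_log is a 
hypothetical name; cancellation for negative x and overflow of x*x for very 
large |x| limit it to the graphics-grade accuracy Mitch describes):

    #include <math.h>

    /* asinh(x) = ln(x + sqrt(x*x + 1)) */
    double asinh_via_log(double x)
    {
        return log(x + sqrt(x * x + 1.0));
    }
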
> 2) LOG(x) = LOGP1(x - 1.0)
>> ... EXP(x) = EXPM1(x) + 1.0
>>
>> That is:: LOGP1 and EXPM1 provide greater precision (especially when the 
>> result is near zero) than their sister functions, and the compiler can 
>> easily add the additional instruction to the instruction stream where 
>> appropriate.
>>
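
For concreteness, those identities in standard C libm terms, where log1p 
and expm1 are the C names for LOGP1 and EXPM1:

    #include <math.h>

    /* log1p(y) = ln(1 + y)   =>   ln(x) = log1p(x - 1.0) */
    double log_via_log1p(double x) { return log1p(x - 1.0); }

    /* expm1(y) = e^y - 1     =>   e^x  = expm1(x) + 1.0 */
    double exp_via_expm1(double x) { return expm1(x) + 1.0; }
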
> for the implementation techniques I know for log/exp, implementing both 
> log/exp and logp1/expm1 is only a slight increase in complexity compared 
> to implementing just one or the other (it means changing constants in 
> polynomial/LUT-based implementations and in CORDIC). I think the extra 
> instructions are worth keeping for the common case of implementing pow 
> (where you need log/exp), and logp1/expm1 are not worth dropping given the 
> small additional cost and the additional accuracy they provide.
>
> Jacob Lifshay
>

