[libre-riscv-dev] [isa-dev] Re: FP transcendentals (trigonometry, root/exp/log) proposal
luke.leighton at gmail.com
Thu Aug 8 01:57:38 BST 2019
On Thursday, August 8, 2019 at 1:29:29 AM UTC+1, MitchAlsup wrote:
> On Wednesday, August 7, 2019 at 6:43:21 PM UTC-5, Jacob Lifshay wrote:
>> On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <
>> isa... at groups.riscv.org> wrote:
>>> Is this proposal going to <eventually> include::
>>> a) statement on required/delivered numeric accuracy per transcendental ?
>> From what I understand, they require correctly rounded results. We should
>> eventually state that somewhere. The requirement for correctly rounded
>> results is so the instructions can replace the corresponding functions in
>> libm (they're not just for GPUs) and for reproducibility across
> Correctly rounded results will require a lot more difficult hardware and
> more cycles of execution.
> Standard GPUs today use 1-2 bits ULP for simple transcendentals and 3-4
> bits for some of the harder functions.
> Standard GPUs today are producing fully pipelined results with 5 cycle
> latency for F32 (with 1-4 bits of imprecision)
> Based on my knowledge of the situation, requiring IEEE 754 correct
> rounding will double the area of the transcendental unit, triple the area
> used for coefficients, and come close to doubling the latency.
hmmm... i don't know what to suggest / recommend here. there are two
separate use-cases: numerical work (OpenCL, numerical scenarios), where
accuracy matters, and 3D GPUs, where better accuracy is not essential.
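to make the "bits of ULP" figures quoted above concrete, here is a minimal sketch (Python, standard library only) of how the error of an FP32 result can be measured in ULPs against a higher-precision reference. the helper names (to_f32, ulp_f32, ulp_error, classify) are made up for illustration, and the binary64 math.sin result is only assumed to be accurate enough to act as the reference. the 0.5 ULP and 1 ULP thresholds are the usual bounds for correct (round-to-nearest) and faithful rounding.

```python
import math
import struct

def to_f32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def ulp_f32(x: float) -> float:
    """Spacing of binary32 values in the binade containing x."""
    if x == 0.0:
        return math.ldexp(1.0, -149)       # smallest binary32 subnormal
    _, e = math.frexp(abs(x))              # abs(x) = m * 2**e, 0.5 <= m < 1
    return math.ldexp(1.0, max(e - 24, -149))

def ulp_error(approx: float, exact: float) -> float:
    """Error of approx relative to exact, measured in binary32 ULPs."""
    return abs(approx - exact) / ulp_f32(exact)

def classify(approx: float, exact: float) -> str:
    """Label a binary32 result by the accuracy bound it meets."""
    err = ulp_error(approx, exact)
    if err <= 0.5:
        return "within 0.5 ULP (bound for correct round-to-nearest)"
    if err < 1.0:
        return "within 1 ULP (bound for faithful rounding)"
    return "error of %.2f ULP" % err

x = 0.5
reference = math.sin(x)                            # binary64 reference (assumed accurate)
rounded = to_f32(reference)                        # reference rounded to binary32
crude = to_f32(x - x**3 / 6 + x**5 / 120)          # truncated Taylor series, for contrast
print(classify(rounded, reference))
print(classify(crude, reference))
```

sweeping x over every FP32 input (only 2^32 of them) is how a hardware unit's worst-case ULP error would actually be established.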
i would be tempted to say that it was reasonable to suggest that if you're
going to use FP32, expectations are lower so "what the heck". however i
have absolutely *no* idea what the industry consensus is, here.
i do know that you've an enormous amount of expertise and experience in the
3D GPU area, Mitch.
> I can point you at (and have) the technology to perform most of these to
> the accuracy stated above in 5 cycles F32.
> I have the technology to perform LN2P1, EXP1M in 14 cycles, SIN, COS
> including argument reduction in 19 cycles, POW in 34 cycles while achieving
> "faithful rounding" of the result in any of the IEEE 754-2008 rounding
> modes and using a floating point unit essentially the same size as an FMAC
> unit that can also do FDIV and FSQRT. SIN and COS have full Payne and Hanek
> argument reduction, which costs 4 cycles and allows for "silly arguments" to
> be properly processed:: COS( 6381956970095103×2^797) =
there will be other implementors of this Standard that will want to make a
different call on which direction to go.
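on the argument reduction mentioned in the quoted text: the sketch below (Python, standard library only) shows why reducing a huge argument needs far more bits of pi than a double holds. it is *not* Payne and Hanek's algorithm, which works from a table of bits of 2/pi and multiplies only the relevant window. it is a brute-force stand-in that computes pi to enough bits with Machin's formula and reduces exactly with rationals. the function names and the 1e22 test argument are choices made for this illustration.

```python
import math
from fractions import Fraction

def arctan_inv(x: int, one: int) -> int:
    """Approximate arctan(1/x) * one with the alternating Taylor series."""
    power = one // x              # (1/x)**1, scaled by `one`
    total = power
    x2 = x * x
    k = 3
    sign = -1
    while power:
        power //= x2              # next odd power of 1/x
        total += sign * (power // k)
        sign = -sign
        k += 2
    return total

def pi_fraction(bits: int) -> Fraction:
    """pi as a rational, good to roughly `bits` fractional bits (Machin's formula)."""
    guard = 64                    # extra bits to absorb truncation error
    one = 1 << (bits + guard)
    pi_scaled = 4 * (4 * arctan_inv(5, one) - arctan_inv(239, one))
    return Fraction(pi_scaled, one)

def reduce_exact(x: float) -> float:
    """Reduce x into [0, 2*pi) using as many bits of pi as x's exponent demands."""
    _, e = math.frexp(x)
    two_pi = 2 * pi_fraction(max(e, 0) + 128)
    xf = Fraction(x)              # exact value of the double
    k = xf // two_pi              # number of whole turns (floor)
    return float(xf - k * two_pi)

x = 1e22                          # exactly representable as a double
naive = math.fmod(x, 2 * math.pi) # uses only the 53 bits of pi a double holds
print("naive reduction:", naive, "sin:", math.sin(naive))
print("exact reduction:", reduce_exact(x), "sin:", math.sin(reduce_exact(x)))
```

the number of bits of pi needed grows with the exponent of the argument, which is why a hardware implementation keeps a long table of 2/pi bits and selects a window from it.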