[libre-riscv-dev] [isa-dev] Re: FP transcendentals (trigonometry, root/exp/log) proposal

Thu Aug 8 08:50:17 BST 2019

On Thursday, August 8, 2019 at 3:11:01 PM UTC+8, Jacob Lifshay wrote:
> On Wed, Aug 7, 2019, 23:30 Andrew Waterman <wate... at eecs.berkeley.edu> wrote:
> 
> Hi folks,
> 
> 
> We would seem to be putting the cart before the horse.  ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative.  It does not make sense to allocate opcode space under these circumstances.
> 
> 
> 
> Since there are ways to implement transcendental functions in HW that are faster than anything possible in SW (I think Mitch mentioned a 5-cycle sin implementation),

https://patents.google.com/patent/US9471305B2/en

This is really cool. Like CORDIC, it covers a huge range of operations. Mitch described it in the R-Sqrt thread.

> I would argue that having instructions for them is beneficial, and, since they would be useful on a large number of different implementations (GPUs, HPC, bigger desktop/server processors), it's worth standardizing the instructions, since otherwise the custom opcodes used for them would become effectively standardized (as mentioned by Luke) and no longer useful as custom opcodes on implementations that need fast transcendental functions.

If we were talking about an embedded-only product, or a co-processor whose firmware requires hard-forked or specialist dedicated compilers (like how NVIDIA and AMD do it), we would neither be having this discussion publicly nor putting forward a common Zftrans / Ztrig* spec.

This proposal is for *multiple* use cases *including* hybrid CPU/GPU, low power embedded specialist 3D, *and* standard UNIX (GNU libm).

When I talked with Atif from Pixilica a few days ago, he relayed to me the responses he got at the SIGGRAPH 2019 BoF:

https://www.pixilica.com/forum/event/risc-v-graphical-isa-at-siggraph-2019/p-1/dl-5d4322170924340017bfeeab

The attendance was *50* people at the BoF! He was expecting maybe two or three :) Some 3D engineers were doing transparent polygons, which require checking the hits from both sides. Using *proprietary* GPUs they take a 100% performance penalty, as it is a 2-pass operation.

Others have non-standard projection surfaces (spherical, not flat). No *way* proprietary hardware/software is going to cope with that.

Think Silicon has some stringent low power requirements for their embedded GPUs.

Machine Learning has another set of accuracy requirements (way laxer), where Jacob, I think, mentioned that atan in FP16 can be adequately implemented with a single-cycle lookup table (something like that).
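
To make the lookup-table idea concrete, here is a minimal software sketch in Python. It is an illustration only, not anything from the proposal or from an actual hardware design: the table size (TABLE_BITS), the nearest-entry indexing, and the reciprocal-based range reduction for |x| > 1 are all assumptions chosen for clarity. With 2**11 intervals on [0, 1] the worst-case absolute error is about 2**-12, which is in the region of FP16 spacing for results near 1. Results close to zero, where FP16 is finer-grained, would need more care, and NaN inputs are not handled.

```python
import math

# Hypothetical table size: 2**11 intervals on [0, 1], so 2**11 + 1 entries.
TABLE_BITS = 11
TABLE_SIZE = 1 << TABLE_BITS

# Precomputed atan values at evenly spaced points i / TABLE_SIZE in [0, 1].
ATAN_TABLE = [math.atan(i / TABLE_SIZE) for i in range(TABLE_SIZE + 1)]


def atan_lut(x: float) -> float:
    """Approximate atan(x) by picking the nearest precomputed table entry."""
    sign = -1.0 if x < 0 else 1.0
    a = abs(x)
    if a > 1.0:
        # Range reduction (an assumption of this sketch):
        # atan(a) = pi/2 - atan(1/a) for a > 0, and 1/a lies in [0, 1).
        index = round(TABLE_SIZE / a)
        return sign * (math.pi / 2 - ATAN_TABLE[index])
    # Direct lookup: quantize the input to the nearest table index.
    index = round(a * TABLE_SIZE)
    return sign * ATAN_TABLE[index]
```

Comparing atan_lut(x) against math.atan(x) over a sweep of inputs shows the size of the error. In hardware the table would be a ROM indexed directly by the input bits, and the reciprocal step used here for |x| > 1 would need its own treatment.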

OpenCL even has specialist "fast inaccurate" SPIRV opcodes for some functions (SPIRV is part of Vulkan, and was originally based on LLVM IR). Search this page for "fast_" for examples:

https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html

The point is: 3D, ML and OpenCL are *nothing* like the Embedded Platform or UNIX Platform world. Everything that we think we know about how it should be done is completely wrong when it comes to this highly specialist, extremely diverse and unfortunately secretive market.

> 
> I have no problems ending up with different encodings and/or semantics than currently chosen, as long as that's done early enough and in a public manner so that we can implement without undue delay the chosen opcodes without being incompatible with the final spec.

Otherwise we get a repeat of the Altivec / SSE vector nightmare, and RISC-V is toast.

When we reach the layout milestone, the implementation will be frozen. We are not going to waste our sponsors' money: we have to act responsibly and get it right.  

Also, NLNet's funding, once allocated, is gone. We are therefore under time pressure to get the implementation done so that we can put in a second application for the layout.

Bottom line: we are not going to wait around. The consequences are too severe (loss of access to funding).

L.