[libre-riscv-dev] [isa-dev] Re: FP transcendentals (trigonometry, root/exp/log) proposal

Tue Aug 13 01:11:20 BST 2019

On Tuesday, August 13, 2019 at 1:52:16 AM UTC+8, MitchAlsup wrote:
> On Sunday, August 11, 2019 at 10:20:28 PM UTC-5, lkcl wrote:
> https://libre-riscv.org/ztrans_proposal/#khronos_equiv
> 
> 
> I would like to point out that the general implementations of ATAN2 do a bunch of special case checks and then simply call ATAN.

Appreciated.  I recorded these insights on the page (to move offpage, to discussion, at a later point).

> The bottom line is that I think you are choosing to make too many of these into OpCodes, making the hardware
> function/calculation unit (and sequencer) more complicated than necessary.

We do have to be careful to ensure that multiple disparate Platform implementors are happy, and that tends to suggest that the extension remains close to a RISCV ISA paradigm.

> ----------------------------------------------------------------------------------------------------------------------------------------------------
> I might suggest that if there were a way for a calculation to be performed and the result of that calculation
> chained to a subsequent calculation such that the precision of the result-becomes-operand is wider than
> what will fit in a register, then you can dramatically reduce the count of instructions in this category while retaining
> acceptable accuracy:
> 
> 
>      z = x / y
> can be calculated as::
>      z = x × (1/y)
> 
> 
> Where 1/y has about 26-to-32 bits of fraction. No, it's not IEEE 754-2008 accurate, but GPUs want speed and
> 1/y is fully pipelined (F32) while x/y cannot be (at reasonable area).

Sigh yehhh this is... ok let me put it this way. If we were doing a from-scratch dedicated GPU ISA (along the lines of proprietary GPUs, with the associated software RPC / IPC marshalling system between the completely disparate ISAs) I would in absolutely no way start from a RISC-V base.

That's not because the RISCV Foundation is a pain to deal with, it's for *technical* reasons, namely that 3D here is a retrofit onto an ISA that was designed for a completely different market.

> Given that one has the ability to carry (and process) more fraction bits, one can then do high precision
> multiplies of  π or other transcendental radixes.
> 
> 
> And GPUs have been doing this almost since the dawn of 3D.

Appreciated.  Background, first.  Can be skipped if short of time.

---

Basically what you are recommending is a microcode ISA. This is on the table as an option (an idea floated by Atif from Pixilica), and it is one that we are sort-of looking to put into the hardware of the Libre RISCV ALUs: a long "opcode" activates *parts* of the pipeline (pre and post FP normalisation and special cases) so that the pipeline can be shared between INT and FP.

Also, 64 bit will be performed by "recycling" intermediary results back through the pipeline, again under the control of that microcode-like long "opcode". In other words, it's an FSM with automatic operand forwarding.

What you describe - the special cases that turn ATAN2 into ATAN - could be performed conveniently within the "recycling" paradigm by carrying out the special cases as one "cycle", the DIV as another (or the mul and the 1/x as two) and finally the FSM hands the intermediate over to ATAN.
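
To make the three "cycles" concrete, here is a minimal software sketch of that sequencing: special cases first, then the divide, then the hand-over to ATAN. It is plain Python standing in for the hardware stages, the function name atan2_via_atan is made up for illustration, and the special-case table follows the usual C/IEEE 754 atan2 conventions as an assumption, not anything decided for the proposal.

```python
import math

def atan2_via_atan(y: float, x: float) -> float:
    """Illustrative only: atan2 as special cases + one divide + one atan."""
    # "Cycle" 1: special cases, resolved without touching divide or atan.
    if math.isnan(x) or math.isnan(y):
        return math.nan
    if y == 0.0:
        # The sign bit of x (including -0.0) selects +-pi versus +-0.
        if math.copysign(1.0, x) < 0:
            return math.copysign(math.pi, y)
        return math.copysign(0.0, y)
    if x == 0.0:
        return math.copysign(math.pi / 2, y)
    if math.isinf(x) or math.isinf(y):
        if math.isinf(x) and math.isinf(y):
            r = 3 * math.pi / 4 if x < 0 else math.pi / 4
        elif math.isinf(y):
            r = math.pi / 2
        else:  # x infinite, y finite and non-zero
            r = math.pi if x < 0 else 0.0
        return math.copysign(r, y)

    # "Cycle" 2: the divide (or a reciprocal followed by a multiply).
    q = y / x

    # "Cycle" 3: hand the intermediate to atan, then fix up the quadrant.
    r = math.atan(q)
    if x < 0:
        r += math.pi if y > 0 else -math.pi
    return r

print(atan2_via_atan(1.0, -1.0), math.atan2(1.0, -1.0))
```

The point of the sketch is only that nothing in stage 1 needs the transcendental unit, so in hardware it can be one pass through the pipeline with the result either returned directly or forwarded on.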

The nice thing about this microarchitecture is that the intermediate data can be of any width, as well as contain any number of intermediate operands.
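
A rough way to see why the width of the intermediate matters for the "mul and 1/x" route: emulate FP32 in software and compare a reciprocal that is squeezed back into 32 bits against one that is kept wide before the multiply. This is only a sketch. Python's binary64 is used as a stand-in for "wider than a register", which is an assumption and not the 26-to-32 fraction bits mentioned above, and the helper names (f32, div_exact, div_narrow, div_wide, mismatches) are made up for illustration.

```python
import struct

def f32(v: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", v))[0]

def div_exact(x: float, y: float) -> float:
    """Reference: the correctly rounded FP32 quotient."""
    return f32(x / y)

def div_narrow(x: float, y: float) -> float:
    """x * (1/y) with the reciprocal rounded to 32 bits between the two ops."""
    recip = f32(1.0 / y)
    return f32(x * recip)

def div_wide(x: float, y: float) -> float:
    """x * (1/y) with the reciprocal kept wide (binary64 as a stand-in)."""
    recip = 1.0 / y
    return f32(x * recip)

def mismatches(div, pairs) -> int:
    """Count operand pairs where div() differs from the reference quotient."""
    return sum(1 for x, y in pairs if div(x, y) != div_exact(x, y))

# Small integers are exactly representable in FP32, so no input rounding occurs.
pairs = [(float(x), float(y)) for x in range(1, 51) for y in range(1, 51)]
print("narrow reciprocal mismatches:", mismatches(div_narrow, pairs))
print("wide reciprocal mismatches:  ", mismatches(div_wide, pairs))
```

In the FSM picture, div_wide corresponds to forwarding the un-rounded reciprocal straight into the multiply stage, which is exactly what a fixed-width register file in between would prevent.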

My feeling is - and this is not ruling out the possibility - that microcode ops exposed at the actual ISA level would not only need a lot of thought, they would also need special attention to be paid to the register file (no longer 32 bits, it would be 36 or some other arbitrary width sufficient to store the intermediary results efficiently), and more besides.

That is complicated, and there is also the concern of deviating significantly from RISCV's ISA. It may even *increase* the number of opcodes, due to fragmentation into specialist micro-operations (such as the ATAN2 special cases).

If those special cases were done as RISCV operations, that's a *lot* of instructions to trade off against simply having ATAN2.

Overall then I think what I am talking myself into is support for the pseudo-microcode-like FSM engine within our design, with associated "feedback" back to the beginning of the pipeline(s).  It is not a full blown microcode design, yet has a similar effect, just without needing to expose microcode details to the actual ISA.

Other implementors may choose to do things differently, particularly those that stick to the UNIX Platform Accuracy profile.

So that is background.

---

We therefore, I think, have a case for bringing back ATAN and including ATAN2.

The reason is that whilst a microcode-like GPU-centric platform would do ATAN2 in terms of ATAN, a UNIX-centric platform would do it the other way round.

(that is the hypothesis, to be evaluated for correctness. feedback requested).
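
For reference, the "other way round" direction is a one-liner in software, which is part of why it looks attractive when ATAN2 is the natively implemented (and accurate) operation. A minimal sketch, with atan_via_atan2 as a made-up name and Python's math.atan2 standing in for a hardware ATAN2:

```python
import math

def atan_via_atan2(x: float) -> float:
    """Illustrative only: atan(x) is the angle of the point (1, x)."""
    return math.atan2(x, 1.0)

print(atan_via_atan2(1.0), math.atan(1.0))
```

No special-case stage and no divide is visible at this level: whatever ATAN2 does internally covers them.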

This is because we cannot compromise or prioritise one platform's speed/accuracy over another's. Penalising one implementor over another is neither reasonable nor desirable.

Thus all implementors, to keep interoperability, must have both opcodes, and may choose, at the architectural and routing level, which one to implement in terms of the other.

Allowing implementors to choose to add either opcode and let traps sort it out leaves an uncertainty in the software developer's mind: they cannot trust the hardware, available from many vendors, to be performant right across the board.

Standards are a pig.

L.