[libre-riscv-dev] fp special functions

Tue Aug 6 00:54:27 BST 2019

On Mon, Aug 5, 2019, 16:26 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Monday, August 5, 2019, Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
> >
> >>>
> >>> note that atan(x) and atanpi(x) are just atan2(x, 1.0) and atan2pi(x,
> >>> 1.0),
> >>> so the atan and atanpi instructions are not needed
> >>
> >>
> >> Ok great, will move them to pseudo op aliases.
> >>
> >>
> > Hang on... there's no immed for loading 1.0 into an FP reg, it's one of
> > the downsides of RISCV, a FLD is a hard requirement.
> >
> > Hmmm....
> >
>
> "Hmm" means, depending on what an implementor chooses to do, cospi may be
> more efficient than cos, or vice-versa.
>
> As a standard, we don't know which, therefore, to not impose that on
> implementors, we need both (mandatory)
>
I think it's a pretty bad idea to require the non-*pi versions of sin, cos,
and tan: every implementation method I've heard of starts with a modular
reduction of the argument, and that step needs a very accurate approximation
of pi in order to produce correctly rounded answers.
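
A minimal sketch of the reduction problem, in Python purely for
illustration. The function names and the 50-digit pi constant are
assumptions made for this example, not anything from a spec. It compares
reducing modulo the double-precision 2*pi with reducing modulo a more
precise 2*pi, and shows that the *pi-style reduction modulo 2.0 needs no
approximation of pi at all:

```python
import math
from fractions import Fraction

# pi to 50 decimal places (assumption: enough for the |x| <= 1e15 used here,
# far too short for the full f64 range).
PI_50 = Fraction("3.14159265358979323846264338327950288419716939937510")

def reduce_naive(x):
    # fmod itself is exact, but 2*math.pi is only a 53-bit approximation
    # of 2*pi, so the error grows with the number of periods removed.
    return math.fmod(x, 2.0 * math.pi)

def reduce_reference(x):
    # Exact rational arithmetic against the 50-digit value of 2*pi.
    return float(Fraction(x) % (2 * PI_50))

def reduce_pi_scaled(x):
    # Reduction as needed by sinpi/cospi: the period is exactly 2.0,
    # so the remainder is exact.
    return math.fmod(x, 2.0)

x = 1e15 + 0.5
print(reduce_naive(x), reduce_reference(x))  # these disagree noticeably
print(reduce_pi_scaled(x))                   # 0.5
```

For this x the two non-*pi reductions disagree by far more than rounding
error, while the modulo-2.0 reduction is exact.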

if you think that the *pi instructions should not be preferred, maybe we
can split out the trig functions into several extensions that are not
dependent on each other:

Ztrigpi: trig. *pi
sinpi
cospi
tanpi

Ztrignpi: trig. non-*pi
sin
cos
tan

Zarctrigpi: arc-trig. *pi
atan2pi
asinpi
acospi

Zarctrignpi: arc-trig. non-*pi
atan2
asin
acos

I think that the Ztrignpi extension is totally impractical to implement for
our GPU: for f64, the argument reduction needs the remainder after dividing
by 2*pi, which requires a 1000+ bit approximation of 2*pi.
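
A rough back-of-envelope for where a figure above 1000 bits comes from,
written as a small Python sketch. The function name and the guard-bit count
are assumptions for illustration only, and this is an estimate rather than
a derived bound:

```python
import sys

# IEEE 754 binary64 parameters as reported by the running interpreter.
MANT_BITS = sys.float_info.mant_dig   # 53 significand bits
MAX_EXP = sys.float_info.max_exp      # 1024: largest finite x is below 2**1024

# Assumption: extra bits to survive cancellation when x lies very close to
# a multiple of 2*pi. The value 64 is illustrative, not a proven bound.
GUARD_BITS = 64

def reduction_constant_bits(max_exp=MAX_EXP, mant_bits=MANT_BITS,
                            guard_bits=GUARD_BITS):
    # Write x = m * 2**e with m a mant_bits-bit integer, e <= max_exp - mant_bits.
    # Bits of 1/(2*pi) with weight 2**-e or more only contribute whole turns,
    # so the useful bits start near position e and must continue for the
    # width of m, the width of the result, and the guard bits.
    top_exp = max_exp - mant_bits
    return top_exp + 2 * mant_bits + guard_bits

print(reduction_constant_bits())             # f64 estimate
print(reduction_constant_bits(128, 24, 64))  # same estimate with f32 parameters
```

With these assumptions the f64 estimate lands above 1000 bits, while the
f32 estimate is a few hundred bits.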

All of Ztrigpi, Zarctrignpi, and Zarctrigpi are practical to implement.

Note that implementing sin, cos, and tan on top of the Ztrigpi instructions,
e.g. sin(x) = sinpi(x * (1.0 / pi)), is good enough to meet the accuracy
requirements for Vulkan and OpenCL.
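
The identity above can be sketched in Python as follows. Python's math
module has no sinpi, so sinpi_sketch below is a hypothetical stand-in
(exact reduction modulo 2.0, then math.sin on a small argument) and not a
model of any hardware implementation:

```python
import math

def sinpi_sketch(x):
    # Hypothetical stand-in for a sinpi instruction: computes sin(pi * x).
    r = math.fmod(x, 2.0)      # exact: the period of sinpi is 2.0
    if r > 1.0:                # bring r into [-1, 1]
        r -= 2.0
    elif r < -1.0:
        r += 2.0
    if r > 0.5:                # reflect into [-0.5, 0.5]
        r = 1.0 - r            # sin(pi*r) == sin(pi*(1 - r))
    elif r < -0.5:
        r = -1.0 - r           # sin(pi*r) == sin(pi*(-1 - r))
    return math.sin(math.pi * r)

INV_PI = 1.0 / math.pi         # constant computed once, outside any loop

def sin_via_sinpi(x):
    # sin(x) = sinpi(x * (1.0 / pi)), as in the text above.
    return sinpi_sketch(x * INV_PI)

for x in (0.1, 1.0, 2.5, -3.7, 10.0):
    print(x, sin_via_sinpi(x), math.sin(x))
```

The multiplication by 1.0 / pi introduces a small rounding error that
scales with the magnitude of x, so this route is not correctly rounded.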

Regarding frcp, fatan, and fatanpi, I think we should just use
fdiv/fatan2/fatan2pi and not even have pseudo-instructions, since the
compiler can hoist the flw/fcvt/fmv/etc. that loads 1.0 into an FP register
out of loops.
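
A small Python sketch of those substitutions, with the constant 1.0 set up
once outside the loop the way a compiler would hoist the load. The function
names are hypothetical, and since Python has no atan2pi, dividing by
math.pi is only a stand-in for it:

```python
import math

ONE = 1.0  # loaded once, outside the loop

def atan_via_atan2(x, one=ONE):
    # atan(x) == atan2(x, 1.0)
    return math.atan2(x, one)

def atanpi_via_atan2pi(x, one=ONE):
    # atanpi(x) == atan2pi(x, 1.0); the division by pi stands in for atan2pi
    return math.atan2(x, one) / math.pi

def rcp_via_div(x, one=ONE):
    # reciprocal expressed as an ordinary division
    return one / x

for v in (0.25, 1.0, -3.0, 1e6):
    print(v, atan_via_atan2(v), atanpi_via_atan2pi(v), rcp_via_div(v))
```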

Jacob