Multiple precisions of floating-point arithmetic Bradley Lucier 26 Feb 2006 18:17 UTC

Some floating-point applications need greater-than-64-bit-precision
arithmetic; two are mentioned below.

Perhaps this SRFI should tackle the problem of providing
floating-point arithmetics of various precisions.  If we think this
might be needed, then the specially-named-operator approach to
floating-point arithmetic suggested in this SRFI (which I like, by
the way) does not seem to scale well.
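
For concreteness, here is a minimal sketch of the scaling problem,
taking fl+, fl*, and flsqrt as the double-precision names in the
style this SRFI suggests; the f32- and f128-prefixed names are made
up here purely for illustration:

    ;; One complete operator set is needed per precision.
    (define (norm x y)                       ; 64-bit doubles
      (flsqrt (fl+ (fl* x x) (fl* y y))))
    (define (norm-32 x y)                    ; hypothetical 32-bit set
      (f32sqrt (f32+ (f32* x x) (f32* y y))))
    (define (norm-128 x y)                   ; hypothetical 128-bit set
      (f128sqrt (f128+ (f128* x x) (f128* y y))))

Every added precision would need its own complete family of +, -, *,
/, sqrt, sin, and so on, plus conversions among all the precisions.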

Common Lisp has an approach (distinct float subtypes such as
short-float, single-float, double-float, and long-float, selected by
exponent markers on literals and by type declarations) which is
perhaps cumbersome to use properly and may be error prone, but it
does allow for the implementation and use of differing precisions of
floating-point arithmetic where they are useful.

Or, if one wants to keep the special-name approach, perhaps one
could use the naming convention of C: "name" for the default
double-precision operation, "name"f for the single-precision
(32-bit) operation, and "name"l for the long-double operation
(whether 80-bit extended precision, 128-bit quad precision, or
128-bit pair-of-64-bit-doubles precision).
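
In the SRFI's style that might look something like the following;
the unsuffixed fl+ follows the SRFI's naming, while the suffixed
variants are hypothetical:

    (fl+  a b)   ; default: 64-bit double precision
    (fl+f a b)   ; hypothetical: 32-bit single precision
    (fl+l a b)   ; hypothetical: long double (80-bit extended,
                 ;   128-bit quad, or a pair of 64-bit doubles)

One set of base names with a one-character suffix per extra
precision grows much more slowly than a new name prefix for every
precision.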

Brad

Examples of effective use of 128-bit floating-point arithmetic:

The following problem was pointed out by Philip W Sharp at the
University of Auckland in a talk on the long-time simulation of the
solar system.

As computers get faster, round-off error accumulates more quickly,
and, indeed, scientists are reaching the end of usefulness of 64-bit
IEEE floating-point arithmetic for long-time simulations of the
behavior of the solar system.  There's a paper here that discusses
this issue:

http://anziamj.austms.org.au/V46/CTAC2004/Gra2/home.html

Basically, if you want to simulate the solar system for longer
times, you'll need an underlying arithmetic with more accuracy.

Beyond using extended-precision arithmetic for accurate evaluation of
the elementary functions, this was the first "real" application I
had heard of that needed more than 64-bit arithmetic.

Then Colin Percival published his paper "Rapid multiplication modulo
the sum and difference of highly composite numbers",

www.ams.org/mcom/2003-72-241/S0025-5718-02-01419-9/S0025-5718-02-01419-9.pdf

which gives new bounds on the error in FFTs implemented in
floating-point arithmetic.  These bounds let you use floating-point
FFTs to implement bignum arithmetic with inputs of size 256 *
(1024)^2 bits in 64-bit IEEE arithmetic, with proven accuracy.
(Most codes for FFT bignum arithmetic instead use number-theoretic
FFTs over finite fields.)  That is not as big as some applications
would like, but with 128-bit arithmetic (either so-called quad
precision, with a 15-bit exponent and 113-bit mantissa, or IBM-type
long double implemented as a pair of doubles, giving the same
dynamic range as 64-bit IEEE arithmetic but about 106 bits of
precision), one could very easily implement fast, provably accurate
bignum multiplication for sizes as big as one might ever need (and I
don't think I'll live long enough to see that statement made false).
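
To illustrate the pair-of-doubles representation, here is a minimal
sketch of double-double addition in portable Scheme using the
standard two-sum trick.  The names two-sum, quick-two-sum, and dd+
are mine; a serious implementation would use the SRFI's flonum
operators and take more care with renormalization and special
values:

    ;; A double-double is a pair (hi . lo) whose exact sum carries
    ;; about 106 bits of precision; hi holds the leading 53 bits.

    ;; Knuth's two-sum: the rounded sum of two doubles together with
    ;; the exact rounding error.
    (define (two-sum a b)
      (let* ((s  (+ a b))
             (bb (- s a))
             (e  (+ (- a (- s bb)) (- b bb))))
        (cons s e)))

    ;; Faster variant, valid when (abs a) >= (abs b).
    (define (quick-two-sum a b)
      (let ((s (+ a b)))
        (cons s (- b (- s a)))))

    ;; Add two double-doubles, renormalizing the result.
    (define (dd+ x y)
      (let* ((se (two-sum (car x) (car y)))
             (e  (+ (cdr se) (cdr x) (cdr y))))
        (quick-two-sum (car se) e)))

This is essentially the representation behind the IBM-style long
double mentioned above.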

I think that, given the effort and expense put into designing fast
floating-point arithmetic units, bignum arithmetic built on floating-
point FFTs will, in the end, be faster than the number-theoretic FFTs
now popular among the "really big bignum" folks.