Re: Floating-point formats and standards

Aubrey Jaffer 05 Jan 2005 17:03 UTC
 | Date: Wed, 5 Jan 2005 04:22:27 -0800
 | From: "Bradd W. Szonye" <xxxxxx@szonye.com>
 |
 | A couple of corrections to the 754R description:
 |
 | Bradd W. Szonye wrote:
 | > New name    Sig   Exp   Old name   Currently implemented by
 | >
 | > binary16     10     5
 | > binary32     23     8   single     all systems (hardware)
 | > binary64     52    11   double     all systems (hardware)
 | > binary80     64    15   extended   all x86-based systems (hardware)
 | > binary128   112    15   quad       most RISC systems (software)
 |
 | In the current draft, the extended format is called "binaryx"
 | instead of "binary80." Implementations are supposed to provide at
 | least one high-precision format for intermediate calculations,
 | either binary128 (a basic format) or binaryx (an
 | implementation-defined format about 50% more precise than its best
 | basic format).

Can these intermediate formats be stored in memory?
Can vectored instructions read and store the intermediate formats?

 | One proposal recommends specifying binaryx as the x86 extended
 | format. I have no way of knowing for certain, but I suspect that it
 | will succeed, since x86 and quad are the only /de facto/ standards
 | for high-precision IEEE 754 flonums.
 |
 | That proposal also states which formats a system should support.
 |
 | For high-performance technical systems:
 |
 |     Binary64 is mandatory for computation and storage.
 |     Binary32 is recommended for low-precision, high-density storage.
 |     Binaryx is recommended for computation on x86-compatible systems.
 |     Binary128 is recommended for expression evaluation (i.e., temps).
 |
 | For commercial and financial systems:
 |
 |     Decimal128 is mandatory for computation and storage.
 |     Decimal32 and decimal64 are recommended for storage.
 |
 | The binary requirements match reality pretty well, except for the
 | binary128 recommendation. (Currently, only x86 systems use high-
 | precision temps, and they use binaryx instead of binary128.)

Since these proposed names have sizes in bits, I see little cause to
replace those sizes with the longer names or the cryptic
abbreviations.

While reading through 6.2 Numbers:

  Machine representations such as fixed point and floating point are
  referred to by names such as fixnum and flonum.

So here is a possible naming based on that.  Since flonum and fixnum
are machine representations, they specify attributes other than the
numerical type.  With this encoding, rational fixnums could be added
(a la PL/I).

These abbreviations are pronounceable.  The fourth letter of the type
name is C for complex, R for real (Q for rational?), I for integer,
and N for nonnegative integer or natural number.

The "-" between the type name and precision could be removed.
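As a rough illustration of the naming rule above, here is a minimal
sketch that splits a type name into its parts.  The procedure name
decode-type-name is hypothetical and not part of any SRFI; it assumes
a three-letter representation prefix ("flo" or "fix"), a type letter
in the fourth position, and an optional "-" before the bit count.
It does not handle "boolean".

```scheme
;; Hypothetical helper, for illustration only: decode a name such as
;; "flor-64" into a list (representation numeric-type bits).
(define (decode-type-name name)
  (let* ((rep   (substring name 0 3))                 ; "flo" or "fix"
         (kind  (char-downcase (string-ref name 3)))  ; fourth letter
         ;; Skip the "-" if present, so "flor64" is accepted as well.
         (start (if (char=? (string-ref name 4) #\-) 5 4))
         (bits  (string->number
                 (substring name start (string-length name)))))
    (list (cond ((string-ci=? rep "flo") 'flonum)
                ((string-ci=? rep "fix") 'fixnum)
                (else 'unknown))
          (case kind
            ((#\c) 'complex)
            ((#\r) 'real)
            ((#\q) 'rational)
            ((#\i) 'integer)
            ((#\n) 'nonnegative-integer)
            (else 'unknown))
          bits)))

(decode-type-name "flor-64")   ; ==> (flonum real 64)
(decode-type-name "fixn16")    ; ==> (fixnum nonnegative-integer 16)
```

Because the bit count comes last and is purely numeric, dropping the
"-" does not make the names ambiguous in this sketch.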

Are fixnums and flonums necessarily binary?  Adding in a radix
indicator would gum up the works.

			      -=-=-=-=-

prototype   exact-                                      prefix
procedure   ness    element type                        (rank = n)
=========   =====   ============                        ==========
vector      any                                         #nA
A:floc-64   inexact IEEE 64.bit binary flonum complex   #nA:floc-64
A:floc-32   inexact IEEE 32.bit binary flonum complex   #nA:floc-32
A:flor-64   inexact IEEE 64.bit binary flonum real      #nA:flor-64
A:flor-32   inexact IEEE 32.bit binary flonum real      #nA:flor-32
A:fixi-64   exact   64.bit binary fixnum                #nA:fixi-64
A:fixi-32   exact   32.bit binary fixnum                #nA:fixi-32
A:fixi-16   exact   16.bit binary fixnum                #nA:fixi-16
A:fixi-8    exact   8.bit binary fixnum                 #nA:fixi-8
A:fixn-64   exact   64.bit nonnegative binary fixnum    #nA:fixn-64
A:fixn-32   exact   32.bit nonnegative binary fixnum    #nA:fixn-32
A:fixn-16   exact   16.bit nonnegative binary fixnum    #nA:fixn-16
A:fixn-8    exact   8.bit nonnegative binary fixnum     #nA:fixn-8
A:boolean           boolean                             #nA:boolean

A two-by-three array of nonnegative 16.bit integers is written:

#2A:fixn-16((0 1 2) (3 5 4))

Note that this is the external representation of an array, not an
expression evaluating to an array. Like vector constants, array
constants must be quoted:

'#2a:FIXN-16((0 1 2) (3 5 4))
               ==> #2A:fixn-16((0 1 2) (3 5 4))

Rank 0 arrays:

#0a sym
#0A:flor-32 237.0
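To make the shape of the prefix concrete, here is a small sketch that
pulls the rank and the optional type name out of a prefix string.  The
procedure parse-array-prefix is a hypothetical helper, not a proposed
reader extension; it assumes the prefix is well formed ("#", digits,
"A" or "a", then optionally ":" and a type name) and does no
validation.

```scheme
;; Hypothetical helper, for illustration only: split a prefix such as
;; "#2A:fixn-16" into a list (rank type-name), where type-name is #f
;; when the prefix names no element type.
(define (parse-array-prefix prefix)
  (let ((len (string-length prefix)))
    ;; Scan the digits that follow the leading "#".
    (let loop ((i 1))
      (if (and (< i len) (char-numeric? (string-ref prefix i)))
          (loop (+ i 1))
          ;; i now indexes the "A"; a type name, if any, follows "A:".
          (list (string->number (substring prefix 1 i))
                (if (> len (+ i 2))
                    (substring prefix (+ i 2) len)
                    #f))))))

(parse-array-prefix "#2A:fixn-16")   ; ==> (2 "fixn-16")
(parse-array-prefix "#0a")           ; ==> (0 #f)
```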

Semantics

(array-dimensions '#2A:fixn-16((0 1 2) (3 5 4))) ==> (2 3)

An equivalent array could have been created by

(define ra (make-array (A:fixn-16) 2 3))
(array-set! ra 0 0 0)
(array-set! ra 1 0 1)
(array-set! ra 2 0 2)
(array-set! ra 3 1 0)
(array-set! ra 5 1 1)
(array-set! ra 4 1 2)

Literal array constants are immutable objects. It is an error to
attempt to store a new value into a location that is denoted by an
immutable object.

The following definitions make the new names aliases for the SRFI-47
ones. SRFI-47 should be amended or replaced so that the new names
become the array-prototype-procedures:

(define A:floc-64 ac64)
(define A:floc-32 ac32)
(define A:flor-64 ar64)
(define A:flor-32 ar32)
(define A:fixi-64 as64)
(define A:fixi-32 as32)
(define A:fixi-16 as16)
(define A:fixi-8  as8)
(define A:fixn-64 au64)
(define A:fixn-32 au32)
(define A:fixn-16 au16)
(define A:fixn-8  au8)
(define A:boolean at1)