Re: Floating-point formats and standards
Aubrey Jaffer 05 Jan 2005 17:03 UTC
| Date: Wed, 5 Jan 2005 04:22:27 -0800
| From: "Bradd W. Szonye" <xxxxxx@szonye.com>
|
| A couple of corrections to the 754R description:
|
| Bradd W. Szonye wrote:
| > New name   Sig  Exp  Old name  Currently implemented by
| >
| > binary16    10    5
| > binary32    23    8  single    all systems (hardware)
| > binary64    52   11  double    all systems (hardware)
| > binary80    64   15  extended  all x86-based systems (hardware)
| > binary128  112   15  quad      most RISC systems (software)
|
| In the current draft, the extended format is called "binaryx"
| instead of "binary80." Implementations are supposed to provide at
| least one high-precision format for intermediate calculations,
| either binary128 (a basic format) or binaryx (an
| implementation-defined format about 50% more precise than its best
| basic format).
Can these intermediate formats be stored in memory?
Can vectored instructions read and store the intermediate formats?
| One proposal recommends specifying binaryx as the x86 extended
| format. I have no way of knowing for certain, but I suspect that it
| will succeed, since x86 and quad are the only /de facto/ standards
| for high-precision IEEE 754 flonums.
|
| That proposal also states which formats a system should support.
|
| For high-performance technical systems:
|
| Binary64 is mandatory for computation and storage.
| Binary32 is recommended for low-precision, high-density storage.
| Binaryx is recommended for computation on x86-compatible systems.
| Binary128 is recommended for expression evaluation (i.e., temps).
|
| For commercial and financial systems:
|
| Decimal128 is mandatory for computation and storage.
| Decimal32 and decimal64 are recommended for storage.
|
| The binary requirements match reality pretty well, except for the
| binary128 recommendation. (Currently, only x86 systems use high-
| precision temps, and they use binaryx instead of binary128.)
Since these proposed names have sizes in bits, I see little cause to
replace those sizes with the longer names or the cryptic
abbreviations.
While reading through section 6.2 (Numbers) of R5RS, I found:

  Machine representations such as fixed point and floating point are
  referred to by names such as fixnum and flonum.

So here is a possible naming scheme based on that. Since flonum and
fixnum are machine representations, they specify attributes other than
the numerical type. With this encoding, rational fixnums could be added
(a la PL/I).
These abbreviations are pronounceable. The fourth letter of the type
name is C for complex, R for real (and perhaps Q for rational), I for
integer, and N for nonnegative integer or natural number.
The "-" between the type name and precision could be removed.
Are fixnums and flonums necessarily binary? Adding in a radix
indicator would gum up the works.
-=-=-=-=-
prototype   exact-                                      prefix
procedure   ness     element type                       (rank = n)
=========   =====    ============                       ==========
vector               any                                #nA
A:floc-64   inexact  IEEE 64.bit binary flonum complex  #nA:floc-64
A:floc-32   inexact  IEEE 32.bit binary flonum complex  #nA:floc-32
A:flor-64   inexact  IEEE 64.bit binary flonum real     #nA:flor-64
A:flor-32   inexact  IEEE 32.bit binary flonum real     #nA:flor-32
A:fixi-64   exact    64.bit binary fixnum               #nA:fixi-64
A:fixi-32   exact    32.bit binary fixnum               #nA:fixi-32
A:fixi-16   exact    16.bit binary fixnum               #nA:fixi-16
A:fixi-8    exact    8.bit binary fixnum                #nA:fixi-8
A:fixn-64   exact    64.bit nonnegative binary fixnum   #nA:fixn-64
A:fixn-32   exact    32.bit nonnegative binary fixnum   #nA:fixn-32
A:fixn-16   exact    16.bit nonnegative binary fixnum   #nA:fixn-16
A:fixn-8    exact    8.bit nonnegative binary fixnum    #nA:fixn-8
A:boolean            boolean                            #nA:boolean
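To make the encoding concrete, here is a minimal sketch of how such a type name could be taken apart. The procedure decode-type-name is hypothetical (it is not part of SRFI-47 or of this proposal), it works on the part of the name after "A:", and it assumes lower-case names. It accepts the name with or without the "-" before the precision.

```scheme
;; Hypothetical helper, not part of SRFI-47 or of this proposal.
;; Decodes a type name such as "flor-64" into a list
;; (representation numeric-type bits), or returns #f if it does not fit.
(define (decode-type-name name)
  (let ((len (string-length name)))
    (and (> len 4)
         (let* ((head (substring name 0 3))
                ;; first three letters: the machine representation
                (repr (cond ((string=? head "flo") 'flonum)
                            ((string=? head "fix") 'fixnum)
                            (else #f)))
                ;; fourth letter: the numerical type
                (kind (case (string-ref name 3)
                        ((#\c) 'complex)
                        ((#\r) 'real)
                        ((#\q) 'rational)
                        ((#\i) 'integer)
                        ((#\n) 'nonnegative-integer)
                        (else #f)))
                ;; the "-" before the precision is treated as optional
                (start (if (char=? (string-ref name 4) #\-) 5 4))
                (bits (string->number (substring name start len))))
           (and repr
                kind
                (integer? bits)
                (positive? bits)
                (list repr kind bits))))))
```

With these definitions, (decode-type-name "flor-64") gives (flonum real 64), and "boolean" gives #f because it does not follow the four-letter pattern. A radix indicator is not handled, in line with the concern above that it would complicate the names.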
A two-by-three array of nonnegative 16.bit integers is written:
#2A:fixn-16((0 1 2) (3 5 4))
Note that this is the external representation of an array, not an
expression evaluating to an array. Like vector constants, array
constants must be quoted:
'#2a:FIXN-16((0 1 2) (3 5 4))
==> #2A:fixn-16((0 1 2) (3 5 4))
Rank 0 arrays:
#0a sym
#0A:flor-32 237.0
Semantics
(array-dimensions '#2A:fixn-16((0 1 2) (3 5 4))) ==> (2 3)
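As an illustration of where (2 3) comes from, the dimensions can be recovered from the rank in the prefix and the nested list that follows it. The sketch below is one plausible way to do that in portable Scheme; nested-dimensions is a hypothetical name, it does not use SRFI-47, and it assumes the nesting is regular (only the first element at each level is inspected).

```scheme
;; Hypothetical helper: computes the dimensions of array contents
;; written as nested lists, given the rank from the #nA prefix.
;; Assumes regular nesting; only the first element at each level is
;; inspected.
(define (nested-dimensions rank contents)
  (cond ((= rank 0) '())
        ((null? contents)
         ;; An empty dimension hides the remaining ones; report them as 0.
         (cons 0 (nested-dimensions (- rank 1) '())))
        (else
         (cons (length contents)
               (nested-dimensions (- rank 1) (car contents))))))
```

Here (nested-dimensions 2 '((0 1 2) (3 5 4))) gives (2 3), and a rank 0 array such as #0A:flor-32 237.0 gives the empty list.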
An equivalent array could have been created by
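;; make-array takes a prototype array, then the dimensions.
(define ra (make-array (A:fixn-16) 2 3))
;; array-set! takes the array, the new value, then the indices.
(array-set! ra 0 0 0)
(array-set! ra 1 0 1)
(array-set! ra 2 0 2)
(array-set! ra 3 1 0)
(array-set! ra 5 1 1)
(array-set! ra 4 1 2)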
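The element-by-element construction above follows the nested list in row-major order. As a rough sketch, the argument lists for those array-set! calls can be generated from the list itself. The procedure rank2-assignments is a hypothetical helper that handles rank 2 only and uses nothing beyond standard list operations.

```scheme
;; Hypothetical helper: for rank-2 contents given as a list of rows,
;; returns a list of (value row column) triples in row-major order.
;; Each triple matches the arguments that follow the array in
;; (array-set! array value row column).
(define (rank2-assignments rows)
  (let row-loop ((rows rows) (i 0) (acc '()))
    (if (null? rows)
        (reverse acc)
        (let col-loop ((cols (car rows)) (j 0) (acc acc))
          (if (null? cols)
              (row-loop (cdr rows) (+ i 1) acc)
              (col-loop (cdr cols)
                        (+ j 1)
                        (cons (list (car cols) i j) acc)))))))
```

For '((0 1 2) (3 5 4)) this yields ((0 0 0) (1 0 1) (2 0 2) (3 1 0) (5 1 1) (4 1 2)). On a system that provides SRFI-47, something like (for-each (lambda (t) (apply array-set! ra t)) (rank2-assignments '((0 1 2) (3 5 4)))) should then fill ra, though I have not tried that on any particular implementation.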
Literal array constants are immutable objects. It is an error to
attempt to store a new value into a location that is denoted by an
immutable object.
The following definitions make the new names aliases for the SRFI-47
names. SRFI-47 should be amended or replaced so that the new names
become the array prototype procedures:
(define A:floc-64 ac64)
(define A:floc-32 ac32)
(define A:flor-64 ar64)
(define A:flor-32 ar32)
(define A:fixi-64 as64)
(define A:fixi-32 as32)
(define A:fixi-16 as16)
(define A:fixi-8 as8)
(define A:fixn-64 au64)
(define A:fixn-32 au32)
(define A:fixn-16 au16)
(define A:fixn-8 au8)
(define A:boolean at1)
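The correspondence in these definitions is regular enough to state as a rule: floc maps to ac, flor to ar, fixi to as, fixn to au, with the bit count carried over, and boolean maps to at1. A small sketch of that rule follows. The procedure srfi-47-name is hypothetical, takes the new name without the "A:" prefix, and does not check that SRFI-47 actually defines a prototype for the given bit count.

```scheme
;; Hypothetical helper: maps a proposed type name (without the "A:"
;; prefix) to the name of the corresponding SRFI-47 prototype
;; procedure, or returns #f if the name does not fit the pattern.
;; It does not check that SRFI-47 defines that particular size.
(define (srfi-47-name name)
  (let ((len (string-length name)))
    (cond ((string=? name "boolean") "at1")
          ((and (> len 5) (char=? (string-ref name 4) #\-))
           (let ((head (substring name 0 4))
                 (bits (substring name 5 len)))
             (cond ((string=? head "floc") (string-append "ac" bits))
                   ((string=? head "flor") (string-append "ar" bits))
                   ((string=? head "fixi") (string-append "as" bits))
                   ((string=? head "fixn") (string-append "au" bits))
                   (else #f))))
          (else #f))))
```

For example, (srfi-47-name "flor-64") gives "ar64" and (srfi-47-name "fixn-8") gives "au8", matching the definitions above.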