Title

SRFI-56: Binary I/O

Author

Alex Shinn

Abstract

This SRFI extends Scheme with procedures to read and write binary data to and from ports, including utility procedures for writing various integer and floating point values in both big and little endian formats.

Related SRFIs

SRFI-36 provides the optional I/O Conditions used in the procedures below. The reference implementation uses portable SRFI-33 bitwise procedures.

Issues

Some Schemes may wish to distinguish between binary and non-binary ports as in Common-Lisp. As these can be layered on top of the current ports this may better be relegated to a separate SRFI.

Currently only standard big-endian and little-endian byte layouts are supported, though the API is forwards compatible with new endian orders.

I'm specifically seeking feedback on library procedure names, and the behavior of library procedures when insufficient input is available.

Table of Contents

Rationale

R5RS provides no portable means of reading or writing binary data, which is a prerequisite for handling binary data formats, implementing databases, creating encoding conversion libraries, among other uses typically required of programming languages.

Procedure Index

Specification

We extend Scheme with four new I/O primitives:

read-byte [port]
write-byte int [port]
peek-byte [port]
byte-ready? [port]

which behave similar to their R5RS -char analogs except that they take and return integer values representing a single octet from the port. Specifically, an octet is 8 bits (one byte), with a resulting range of [0-255]. WRITE-BYTE signals an error if not passed an integer within this range.

For Schemes that use ASCII or any of the single byte encodings (e.g. ISO-8859-*) as the native character encoding, and don't change the integer value of the characters from the native octet value, these new procedures can be defined as follows:

    (define (read-byte . opt)
      (let ((c (apply read-char opt)))
        (if (eof-object? c) c (char->integer c))))
    (define (write-byte int . opt)
      (apply write-char (integer->char int) opt))
    (define (peek-byte . opt)
      (let ((c (apply peek-char opt)))
        (if (eof-object? c) c (char->integer c))))
    (define byte-ready? char-ready?)
Schemes that use multi-byte encodings or don't handle arbitrary octets in I/O ports will have to define these as primitives. Moreover, in such cases ports can be corrupted by use of these procedures. Tagging ports as specifically intended for character I/O, binary I/O, or both is beyond the scope of this SRFI. Instead we recommend that implementations signal an error on any of READ-CHAR, PEEK-CHAR or CHAR-READY? when an invalid byte sequence is detected.

Note that CHAR-READY? should only return #t if a full character value is available. If the beginning of a valid multiple octet sequence is found but no additional octets are in the input port, then #f is returned and no error is signalled. BYTE-READY? can be used if you only wish to test the availability of any data regardless of character validity.

Schemes supporting SRFI-36 I/O Conditions may use the following condition for the case of an invalid encoding sequence:

   (define-condition-type &i/o-encoding-error &i/o-read-error
     i/o-encoding-error?
     (byte-sequence i/o-error-byte-sequence)
     (encoding i/o-error-encoding))

Library Procedures

The above extensions are sufficient to handle all forms of binary I/O, however are very low-level. We also provide the following 40 library procedures, which can be defined in terms of the above, although Schemes concerned about efficiency will probably wish to implement them at a lower level.

Procedures are described below with their parameter lists. Parameters in [ brackets ] are optional and may be omitted or passed a value of #f to revert to the default value. The default value of an input port is always the result of (current-input-port) and of an output port is (current-output-port).

Endianness

Most of the procedures below accept an optional ENDIAN parameter, which is a symbol defined to be either 'big-endian or 'little-endian. This interface allows for future addition of endian types such as 'middle-endian-3412 where needed, though this SRFI does not define them.

When not given the ENDIAN parameter defaults to the appropriate value for the current system's architecture. This value can be queried with the procedure:

default-endian

General Reading

read-binary-uint size [port] [endian]

Read an unsigned integer of SIZE octets from PORT (default current-input-port) with endianness ENDIAN (default to that of the local architecture). If fewer than SIZE octets are available in the port return the eof-object.

read-binary-sint size [port] [endian]

Read a signed integer in twos complement form of SIZE octets from PORT (default current-input-port) with endianness ENDIAN (default to that of the local architecture).

Schemes are not required to support the full numeric tower, and in particular if they do not support bignums they are unlikely to be able to provide the full range of 32-bit values. In this case care should be taken that when reading values, if the final result fits within the implementation's supported range the value should be read properly. In particular, small negative values should be supported, even though they may first be interpreted as large positive values before twos complement conversion.

If the resulting integer would not be supported by the Scheme's numeric range then the result should be the same as when an arithmetic operation produces an result outside the supported range, such as signalling and error or causing overflow.

Schemes that choose to use optimization strategies that limit their numeric range would be free to provide read procedures returning disjoint types. For instance, Bigloo could provide a read-binary-elong procedure to read an elong object (a Bigloo hardware integer).

Predefined Read Sizes

We provide the following predefined read sizes. Although the reference implementation defines them in terms of the general read-binary-uint above, significant performance gains are possible if you hand code them to the appropriate size.

read-binary-uint8 [port] [endian]
read-binary-uint16 [port] [endian]
read-binary-uint32 [port] [endian]
read-binary-uint64 [port] [endian]

Read and return an unsigned binary integer as in read-binary-uint, using the corresponding numeric suffix of the procedure name as SIZE.

read-binary-sint8 [port] [endian]
read-binary-sint16 [port] [endian]
read-binary-sint32 [port] [endian]
read-binary-sint64 [port] [endian]

Read and return a signed binary integer as in read-binary-sint, using the corresponding numeric suffix of the procedure name as SIZE.

General Writing

write-binary-uint size int [port] [endian]

Write unsigned integer INT of SIZE octets to PORT (default current-input-port) with endianness ENDIAN (default to that of the local architecture).

write-binary-sint size int [port] [endian]

Write signed integer INT of SIZE octets to PORT (default current-input-port) with endianness ENDIAN (default to that of the local architecture) in twos complement form.

Predefined Write Sizes

write-binary-uint8 int [port] [endian]
write-binary-uint16 int [port] [endian]
write-binary-uint32 int [port] [endian]
write-binary-uint64 int [port] [endian]

Write an unsigned binary integer as in write-binary-uint, using the corresponding numeric suffix of the procedure name as SIZE.

write-binary-sint8 int [port] [endian]
write-binary-sint16 int [port] [endian]
write-binary-sint32 int [port] [endian]
write-binary-sint64 int [port] [endian]

Write a signed binary integer as in write-binary-sint, using the corresponding numeric suffix of the procedure name as SIZE.

Predefined Network Encodings

For portability between different architectures it can be useful to use the standard "network" byte encoding (big-endian). On big-endian architectures these can simply be aliases for the general versions above.

read-network-uint16 [port]
read-network-uint32 [port]
read-network-uint64 [port]

read-network-sint16 [port]
read-network-sint32 [port]
read-network-sint64 [port]

write-network-uint16 int [port]
write-network-uint32 int [port]
write-network-uint64 int [port]

write-network-sint16 int [port]
write-network-sint32 int [port]
write-network-sint64 int [port]

Bignum Encodings

Since Schemes may support unlimited size bignums it is useful to support the binary encoding of such values.

A BER (Basic Encoding Rules from X.690) compressed integer is an unsigned integer in base 128, most significant digit first, where the high bit is set on all but the final (least significant) byte. Thus any size integer can be encoded, but the encoding is efficient and small integers don't take up any more space than they would in normal char/short/int encodings.

Examples of integers converted to BER byte sequences:

            3 => #x03
          555 => #x84 #x2B
    123456789 => #xBA #xEF #x9A #x15

read-ber-integer [port]

Reads and returns an exact integer, or the eof-object if no bytes without the high bit set (i.e. less than 128) are found.

write-ber-integer int [port]

Writes INT to the specified output port in BER format. Signals an error if INT is not a positive integer.

IEEE Floating Point Encodings

Floating point binary formats are much more complicated than simple two's complement integer formats, typically divided into a sign bit, exponent field and mantissa field, optionally using a hidden bit and different rounding behavior. Because of this we do not define general purpose floating point operations but simply provide the most common formats, IEEE-754 single and double precision floats.

On some architectures floating point is handled by a separate co-processor and is not guaranteed to use the same endian as integer values. We therefore use a separate default endian for floating point numbers.

default-float-endian

Returns the default endianness used for floating point procedures as a symbol, using the same symbol names as above for integer endians.

read-ieee-float32 [port] [endian]
read-ieee-float64 [port] [endian]

Reads an IEEE float, single or double precision respectively, from PORT in the given ENDIAN, and returns the corresponding inexact real value, or the eof-object if insufficient data is present.

If the Scheme implementation supports +/- Infinity or NaN, as IEEE floats or otherwise, the Scheme implementation may return these values for the IEEE defined bit patterns on read-ieee-float.

write-ieee-float32 real [port] [endian]
write-ieee-float64 real [port] [endian]

Write REAL to PORT in the given ENDIAN using IEEE floating point representation, single or double precision respectively. Signals an error if REAL is not a real value.

If the Scheme implementation supports +/- Infinity or NaN, as IEEE floats or otherwise, the Scheme implementation may accept these values for REAL and write the corresponding IEEE defined bit patterns.

Implementation

The reference implementation is available at

    http://srfi.schemers.org/srfi-56/srfi-56.scm
and has been placed under an open BSD-style license.

A corresponding test suite can be found at

    http://srfi.schemers.org/srfi-56/test-srfi-56.scm
The reference implementation has been tested with the following Schemes: Bigloo, Chicken, Gauche, Guile, Kawa, MzScheme, SISC.

    [At time of writing the tests do not all pass for all Schemes,
    largely do to issues with integer sizes (notably you may need to
    comment out some of the bignum literals in the test source) and
    varying floating point precision.]
The reference implementation uses only portable R5RS procedures and should work unmodified in any compliant Scheme. The API for a subset of SRFI-33 bitwise procedures was used but a portable implementation of these procedures included in the source itself, so the Scheme need not support SRFI-33 natively.

Care has been taken that intermediate values remain smaller than the final result, so that Schemes with limited numeric ranges will still read and write properly the values they do support.

The default endian for both integers and floating point numbers is set to 'little-endian at the start of the file, which is correct for x86 platforms. Most other architectures are 'big-endian and will need to be changed accordingly.

Optimization

The fastest implementations will of course be native (C or otherwise compiled), especially for the floating point operations. However, because it is fairly extensive, as well as tested and portable, many Schemes will choose to use some or all of the reference implementation directly. In this case the following optimizations can be made:

  1. Use native equivalents of the SRFI-33 bitwise operators instead of the portable versions. At the very least the SLIB portable versions are likely to be better optimized.

  2. Make read-byte and write-byte native.

  3. Drop the asserts.

  4. Specialize the predefined size procedures rather than define them in terms of the more general operations.

  5. Check the high bytes to determine if the end result fits within the Schemes native fixnum range, and use specialized fixnum operations in that case. See the code in Oleg's TIFF library OLEG1.

    I have not done any benchmarking but I suspect that in most cases the bottleneck is likely to be I/O rather than CPU, and extensive optimization (beyond 1 and 2 above) may not be worth the effort.

Acknowledgements

I would like to thank Oleg Kiselyov, Shiro Kawai and David Rush for their help in the pre-draft discussion, and of course all those who provide comments, questions and flames in the upcoming SRFI discussion.

References

R5RS
      R. Kelsey, W. Clinger, J. Rees (eds.), Revised^5 Report on the
      Algorithmic Language Scheme, Higher-Order and Symbolic
      Computation, 11(1), September, 1998 and ACM SIGPLAN Notices,
      33(9), October, 1998.
      http://www.schemers.org/Documents/Standards/R5RS/.
CommonLisp
      Common Lisp: the Language
      Guy L. Steele Jr. (editor).
      Digital Press, Maynard, Mass., second edition 1990.
      http://www.elwood.com/alu/table/references.htm#cltl2.
      http://www.harlequin.com/education/books/HyperSpec/.
ISO-C
      ISO Standard C ISO/IEC 9899:1999
      http://www.sics.se/~pd/ISO-C-FDIS.1999-04.pdf.
HOLY
      ON HOLY WARS AND A PLEA FOR PEACE
      Danny Cohen, IEN 137, April 1980.
      http://www.networksorcery.com/enp/ien/ien137.txt.
      http://www.isi.edu/in-notes/ien/ien137.txt.
IEEE-754
      Various IEEE-754 references and a calculator in JavaScript.
      http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html.
X.690
      ASN.1 encoding rules: Specification of Basic Encoding Rules
      (BER), Canonical Encoding Rules (CER) and Distinguished Encoding
      Rules (DER), February, 2002.
      http://www.itu.int/ITU-T/studygroups/com17/languages/.
OLEG1
      Various binary parsing utilities for Scheme.
      http://okmij.org/ftp/Scheme/binary-io.html.
OLEG2
      Oleg Kiselyov, Reading IEEE binary floats in R5RS Scheme.
      Article from comp.lang.scheme, on 8 March, 2000,
      Message-ID: <8a4h56$oqu$1@nnrp1.deja.com>.
      http://okmij.org/ftp/Scheme/reading-IEEE-floats.txt.

Copyright

Copyright (C) Alex Shinn 2004. All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Scheme Request For Implementation process or editors, except as needed for the purpose of developing SRFIs in which case the procedures for copyrights defined in the SRFI process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the authors or their successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE AUTHOR AND THE SRFI EDITORS DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.