SRFI-56: Binary I/O
Alex Shinn
This SRFI extends Scheme with procedures to read and write binary data to and from ports, including utility procedures for writing various integer and floating point values in both big and little endian formats.
SRFI-36 provides the optional I/O Conditions used in the procedures below. The reference implementation uses portable SRFI-33 bitwise procedures.
Some Schemes may wish to distinguish between binary and non-binary ports as in Common-Lisp. As these can be layered on top of the current ports this may better be relegated to a separate SRFI.
Currently only standard big-endian and little-endian byte layouts are supported, though the API is forwards compatible with new endian orders.
I'm specifically seeking feedback on library procedure names, and the behavior of library procedures when insufficient input is available.
R5RS provides no portable means of reading or writing binary data, which is a prerequisite for handling binary data formats, implementing databases, creating encoding conversion libraries, among other uses typically required of programming languages.
We extend Scheme with four new I/O primitives:
read-byte
[port]
write-byte
int [port]
peek-byte
[port]
byte-ready?
[port]
which behave similar to their R5RS -char analogs except that they take and return integer values representing a single octet from the port. Specifically, an octet is 8 bits (one byte), with a resulting range of [0-255]. WRITE-BYTE signals an error if not passed an integer within this range.
For Schemes that use ASCII or any of the single byte encodings (e.g. ISO-8859-*) as the native character encoding, and don't change the integer value of the characters from the native octet value, these new procedures can be defined as follows:
(define (read-byte . opt) (let ((c (apply read-char opt))) (if (eof-object? c) c (char->integer c))))
(define (write-byte int . opt) (apply write-char (integer->char int) opt))
(define (peek-byte . opt) (let ((c (apply peek-char opt))) (if (eof-object? c) c (char->integer c))))
(define byte-ready? char-ready?)Schemes that use multi-byte encodings or don't handle arbitrary octets in I/O ports will have to define these as primitives. Moreover, in such cases ports can be corrupted by use of these procedures. Tagging ports as specifically intended for character I/O, binary I/O, or both is beyond the scope of this SRFI. Instead we recommend that implementations signal an error on any of READ-CHAR, PEEK-CHAR or CHAR-READY? when an invalid byte sequence is detected.
Note that CHAR-READY? should only return #t if a full character value is available. If the beginning of a valid multiple octet sequence is found but no additional octets are in the input port, then #f is returned and no error is signalled. BYTE-READY? can be used if you only wish to test the availability of any data regardless of character validity.
Schemes supporting SRFI-36 I/O Conditions may use the following condition for the case of an invalid encoding sequence:
(define-condition-type &i/o-encoding-error &i/o-read-error i/o-encoding-error? (byte-sequence i/o-error-byte-sequence) (encoding i/o-error-encoding))
The above extensions are sufficient to handle all forms of binary I/O, however are very low-level. We also provide the following 40 library procedures, which can be defined in terms of the above, although Schemes concerned about efficiency will probably wish to implement them at a lower level.
Procedures are described below with their parameter lists. Parameters in [ brackets ] are optional and may be omitted or passed a value of #f to revert to the default value. The default value of an input port is always the result of (current-input-port) and of an output port is (current-output-port).
Most of the procedures below accept an optional ENDIAN parameter, which is a symbol defined to be either 'big-endian or 'little-endian. This interface allows for future addition of endian types such as 'middle-endian-3412 where needed, though this SRFI does not define them.
When not given the ENDIAN parameter defaults to the appropriate value for the current system's architecture. This value can be queried with the procedure:
default-endian
read-binary-uint
size [port] [endian]
Read an unsigned integer of SIZE octets from PORT (default current-input-port) with endianness ENDIAN (default to that of the local architecture). If fewer than SIZE octets are available in the port return the eof-object.
read-binary-sint
size [port] [endian]
Read a signed integer in twos complement form of SIZE octets from PORT (default current-input-port) with endianness ENDIAN (default to that of the local architecture).
Schemes are not required to support the full numeric tower, and in particular if they do not support bignums they are unlikely to be able to provide the full range of 32-bit values. In this case care should be taken that when reading values, if the final result fits within the implementation's supported range the value should be read properly. In particular, small negative values should be supported, even though they may first be interpreted as large positive values before twos complement conversion.
If the resulting integer would not be supported by the Scheme's numeric range then the result should be the same as when an arithmetic operation produces an result outside the supported range, such as signalling and error or causing overflow.
Schemes that choose to use optimization strategies that limit their numeric range would be free to provide read procedures returning disjoint types. For instance, Bigloo could provide a read-binary-elong procedure to read an elong object (a Bigloo hardware integer).
We provide the following predefined read sizes. Although the reference implementation defines them in terms of the general read-binary-uint above, significant performance gains are possible if you hand code them to the appropriate size.
read-binary-uint8
[port] [endian]
read-binary-uint16
[port] [endian]
read-binary-uint32
[port] [endian]
read-binary-uint64
[port] [endian]
Read and return an unsigned binary integer as in read-binary-uint, using the corresponding numeric suffix of the procedure name as SIZE.
read-binary-sint8
[port] [endian]
read-binary-sint16
[port] [endian]
read-binary-sint32
[port] [endian]
read-binary-sint64
[port] [endian]
Read and return a signed binary integer as in read-binary-sint, using the corresponding numeric suffix of the procedure name as SIZE.
write-binary-uint
size int [port] [endian]
Write unsigned integer INT of SIZE octets to PORT (default current-input-port) with endianness ENDIAN (default to that of the local architecture).
write-binary-sint
size int [port] [endian]
Write signed integer INT of SIZE octets to PORT (default current-input-port) with endianness ENDIAN (default to that of the local architecture) in twos complement form.
write-binary-uint8
int [port] [endian]
write-binary-uint16
int [port] [endian]
write-binary-uint32
int [port] [endian]
write-binary-uint64
int [port] [endian]
Write an unsigned binary integer as in write-binary-uint, using the corresponding numeric suffix of the procedure name as SIZE.
write-binary-sint8
int [port] [endian]
write-binary-sint16
int [port] [endian]
write-binary-sint32
int [port] [endian]
write-binary-sint64
int [port] [endian]
Write a signed binary integer as in write-binary-sint, using the corresponding numeric suffix of the procedure name as SIZE.
For portability between different architectures it can be useful to use the standard "network" byte encoding (big-endian). On big-endian architectures these can simply be aliases for the general versions above.
read-network-uint16
[port]
read-network-uint32
[port]
read-network-uint64
[port]
read-network-sint16
[port]
read-network-sint32
[port]
read-network-sint64
[port]
write-network-uint16
int [port]
write-network-uint32
int [port]
write-network-uint64
int [port]
write-network-sint16
int [port]
write-network-sint32
int [port]
write-network-sint64
int [port]
Since Schemes may support unlimited size bignums it is useful to support the binary encoding of such values.
A BER (Basic Encoding Rules from X.690) compressed integer is an unsigned integer in base 128, most significant digit first, where the high bit is set on all but the final (least significant) byte. Thus any size integer can be encoded, but the encoding is efficient and small integers don't take up any more space than they would in normal char/short/int encodings.
Examples of integers converted to BER byte sequences:
3 => #x03 555 => #x84 #x2B 123456789 => #xBA #xEF #x9A #x15
read-ber-integer
[port]
Reads and returns an exact integer, or the eof-object if no bytes without the high bit set (i.e. less than 128) are found.
write-ber-integer
int [port]
Writes INT to the specified output port in BER format. Signals an error if INT is not a positive integer.
Floating point binary formats are much more complicated than simple two's complement integer formats, typically divided into a sign bit, exponent field and mantissa field, optionally using a hidden bit and different rounding behavior. Because of this we do not define general purpose floating point operations but simply provide the most common formats, IEEE-754 single and double precision floats.
On some architectures floating point is handled by a separate co-processor and is not guaranteed to use the same endian as integer values. We therefore use a separate default endian for floating point numbers.
default-float-endian
Returns the default endianness used for floating point procedures as a symbol, using the same symbol names as above for integer endians.
read-ieee-float32
[port] [endian]
read-ieee-float64
[port] [endian]
Reads an IEEE float, single or double precision respectively, from PORT in the given ENDIAN, and returns the corresponding inexact real value, or the eof-object if insufficient data is present.
If the Scheme implementation supports +/- Infinity or NaN, as IEEE floats or otherwise, the Scheme implementation may return these values for the IEEE defined bit patterns on read-ieee-float.
write-ieee-float32
real [port] [endian]
write-ieee-float64
real [port] [endian]
Write REAL to PORT in the given ENDIAN using IEEE floating point representation, single or double precision respectively. Signals an error if REAL is not a real value.
If the Scheme implementation supports +/- Infinity or NaN, as IEEE floats or otherwise, the Scheme implementation may accept these values for REAL and write the corresponding IEEE defined bit patterns.
The reference implementation is available at
http://srfi.schemers.org/srfi-56/srfi-56.scmand has been placed under an open BSD-style license.
A corresponding test suite can be found at
http://srfi.schemers.org/srfi-56/test-srfi-56.scmThe reference implementation has been tested with the following Schemes: Bigloo, Chicken, Gauche, Guile, Kawa, MzScheme, SISC.
[At time of writing the tests do not all pass for all Schemes, largely do to issues with integer sizes (notably you may need to comment out some of the bignum literals in the test source) and varying floating point precision.]The reference implementation uses only portable R5RS procedures and should work unmodified in any compliant Scheme. The API for a subset of SRFI-33 bitwise procedures was used but a portable implementation of these procedures included in the source itself, so the Scheme need not support SRFI-33 natively.
Care has been taken that intermediate values remain smaller than the final result, so that Schemes with limited numeric ranges will still read and write properly the values they do support.
The default endian for both integers and floating point numbers is set to 'little-endian at the start of the file, which is correct for x86 platforms. Most other architectures are 'big-endian and will need to be changed accordingly.
The fastest implementations will of course be native (C or otherwise compiled), especially for the floating point operations. However, because it is fairly extensive, as well as tested and portable, many Schemes will choose to use some or all of the reference implementation directly. In this case the following optimizations can be made:
I have not done any benchmarking but I suspect that in most cases the bottleneck is likely to be I/O rather than CPU, and extensive optimization (beyond 1 and 2 above) may not be worth the effort.
I would like to thank Oleg Kiselyov, Shiro Kawai and David Rush for their help in the pre-draft discussion, and of course all those who provide comments, questions and flames in the upcoming SRFI discussion.
R. Kelsey, W. Clinger, J. Rees (eds.), Revised^5 Report on the Algorithmic Language Scheme, Higher-Order and Symbolic Computation, 11(1), September, 1998 and ACM SIGPLAN Notices, 33(9), October, 1998. http://www.schemers.org/Documents/Standards/R5RS/.
Common Lisp: the Language Guy L. Steele Jr. (editor). Digital Press, Maynard, Mass., second edition 1990. http://www.elwood.com/alu/table/references.htm#cltl2. http://www.harlequin.com/education/books/HyperSpec/.
ISO Standard C ISO/IEC 9899:1999 http://www.sics.se/~pd/ISO-C-FDIS.1999-04.pdf.
ON HOLY WARS AND A PLEA FOR PEACE Danny Cohen, IEN 137, April 1980. http://www.networksorcery.com/enp/ien/ien137.txt. http://www.isi.edu/in-notes/ien/ien137.txt.
Various IEEE-754 references and a calculator in JavaScript. http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html.
ASN.1 encoding rules: Specification of Basic Encoding Rules (BER), Canonical Encoding Rules (CER) and Distinguished Encoding Rules (DER), February, 2002. http://www.itu.int/ITU-T/studygroups/com17/languages/.
Various binary parsing utilities for Scheme. http://okmij.org/ftp/Scheme/binary-io.html.
Oleg Kiselyov, Reading IEEE binary floats in R5RS Scheme. Article from comp.lang.scheme, on 8 March, 2000, Message-ID: <8a4h56$oqu$1@nnrp1.deja.com>. http://okmij.org/ftp/Scheme/reading-IEEE-floats.txt.
Copyright (C) Alex Shinn 2004. All Rights Reserved.
This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Scheme Request For Implementation process or editors, except as needed for the purpose of developing SRFIs in which case the procedures for copyrights defined in the SRFI process must be followed, or as required to translate it into languages other than English.
The limited permissions granted above are perpetual and will not be revoked by the authors or their successors or assigns.
This document and the information contained herein is provided on an "AS IS" basis and THE AUTHOR AND THE SRFI EDITORS DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.