binary format for Scheme data
Marc Feeley 06 Oct 2019 19:33 UTC
Communication is an interesting topic that ties in to the work I have done on Termite Scheme, which is a language extension to Gambit meant for programming distributed systems. In distributed systems it is important to have efficient serialization/deserialization operations, and in the context of Termite this means a binary format for representing Scheme data.
It would be nice to have a common binary format, at least for Scheme data with an external representation and cyclic/shared data so that data can be communicated between different implementations of Scheme.
Gambit has a compact format that can represent all Scheme data including procedures and continuations (useful for distributing code between nodes of a distributed system). The procedures object->u8vector and u8vector->object perform the serialization and deserialization operations respectively:
procedure (object->u8vector obj [encoder])
procedure (u8vector->object u8vector [decoder])
The procedure object->u8vector returns a u8vector that contains the sequence of bytes
that encodes the object obj. The procedure u8vector->object decodes the sequence of
bytes contained in the u8vector u8vector, which was produced by the procedure
object->u8vector, and reconstructs an object structurally equal to the original object.
In other words the procedures object->u8vector and u8vector->object respectively perform
serialization and deserialization of Scheme objects. Note that some objects are
non-serializable (e.g. threads, wills, some types of ports, and any object containing
a non-serializable object).
The optional encoder and decoder parameters are single parameter procedures which default
to the identity function. The encoder procedure is called during serialization. As the
serializer walks through obj, it calls the encoder procedure on each sub-object X that
is encountered. The encoder transforms the object X into an object Y that will be serialized
instead of X. Similarly the decoder procedure is called during deserialization. When an
object Y is encountered, the decoder procedure is called to transform it into the object
X that is the result of deserialization.
The encoder and decoder procedures are useful to customize the serialized representation
of objects. In particular, it can be used to define the semantics of serializing objects,
such as threads and ports, that would otherwise not be serializable. The decoder procedure
is typically the inverse of the encoder procedure, i.e. (decoder (encoder X)) = X.
For example:
> (define (make-adder x) (lambda (y) (+ x y)))
> (define f (make-adder 10))
> (define a (object->u8vector f))
> (define b (u8vector->object a))
> (u8vector-length a)
1639
> (f 5)
15
> (b 5)
15
> (pp b)
(lambda (y) (+ x y))
The byte representation is a fairly straightforward prefix representation that has been designed for compactness. Some common data, such as booleans and exact integers between 0 and 10, are represented with a single byte. Here are a few examples:
> (object->u8vector #f)
#u8(112)
> (object->u8vector #t)
#u8(113)
> (object->u8vector '())
#u8(114)
> (object->u8vector 0)
#u8(80)
> (object->u8vector 10)
#u8(90)
> (object->u8vector 11)
#u8(94 11)
> (object->u8vector (- (expt 2 31) 1))
#u8(91 255 255 255 127)
> (object->u8vector (expt 2 200))
#u8(95 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1)
> (object->u8vector "a")
#u8(17 97)
> (object->u8vector "abcdefghi")
#u8(25 97 98 99 100 101 102 103 104 105)
> (object->u8vector (cons 0 10))
#u8(100 80 90)
Marc