binary format for Scheme data

binary format for Scheme data Marc Feeley 06 Oct 2019 19:33 UTC
Communication is an interesting topic that ties in to the work I have done on Termite Scheme, which is a language extension to Gambit meant for programming distributed systems.  In distributed systems it is important to have efficient serialization/deserialization operations, and in the context of Termite this means a binary format for representing Scheme data.

It would be nice to have a common binary format, at least for Scheme data with an external representation and cyclic/shared data so that data can be communicated between different implementations of Scheme.

Gambit has a compact format that can represent all Scheme data including procedures and continuations (useful for distributing code between nodes of a distributed system).  The procedures object->u8vector and u8vector->object perform the serialization and deserialization operations respectively:

  procedure (object->u8vector obj [encoder])
  procedure (u8vector->object u8vector [decoder])

    The procedure object->u8vector returns a u8vector that contains the sequence of bytes
    that encodes the object obj. The procedure u8vector->object decodes the sequence of
    bytes contained in the u8vector u8vector, which was produced by the procedure
    object->u8vector, and reconstructs an object structurally equal to the original object.
    In other words the procedures object->u8vector and u8vector->object respectively perform
    serialization and deserialization of Scheme objects. Note that some objects are
    non-serializable (e.g. threads, wills, some types of ports, and any object containing
    a non-serializable object).

    The optional encoder and decoder parameters are single parameter procedures which default
    to the identity function. The encoder procedure is called during serialization. As the
    serializer walks through obj, it calls the encoder procedure on each sub-object X that
    is encountered. The encoder transforms the object X into an object Y that will be serialized
    instead of X. Similarly the decoder procedure is called during deserialization. When an
    object Y is encountered, the decoder procedure is called to transform it into the object
    X that is the result of deserialization.

    The encoder and decoder procedures are useful to customize the serialized representation
    of objects. In particular, it can be used to define the semantics of serializing objects,
    such as threads and ports, that would otherwise not be serializable. The decoder procedure
    is typically the inverse of the encoder procedure, i.e. (decoder (encoder X)) = X.

    For example:

    > (define (make-adder x) (lambda (y) (+ x y)))
    > (define f (make-adder 10))
    > (define a (object->u8vector f))
    > (define b (u8vector->object a))
    > (u8vector-length a)
    1639
    > (f 5)
    15
    > (b 5)
    15
    > (pp b)
    (lambda (y) (+ x y))

The byte representation is a fairly straightforward prefix representation that has been designed for compactness.  Some common data, such as booleans and exact integers between 0 and 10, are represented with a single byte.  Here are a few examples:

> (object->u8vector #f)
#u8(112)
> (object->u8vector #t)
#u8(113)
> (object->u8vector '())
#u8(114)
> (object->u8vector 0)
#u8(80)
> (object->u8vector 10)
#u8(90)
> (object->u8vector 11)
#u8(94 11)
> (object->u8vector (- (expt 2 31) 1))
#u8(91 255 255 255 127)
> (object->u8vector (expt 2 200))
#u8(95 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1)
> (object->u8vector "a")
#u8(17 97)
> (object->u8vector "abcdefghi")
#u8(25 97 98 99 100 101 102 103 104 105)
> (object->u8vector (cons 0 10))
#u8(100 80 90)

Marc