Attempt at a stack of data formats to make everyone happy
Lassi Kortela 19 Sep 2019 17:28 UTC
OK, let's concede that this Scheme database mailing list has turned into the de facto Scheme data encoding mailing list as well :) We can rationalize this by saying that both topics deal with persistence.

What we have learned so far:

- Text is advantageous in some situations, binary in others.

- Even when one has data in text/binary, it's often useful to be able to transform that data into binary/text for a particular job and then back again. In general, things are easier when picking text or binary doesn't pin you down to that choice forever, such that you can't manipulate the data in the other kind of format without painful, lossy conversions and manual munging. It's great when you can just say "here's some textual data I have, make it binary" or "here's some binary data I have, make it textual" and the conversion is lossless and automatic (the first sketch at the end of this message shows the equation such a pair of conversions should satisfy).

- That is easiest to accomplish by first designing _data models_ instead of encodings. The model says what data types you have and what the range of each type is. Then simply design a pair of equivalent text and binary encodings for each data model, so that all values of those types can be represented in either encoding.

- The data models are most naturally approached as a stack of growing complexity. For simple jobs, where one wants simple, fast implementations and the widest interoperability, it's nice to have a data model like JSON's (or even simpler) with a small handful of universally useful data types (the second sketch below pins down one possible such model).

- For more complex jobs like pickling arbitrary data, it's useful to spec out more complex data models, conceding that the encodings will turn out a bit more complex as well.

- Application-specific formats should be built on top of the above generic formats. There's a kind of Zawinski's law at work here: non-hierarchical application formats inevitably expand into hierarchical ones; those which cannot so expand are replaced by ones which can (or, more frequently, hierarchical extensions are violently shoehorned onto the staunchly non-hierarchical ones to create weird franken-formats).

So what we need to do is:

- Survey all the existing data representations around Lisp (a task at which John has already made a fine start with his spreadsheet).

- Then figure out a good stack of generic data models (from simple to complex) that serves a wide range of applications of different kinds and complexities. This can be done by subsetting the simple data models from the most complex uber-model. Maybe we can make do with about 3 models: trivial -> intermediate -> the kitchen sink.

- Once we have a good stack of data models, design dual text and binary formats for each model, so that each pair of text and binary formats can represent exactly the same data, and round-tripping arbitrary data between text and binary is trivial and can be done any time we please by calling on a standard library.

- The stack of data models should also be designed such that converting data from a more complex model to a simpler one is reasonable in cases where the source data is entirely or mostly representable in the simpler target model already (the third sketch below gestures at this).

Opinions?
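To make the round-tripping point concrete, here's a minimal sketch in R7RS Scheme. Only the textual half exists today (it's just the standard reader and writer behind string ports); the binary half is hypothetical, and the names are for illustration, not a proposal.

    (import (scheme base) (scheme read) (scheme write))

    ;; The textual half of the round trip, using the reader and
    ;; writer we already have.  The point is that a binary encoding
    ;; for the same data model should satisfy the same equation:
    ;; (equal? datum (binary->datum (datum->binary datum))) => #t.
    (define (datum->text datum)
      (let ((out (open-output-string)))
        (write datum out)
        (get-output-string out)))

    (define (text->datum text)
      (read (open-input-string text)))

    (define sample '(1 2.5 "hello" #u8(0 255) (nested . pair)))
    (equal? sample (text->datum (datum->text sample)))  ; => #t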
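Second, one way to pin down a data model before arguing about encodings is as a predicate over Scheme values. Here's a guess at the trivial, JSON-like layer; the exact type inventory is of course up for discussion.

    (import (scheme base))

    ;; A guess at the trivial layer: booleans, real numbers,
    ;; strings, and proper lists thereof.  The intermediate and
    ;; kitchen-sink models would each accept a strict superset of
    ;; these values.
    (define (trivial-datum? x)
      (or (boolean? x)
          (real? x)
          (string? x)
          (and (list? x)
               (let loop ((x x))
                 (or (null? x)
                     (and (trivial-datum? (car x))
                          (loop (cdr x))))))))

    (trivial-datum? '(1 "two" (#t 3.5)))  ; => #t
    (trivial-datum? '(a . b))             ; => #f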
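Third, the downconversion idea as a sketch: walk a richer datum, map it into the trivial model where an obvious image exists, and fail loudly where none does. The single rule shown (symbols become strings) is an example policy, not a recommendation.

    (import (scheme base))

    ;; Downconvert from a richer model into the trivial model
    ;; above.  Anything with no sensible image in the target model
    ;; is an error rather than a silent loss.
    (define (downconvert x)
      (cond ((symbol? x) (symbol->string x))
            ((or (boolean? x) (real? x) (string? x)) x)
            ((list? x) (map downconvert x))
            (else (error "not representable in the trivial model" x))))

    (downconvert '(config (verbose #t) (level 3)))
    ;; => ("config" ("verbose" #t) ("level" 3))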