Attempt at a stack of data formats to make everyone happy
Lassi Kortela 19 Sep 2019 17:28 UTC
OK, let's concede that this Scheme database mailing list has turned into the de facto Scheme data encoding mailing list as well :) We can rationalize this by saying that both topics deal with persistence.

What we have learned so far:

- Text is advantageous in some situations, binary in others.

- Even when one has data in text/binary, it's often useful to be able to transform that data into binary/text for a particular job and then back again. In general, things are easier when picking text or binary doesn't pin you down to that choice forever, such that you can't manipulate the data in the other kind of format without painful, lossy conversions and manual munging. It's great when you can just say "here's some textual data I have, make it binary" or "here's some binary data I have, make it textual" and the conversion is lossless and automatic (the first sketch at the end of this message shows the equation such a pair of conversions should satisfy).

- That is easiest to accomplish by first designing _data models_ instead of encodings. The model says what data types you have and what the range of each type is. Then simply design a pair of equivalent text and binary encodings for each data model, so that all values of those types can be represented in either encoding.

- The data models are most naturally approached as a stack of growing complexity. For simple jobs, where one wants simple, fast implementations and the widest interoperability, it's nice to have a data model like JSON's (or even simpler) with a small handful of universally useful data types (the second sketch below pins down one possible such model).

- For more complex jobs like pickling arbitrary data, it's useful to spec out more complex data models, conceding that the encodings will turn out a bit more complex as well.

- Application-specific formats should be built on top of the above generic formats. There's a kind of Zawinski's law at work here: non-hierarchical application formats inevitably expand into hierarchical ones; those which cannot so expand are replaced by ones which can (or, more frequently, hierarchical extensions are violently shoehorned onto the staunchly non-hierarchical ones to create weird franken-formats).

So what we need to do is:

- Survey all the existing data representations around Lisp (a task at which John has already made a fine start with his spreadsheet).

- Then figure out a good stack of generic data models (from simple to complex) that serves a wide range of applications of different kinds and complexities. This can be done by subsetting the simple data models from the most complex uber-model. Maybe we can make do with about 3 models: trivial -> intermediate -> the kitchen sink.

- Once we have a good stack of data models, design dual text and binary formats for each model, so that each pair of text and binary formats can represent exactly the same data, and round-tripping arbitrary data between text and binary is trivial and can be done any time we please by calling on a standard library.

- The stack of data models should also be designed such that converting data from a more complex model to a simpler one is reasonable in cases where the source data is entirely or mostly representable in the simpler target model already (the third sketch below gestures at this).

Opinions?
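To make the round-tripping point concrete, here's a minimal sketch in R7RS Scheme. Only the textual half exists today (it's just the standard reader and writer behind string ports); the binary half is hypothetical, and the names are for illustration, not a proposal.

    (import (scheme base) (scheme read) (scheme write))

    ;; The textual half of the round trip, using the reader and
    ;; writer we already have.  The point is that a binary encoding
    ;; for the same data model should satisfy the same equation:
    ;; (equal? datum (binary->datum (datum->binary datum))) => #t.
    (define (datum->text datum)
      (let ((out (open-output-string)))
        (write datum out)
        (get-output-string out)))

    (define (text->datum text)
      (read (open-input-string text)))

    (define sample '(1 2.5 "hello" #u8(0 255) (nested . pair)))
    (equal? sample (text->datum (datum->text sample)))  ; => #t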
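Second, one way to pin down a data model before arguing about encodings is as a predicate over Scheme values. Here's a guess at the trivial, JSON-like layer; the exact type inventory is of course up for discussion.

    (import (scheme base))

    ;; A guess at the trivial layer: booleans, real numbers,
    ;; strings, and proper lists thereof.  The intermediate and
    ;; kitchen-sink models would each accept a strict superset of
    ;; these values.
    (define (trivial-datum? x)
      (or (boolean? x)
          (real? x)
          (string? x)
          (and (list? x)
               (let loop ((x x))
                 (or (null? x)
                     (and (trivial-datum? (car x))
                          (loop (cdr x))))))))

    (trivial-datum? '(1 "two" (#t 3.5)))  ; => #t
    (trivial-datum? '(a . b))             ; => #f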
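Third, the downconversion idea as a sketch: walk a richer datum, map it into the trivial model where an obvious image exists, and fail loudly where none does. The single rule shown (symbols become strings) is an example policy, not a recommendation.

    (import (scheme base))

    ;; Downconvert from a richer model into the trivial model
    ;; above.  Anything with no sensible image in the target model
    ;; is an error rather than a silent loss.
    (define (downconvert x)
      (cond ((symbol? x) (symbol->string x))
            ((or (boolean? x) (real? x) (string? x)) x)
            ((list? x) (map downconvert x))
            (else (error "not representable in the trivial model" x))))

    (downconvert '(config (verbose #t) (level 3)))
    ;; => ("config" ("verbose" #t) ("level" 3))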