Re: Sharpsign syntax for hashtables, sets, bytevectors, etc. Lassi Kortela 25 Sep 2019 09:53 UTC

> I think if we want to allow uninterested readers to skip, we need to
> decide how much to constrain the power of custom parsers for custom
> types, lest finding the end of the expression turn out to be arbitrarily
> complicated! This could either be:
>
> 1) Mandate that the syntax of the stuff in {} is actually the same as a
> normal sexpr list () - a space-delimited list of sexprs.
>
> 2) Make it an arbitrary string, but with rules about how embedded }s are
> represented - some combination of quoting mechanisms and saying that
> balanced pairs of {} are OK.
>
> I prefer the former.

Me too. IMNSHO, option 1 is a thousand times preferable to option 2 :)

> The latter might be better for really complex
> grammars, if they can be bent to work with the mandated quoting rules -
> I'm struggling to think of good examples, perhaps #xml{Hello
> <em>world</em>!} would be handy for people writing Web applications. But
> the former can do something nearly as good: #xml{"Hello <em>world</em>!"}.

Embedded XML is a good illustration of how quickly embedded custom
lexers turn chaotic. (But XML is not the worst example by any means - it
can and will get more baffling. Mix PHP, HTML, CSS and/or JavaScript in
the same file and things get real :)

Not to judge, but there is a segment of people who absolutely love
lexers and parsers, and will jump at the chance to re-invent a
poorly-thought-out custom syntax for the simplest of jobs. Then the rest
of us have to scratch our heads over how to parse those things and how
to implement basic tools like code walkers and syntax highlighters. So
based on both principle (complexity budget) and repeated experience, I'd
be strongly in favor of having very simple core S-exps, building complex
types by nesting those, and banhammering all custom lexer extensions in
the format :) Let people keep any lexer extensions in their own Lisp
code, where they are the only ones who will have to deal with them.
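
For example (just a sketch - the element names here are made up, in an
SXML-style encoding):

;; Hypothetical custom lexer extension, the kind I'd rather ban:
;;   #xml{Hello <em>world</em>!}
;; The same information as plain nested S-expressions, which any
;; standard reader can parse and skip without knowing what "xml" means:
'(xml "Hello " (em "world") "!")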

> I dimly recall being impressed by Erlang's binary literal syntax, let me
> do some research and try to remember why...
>
> https://forfunand.wordpress.com/2011/10/10/why-erlangs-binary-syntax-is-awesome/

That is quite impressive as a pattern-matching language.

> I guess the options are:
>
> 1) Hexadecimal
> 2) Strings with quoting for unprintable/delimiter characters
> 3) base64 or similar
> 4) Some hybrid, where the contents of a bytevector literal are a series
> of lexically distinguishable segments that could use any of the above,
> and are concatenated together. Might even drop the quoting in strings
> and force any non-printable or delimiter characters to drop out of string
> mode and go in hex. #u8("This is a null-terminated string" 0). #u8("This
> is an embedded block of really random entropy: "
> :GSVGxo89Ab6QX4D8l9KWzQ== " - I hope you like it.") - where I have
> purely arbitrarily chosen ':' as a prefix to base64 values to
> distinguish them from hex values.
>
> (2) makes bits of a bytevector that happen to also be valid ascii or
> utf-8 text "readable", but is more complicated to generate/parse and
> ends up as a worse form of (1) for very unprintable stuff.

The thing is that character encoding is easily messed up by running
"iconv". (This problem concerns ordinary strings too.)

Permitting arbitrary text strings in bytevector literals may be
desirable (I don't really have an opinion), but if it's permitted, I'd
suggest limiting the character set to printable ASCII, which is hard to
mess up via charset conversion and text editing.
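
Concretely, "printable ASCII" would mean something like the following
(just a sketch; the exact range is open to debate):

;; A byte is safe to show inside a string segment if it is printable
;; ASCII, i.e. space (32) through tilde (126). Anything else would be
;; written some other way, e.g. as hex.
(define (printable-ascii-byte? b)
  (and (>= b 32) (<= b 126)))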

> (3) is dense and simple to process for machines, but is totally
> meaningless to humans.

Then again, hex is almost as meaningless unless you have ascended :)

> (1) is like a watered down form of (3), a little easier for humans to
> make some sense of (as a kid hacking around under MS-DOS, I quickly
> learnt to read some parts of x86 machine code in hex... the phrase "CD
> 21" is forever burned into my memory)

Yes. Hex is like decaffeinated base64.

> (4) is probably optimal from a human readability perspective, IF the
> encoder makes smart choices about what encodings to use where, but is
> the most complicated to implement.

The challenge is to keep the bit density of hex or base64 while also
permitting human-readable ASCII strings to be interspersed with them.

The obvious one doesn't work, since a reader can't tell whether "0d0a"
means two hex bytes or four literal characters: #u8("hello" "0d0a" "world")

We could invent a prefix for hex strings, but that's a bit dodgy:

#u8("hello" #"0d0a" "world")
#u8("hello" #u8"0d0a" "world")

Or just give up on bit density and go back to Scheme hex numbers:

#u8("hello" #x0d #x0a "world")

The thing is that with one #x number per byte, readability and parsing
speed also suffer. The lost bit density is a lesser problem than those,
IMHO.
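
Here's a rough sketch of a writer for that last notation: string
segments for runs of printable ASCII, #x bytes for everything else.
(All the details here are assumptions, not a spec.)

;; Sketch: print a bytevector as #u8(...), using string segments for
;; runs of printable ASCII and #xNN for the remaining bytes. Escapes
;; " and \ inside strings.
(define (write-u8-literal bv)
  (define len (bytevector-length bv))
  (define (printable? b) (and (>= b 32) (<= b 126)))
  (display "#u8(")
  (let loop ((i 0) (first? #t))
    (cond
      ((>= i len)
       (display ")"))
      ((printable? (bytevector-u8-ref bv i))
       (unless first? (display " "))
       (display "\"")
       (let run ((j i))
         (if (and (< j len) (printable? (bytevector-u8-ref bv j)))
             (let ((c (integer->char (bytevector-u8-ref bv j))))
               (when (or (char=? c #\") (char=? c #\\))
                 (display "\\"))
               (display c)
               (run (+ j 1)))
             (begin (display "\"")
                    (loop j #f)))))
      (else
       (unless first? (display " "))
       (let ((b (bytevector-u8-ref bv i)))
         (display "#x")
         (when (< b 16) (display "0"))
         (display (number->string b 16)))
       (loop (+ i 1) #f)))))

For example, (write-u8-literal (string->utf8 "hello\r\nworld")) would
print #u8("hello" #x0d #x0a "world").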