Re: Sharpsign syntax for hashtables, sets, bytevectors, etc. Alaric Snell-Pym 25 Sep 2019 09:33 UTC
On 25/09/2019 09:25, Lassi Kortela wrote:
> It might be nice if the hash syntax is always #symbol{...} which is a
> mash-up of all our suggestions so far. And if the symbol could
> optionally be reverse-DNS, so I could make #io.lassi.whizbang{...} for
> my own syntax extensions without disturbing others, and the {...}
> wrapping would guarantee an easy way for uninterested (machine and
> human) readers to skip it.

Ah, yes, I forgot to add a private extension mechanism when talking
about having an SRFI-driven registry - I'm getting sloppy!

I think if we want to allow uninterested readers to skip, we need to
decide how much to constrain the power of custom parsers for custom
types, lest finding the end of the expression turn out to be arbitrarily
complicated! This could either be:

1) Mandate that the syntax of the stuff in {} is actually the same as a
normal sexpr list () - a space-delimited list of sexprs.

2) Make it an arbitrary string, but with rules about how embedded }s are
represented - some combination of quoting mechanisms and saying that
balanced pairs of {} are OK.

I prefer the former. The latter might be better for really complex
grammars, if they can be bent to work with the mandated quoting rules -
I'm struggling to think of good examples, perhaps #xml{Hello
<em>world</em>!} would be handy for people writing Web applications. But
the former can do something nearly as good: #xml{"Hello <em>world</em>!"}.
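
To make the "easy to skip" property of option (1) concrete, here is a minimal sketch in Python of how an uninterested reader might find the end of a #symbol{...} form without understanding its contents. It rests on assumptions that nothing in this thread has settled: strings are double-quoted with backslash escapes, only () and {} nest, and there are no comments or character literals to worry about. The name skip_extension is made up for illustration.

```python
def skip_extension(text, start=0):
    """Return the index just past the '}' closing the '#symbol{...}' form
    that begins at text[start]. Raises ValueError on malformed input.

    Assumed lexical rules (hypothetical, not from any spec):
    double-quoted strings with backslash escapes; only () and {} nest.
    """
    if start >= len(text) or text[start] != "#":
        raise ValueError("expected '#' at start of extension form")
    i = text.find("{", start)
    if i == -1:
        raise ValueError("no '{' after extension tag")
    tag = text[start + 1:i]
    if not tag or any(c.isspace() for c in tag):
        raise ValueError("bad extension tag: %r" % tag)

    closers = {"(": ")", "{": "}"}
    stack = []  # closing delimiters we still expect, innermost last
    while i < len(text):
        ch = text[i]
        if ch == '"':
            # Skip the whole string so delimiters inside it are ignored.
            i += 1
            while i < len(text) and text[i] != '"':
                i += 2 if text[i] == "\\" else 1
            if i >= len(text):
                raise ValueError("unterminated string")
        elif ch in closers:
            stack.append(closers[ch])
        elif ch in ")}":
            if not stack or stack.pop() != ch:
                raise ValueError("unbalanced delimiter at index %d" % i)
            if not stack:
                return i + 1  # just past the outermost '}'
        i += 1
    raise ValueError("unterminated extension form")
```

For instance, skip_extension('#xml{"Hello <em>world</em>! }"} tail') returns the index of the space before "tail", even though the string contains a '}'. Under option (2) the skipper would also need to know the mandated quoting rules for raw text, which is where the extra complexity comes from.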

>> As for bytevectors, I'll add them using R7RS syntax.  Should we require
>> hex?  It's easier to read, if wasteful.  Chibi outputs every value in hex
>> except 0, which is exceedingly common.
>
> Do you mean this: #u8(1 2 3 4 5). I find it very nice. However, the bit
> density is low and as you say, decimal numbers cause concern.

I dimly recall being impressed by Erlang's binary literal syntax, let me
do some research and try to remember why...

https://forfunand.wordpress.com/2011/10/10/why-erlangs-binary-syntax-is-awesome/

...OK, it's because the syntax allows for a pattern-matching form with
embedded variables, which is really neat for deconstructing and
constructing fixed-width binary network packets and the like. Not
relevant here!

I guess the options are:

1) Hexadecimal
2) Strings with quoting for unprintable/delimiter characters
3) base64 or similar
4) Some hybrid, where the contents of a bytevector literal are a series
of lexically distinguishable segments that could use any of the above,
and are concatenated together. We might even drop the quoting in strings
and force any non-printable or delimiter characters to drop out of string
mode and go in hex. For example:

#u8("This is a null-terminated string" 0)

#u8("This is an embedded block of really random entropy: "
:GSVGxo89Ab6QX4D8l9KWzQ== " - I hope you like it.")

Here I have purely arbitrarily chosen ':' as a prefix to base64 values to
distinguish them from hex values.
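
As a rough illustration of what a reader for option (4) might look like, here is a sketch in Python. The details are guesses layered on the examples above, not a proposal: string segments have no escapes at all and are taken as UTF-8, a token starting with ':' is base64, and any other bare token is a single byte written as one or two hex digits. The name parse_hybrid_u8 is hypothetical.

```python
import base64

HEX_DIGITS = "0123456789abcdefABCDEF"

def parse_hybrid_u8(literal):
    """Decode a hybrid '#u8(...)' literal into bytes.

    Assumed segment kinds (illustrative only):
      "text"    string with no escapes, encoded as UTF-8
      :base64   base64 data, introduced by ':'
      hh        one byte as one or two hex digits
    """
    if not (literal.startswith("#u8(") and literal.endswith(")")):
        raise ValueError("not a #u8(...) literal")
    body = literal[4:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            i += 1
        elif ch == '"':
            end = body.find('"', i + 1)
            if end == -1:
                raise ValueError("unterminated string segment")
            out += body[i + 1:end].encode("utf-8")
            i = end + 1
        else:
            # A bare token runs until whitespace or the next string.
            j = i
            while j < len(body) and not body[j].isspace() and body[j] != '"':
                j += 1
            token = body[i:j]
            if token.startswith(":"):
                out += base64.b64decode(token[1:], validate=True)
            elif len(token) in (1, 2) and all(c in HEX_DIGITS for c in token):
                out.append(int(token, 16))
            else:
                raise ValueError("bad segment: %r" % token)
            i = j
    return bytes(out)
```

With these rules, parse_hybrid_u8('#u8("This is a null-terminated string" 0)') yields the ASCII text followed by a single zero byte. The matching encoder is the harder half, since it has to decide where to switch between segment kinds.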

(2) makes bits of a bytevector that happen to also be valid ascii or
utf-8 text "readable", but is more complicated to generate/parse and
ends up as a worse form of (1) for very unprintable stuff.

(3) is dense and simple to process for machines, but is totally
meaningless to humans.

(1) is like a watered-down form of (3), a little easier for humans to
make some sense of (as a kid hacking around under MS-DOS, I quickly
learnt to read some parts of x86 machine code in hex... the phrase "CD
21" is forever burned into my memory).

(4) is probably optimal from a human readability perspective, IF the
encoder makes smart choices about what encodings to use where, but is
the most complicated to implement.

If I had to pick one... I'd be torn between (3) because it's simple and
compact and human readability of bytevectors isn't the biggest concern,
and the original R7RS format because it'd be a shame to have yet another
standard!
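
For a feel of the density trade-off between the R7RS decimal form, hex (1) and base64 (3), here is a small Python sketch that renders the same bytes three ways. Only the decimal form is real R7RS syntax; the hex and base64 renderings are shown as bare payloads because their surrounding syntax is still undecided, and the function name render_all is made up.

```python
import base64

def render_all(data):
    """Return three textual renderings of the same bytes.

    'r7rs' is the standard R7RS bytevector literal with decimal bytes.
    'hex' and 'base64' are bare payloads only; how they would be wrapped
    in a literal is an open question.
    """
    return {
        "r7rs": "#u8(" + " ".join(str(b) for b in data) + ")",
        "hex": data.hex().upper(),
        "base64": base64.b64encode(data).decode("ascii"),
    }

if __name__ == "__main__":
    sample = bytes(range(256))
    for name, text in render_all(sample).items():
        print(name, len(text))
```

For the two bytes CD 21 this gives "#u8(205 33)", "CD21" and "zSE=". On longer inputs hex costs two characters per byte and base64 roughly four characters per three bytes, while the decimal form needs up to three digits plus a separator per byte.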

--
Alaric Snell-Pym   (M7KIT)
http://www.snell-pym.org.uk/alaric/