Re: Attempt at a stack of data formats to make everyone happy

Re: Attempt at a stack of data formats to make everyone happy Lassi Kortela 24 Sep 2019 09:29 UTC

>> https://bitbucket.org/cowan/r7rs-wg1-infra/src/default/CoreSexps.md is the
>> next stab at core S-expressions.

Thank you for writing it up. Did I review it already? Here is some
riffing on Alaric's comments:

> I'd be inclined to remove the thing that numbers outside of ranges may
> not interoperate.

I would also like to remove all hints at numerical limits from even the
simplest specs. That makes life so much simpler, because any limits we
suggest are arbitrary and tied to the particular decade in which we are
writing the spec. Implementors normally have a particular set of number
types to work with anyway, handed to them by the C compiler or Lisp
system; nothing we can write in the spec will change that set of types.

Concretely, I'd remove all mentions of "may not interoperate".

> 1. How SHOULD one represent arbitrary numbers when they crop up in the
> problem domain, then? Define a bignum format as a list of 64-bit
> integers and have code to convert between them and proper numbers? Ugh!
>
> 2. People will forget about the restriction when using systems that
> support bignums, which will work happily in their testing, but break in
> undefined ways when interoperated with arbitrary third-party systems. Ugh!

I agree with both ughs :)

IMHO we should take a page from "The Right Thing" here, and specify it
so the interface is simple at the expense of implementation complexity.
So bignums are encoded exactly the same way as fixnums, using decimal
digits.
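
To make the "same encoding for fixnums and bignums" idea concrete, here is a minimal Python sketch. The names parse_integer and write_integer are made up for illustration and are not part of any spec. It assumes an integer token is an optional sign followed by ASCII decimal digits, and it imposes no size limit of its own.

```python
def parse_integer(token: str) -> int:
    """Parse a decimal integer token of any length (hypothetical helper)."""
    body = token[1:] if token[:1] in "+-" else token
    # Accept ASCII digits only; str.isdigit() would also accept other scripts.
    if not body or not all("0" <= ch <= "9" for ch in body):
        raise ValueError(f"not a decimal integer: {token!r}")
    # Python ints are arbitrary precision, so small and huge values take
    # the same path. Note: CPython 3.11+ caps int/str conversion at 4300
    # digits by default, an implementation limit outside any spec.
    return int(token)


def write_integer(value: int) -> str:
    """Write an integer as decimal digits, whatever its magnitude."""
    return str(value)
```

A reader for a language without native bignums would have to pick its own strategy here (a bignum library or an error), but the wire format would stay the same.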

> Now, given that CoreSexps adds a new syntax #{ ... },

This syntax is quite nice, but I'd think about it some more. In
particular, Racket already has a different read syntax for hash-tables,
as does Clojure's EDN.

With curly braces, there's also the usual problem with sets vs maps.
Braces naturally represent both, so a fight ensues, and no solution is
typically ideal.

> I don't think
> there's any point in trying to make it "compatible" with (read) on any
> existing Lisp by avoiding syntax that "might cause problems"; arbitrary
> data shouldn't be fed into (read) in most cases due to syntax in very
> many lisp implementations that can execute arbitrary code!

I agree with this stance. It's nice to be compatible where possible (so
that core S-exps can be fed to "read" when working with known files you
happen to have at hand). But for systems dealing with unknown data,
under no circumstances would I recommend using (read) to read core
sexps. I mean, our own reader is going to be like 100 lines of Lisp. We
can just ship it as a library for every dialect. For the binary varint
sexps, it took half an hour to write a library for a new language! I'd
be surprised if textual ones take more than an hour per language.

That being said, we should ship a standard test suite so new
implementations can be verified quickly. Code written in half an hour is
generally not bulletproof :)

> So I think it should be "compatible with s-expressions" for *human*
> purposes (not needing to learn a new language), and perhaps to allow the
> s-expression syntax of RnRS to become a superset of it in time (we can't
> back-fill a written syntax for hash tables into R7RS now, alas) so that
> CoreSexp literals can be written as-is in RnRS programs. But trying to
> find a lowest common denominator of s-expression syntaxes is, I think, a
> flawed approach, even if we then didn't leap straight out of that subset
> by extending it with #{ ... }!

Very well thought out. I completely agree with all points.

> So my suggestion would be:
>
> 1) Take the s-expression syntax from R7RS, which IIRC has no remote code
> execution defined in the standard (as opposed to CL's); but remove the '
> ` , ,@ syntactic sugars that just expand into (quote ...) and friends
> anyway.

I also like #t and #f (which are also in John's current spec). It's not
ideal but the alternatives are much worse. I just implemented the binary
sexp library for Common Lisp and the NIL/()/false problem came up
immediately. It's always nice not to have to go there :)
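
As a rough illustration of why dedicated #t and #f tokens help, here is a Python sketch with hypothetical helper names. It assumes the reader maps #t, #f and () to three distinct host values, so the writer can round-trip them without guessing. In Common Lisp, NIL would have to stand for both false and the empty list.

```python
def read_constant(token: str):
    """Map a constant token to a distinct Python value (hypothetical helper)."""
    if token == "#t":
        return True
    if token == "#f":
        return False
    if token == "()":
        return []  # the empty list is not the same value as false
    raise ValueError(f"not a constant: {token!r}")


def write_constant(value) -> str:
    """Inverse of read_constant; booleans are checked by identity first."""
    if value is True:
        return "#t"
    if value is False:
        return "#f"
    if isinstance(value, list) and not value:
        return "()"
    raise ValueError(f"not a constant: {value!r}")
```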

> 1a) I'm not sure if we should remove improper lists from the syntax...
> It would be nice to be able to have non-Lisp implementations of this
> model able to assume that lists are proper lists and can map to their
> own list types.

This is a hard problem. Dotted pairs look and act a bit dodgy, but on
the other hand, a cons cell can be considered a fundamental building
block of hierarchical information, with not too much hyperbole.

Dotted pairs also make interoperability with just about every
non-Lisp/non-functional language harder, since those seldom have native
cons cells. In my Python reader, I just read successive conses into a
Python list and raise an error when encountering an improper list.
That's not too bad, since the writer can opt to not send any improper
lists, but it was the only tricky part in the reader.
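
The approach can be sketched roughly as follows. This is not the actual reader mentioned above, just a minimal textual illustration; tokenize and read_list are hypothetical names, atoms are left as plain strings, and string literals are not handled.

```python
def tokenize(text: str) -> list:
    """Split text into parens and atoms (no string literals handled)."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def read_list(tokens: list, pos: int = 0):
    """Read one list starting at tokens[pos]; return (items, next_pos).

    Proper lists become Python lists. A dotted pair raises ValueError,
    since Python has no native cons cell to map it to.
    """
    if pos >= len(tokens) or tokens[pos] != "(":
        raise ValueError("expected '('")
    pos += 1
    items = []
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == ")":
            return items, pos + 1
        if tok == ".":
            raise ValueError("improper (dotted) list is not supported")
        if tok == "(":
            item, pos = read_list(tokens, pos)  # nested list
        else:
            item, pos = tok, pos + 1  # atom, kept as a string
        items.append(item)
    raise ValueError("unexpected end of input inside list")
```

For example, read_list(tokenize("(a (b c) d)")) yields the nested Python list for the three elements, while "(a . b)" is rejected.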

Could we leave them out at first, write a few programs that use core
sexps, and find out if we miss them?

> 2) Add syntax for arbitrary types, perhaps of the form #NAME{ ... };
> where NAME is a registry extended via SRFIs... hash tables are
> important/common enough to claim the empty NAME and be written as #{key
> val key val}, time objects can get #time{TYPE SECONDS NANOSECONDS}, etc.

I'd also tentatively vote for #name. We're going to need full words
after the sharpsign -- one letter won't cut it :)

> 3) Define an SRFI with "safe" read and write procedures that read and
> write exactly this language, and also with a procedure to register
> arbitrary type readers/writers so the arbitrary type list can be
> extended by portable SRFI implementations.
>
> Other languages can have their own implementations like that SRFI, doing
> their best to map from our types into theirs.

+1
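
A registry like the one proposed in points 2 and 3 could look roughly like this in another language. This is a sketch only: register_type_reader, build_tagged and TimeObject are hypothetical names, and it assumes the reader has already parsed the items between the braces into a Python list.

```python
from collections import namedtuple

# Hypothetical host type for #time{TYPE SECONDS NANOSECONDS}.
TimeObject = namedtuple("TimeObject", ["type", "seconds", "nanoseconds"])

_type_readers = {}  # maps the NAME in #NAME{...} to a constructor


def register_type_reader(name: str, constructor) -> None:
    """Register a constructor for #NAME{...}; duplicates are an error."""
    if name in _type_readers:
        raise ValueError(f"type tag already registered: {name!r}")
    _type_readers[name] = constructor


def build_tagged(name: str, items: list):
    """Build a host value from the parsed contents of #NAME{...}."""
    if name not in _type_readers:
        raise ValueError(f"unknown type tag: {name!r}")
    return _type_readers[name](items)


def _hash_table(items: list) -> dict:
    # #{key val key val}: keys and values alternate.
    if len(items) % 2:
        raise ValueError("hash table needs an even number of items")
    return dict(zip(items[0::2], items[1::2]))


register_type_reader("", _hash_table)  # the empty NAME means hash table
register_type_reader("time", lambda items: TimeObject(*items))
```

Rejecting unknown tags by default keeps the reader "safe" in the sense discussed above: nothing is constructed unless a reader for that name was registered explicitly.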

>>> Here are my suggestions for rock-bottom S-expressions:
>>>
>>> Proper lists as we know them.  They might turn into vectors in non-Lisp
>>> systems.
>>>
>>> Alists as we know them.  They might turn into hashtables or dictionaries
>>> in non-Lisp systems.  We always format an alist element (1 2 3) as (1 . (2
>>> 3)).
>
> How can we tell if an alist is an alist when writing? It's all just cons
> cells and atoms... I'd prefer to use hash tables here, which can be
> unambiguously detected and written as #{ ... } under CoreSexps syntax.

I would leave out all the magic about special handling for alists.
That's domain knowledge, higher-level than this encoding IMHO.