On Wed, Sep 25, 2019 at 6:15 AM Lassi Kortela <xxxxxx@lassi.io> wrote:

Another proposal, this one for distilling a core lexical syntax:

I think we need to consider what the purpose of the core SRFI spec is.  My notion is that its most important values are simplicity, human readability, and stability.  (In contrast, the values of a binary spec are compactness, computer-friendliness, and extensibility.)  I base this on lessons learned from XML and JSON.

XML 1.0 was a fixed spec.  For reasons that still seem basically sound to me, I proposed and pushed through the W3C the XML 1.1 spec, widening the scope of identifiers to whatever Unicode allows at any given time rather than just what was allowed in 1996 (to use Romanian identifiers you had to misspell them, and Ethiopic-script identifiers were completely unusable).  There were also a few other changes, and all looked good to me.

Well, XML 1.1 was a complete flop.  There's no way to know how many documents were ever written in it, but I bet the answer is "Damned few".  I was able to get the identifier improvements put through as an erratum, stretching the definition of "erratum" to its limit, and that's it.  I feel sure that XML will never be changed again.  Even the tiny subset MicroXML that James Clark and I put together with an unofficial group (it never became a W3C spec) probably doesn't have any legs: the cost of full XML, which is at least five times bigger than MicroXML plus all the superimposed specs like namespaces, is already fully amortized.  People hate XML because of the misuse it's been put to (it was designed for documents that people would eventually read, and there is still no substitute for it for that purpose), but they use it anyway.

JSON was luckier.  It was meant for interchange and it was designed to be as simple as possible (but no simpler).  Just as XML was constrained by backward compatibility with SGML (a huge standard), so JSON was constrained by backward compatibility with JavaScript literals.  It makes no provision for versioning, and so it is forever fixed.  As a consequence, it's become very widespread: the home page at json.org lists 166 parsers in 58 languages (including Fortran, Cobol, and Prolog) and is still woefully incomplete (it lists only Racket's parser under Scheme, for instance).  But anyone can write a JSON parser, and once it's done, it's done: it will never have to be updated or rewritten in an attempt to handle more features that other systems might send it.  Those who want or need that kind of extensibility have had to create other standards, many of them.  YAML is perhaps the best known of these, and although it is backward compatible with JSON, the average YAML file looks absolutely nothing like JSON.  Even modest improvements like allowing unquoted strings as dictionary keys or adding JavaScript syntax for infinities, negative zero, and NaN never go anywhere.

So we Lispers have this S-expression technology, which combines some advantages of both its competitors.  It too started very simply: proper and improper lists, integers and floats, and symbols, period.  Now it has become very baroque.  The standard lexical syntax of Common Lisp adds numeric bases and ratios (same as Scheme) as well as complex numbers (different from Scheme).  It also has support for quotes and quasiquotes (the latter expand to arbitrary code in CL rather than a fixed representation as in Scheme), datum labels, single-line and arbitrary-length comments (but not S-expression comments), strings, vectors, bitvectors, uninterned symbols, characters, read-time conditionalization, read-time evaluation (dangerous, can be turned off), arrays, pathnames, and structures (introspectable records).  Many of these things and distinctions between them make no sense in other languages: they will be hard for other kinds of programmers to input or generate.

In short, in S-expressions, human readability has been kept, but stability exists only because CL is a stable language spec, and simplicity is gone forever.  Even then, CL allows you to randomly change the readtable and specify print-object methods, so that it's quite likely that data output in S-expression format from one CL program will be completely unintelligible to another CL program unless they have the same author.  (The Curse of Lisp, transposed to the key of data.)

Integer         [-]1234 #[box][-1]1234

I don't understand the notation for integers.  What is "box"?  What is the -1 doing there?
 
Ratio           [-]1234/5678

Although exact ratio libraries are available, very few languages have them integrated into their numeric tower, for the good reason that they don't have that many applications.  There are anecdotes about numerical Lisp programs being made 10 to 100 times faster by introducing a single decimal point at a strategic place, because it happened that the inputs were integers, so all the arithmetic was exact, with denominators getting longer and longer as the computations went on.  With at least one inexact constant, everything then ran at floating-point speeds.
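The slowdown is easy to demonstrate.  Here's a small sketch of my own (not from any of those anecdotes) using Python's fractions module: the same recurrence run exactly and inexactly, with the exact denominators compounding on every step.

```python
from fractions import Fraction

# Exact version: the denominator roughly squares on each pass, so every
# arithmetic operation gets slower as the computation goes on.
x = Fraction(1, 3)
for _ in range(10):
    x = x * x + Fraction(1, 7)

# Inexact version: a single decimal point's worth of difference, and the
# representation stays a fixed-size float throughout.
y = 1 / 3
for _ in range(10):
    y = y * y + 1 / 7

print(len(str(x.denominator)))  # hundreds of digits after only ten steps
```

After ten iterations the exact denominator is already over 900 digits long, while the float version never grows at all.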

Real is fine, except that support for infinity and NaN would be a Good Thing.  There is nothing resembling a standard here (even though the internal values are completely standard), so we might as well use our R6RS/R7RS notation.
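The gap is visible even in Python's standard json module, sketched below: a strict writer has no spelling for these values at all, and the lenient default falls back on JavaScript literals that many other parsers refuse.

```python
import json
import math

# A conforming (strict) JSON writer must reject infinities and NaN
# outright, because the grammar has no token for them.
try:
    json.dumps(math.inf, allow_nan=False)
except ValueError:
    print("strict JSON cannot encode infinity")

# Python's lenient default emits the JavaScript literal spellings
# instead, which are not standard JSON and may be rejected elsewhere;
# R7RS at least standardizes +inf.0, -inf.0, and +nan.0.
print(json.dumps([math.inf, -math.inf, math.nan]))
```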
 
Plain symbol    [A-Za-z][A-Za-z0-9-_.]*
Quoted symbol   |anything here|

The main purposes of symbols in a protocol are enumerated flags and keys in dictionaries.  Part of the lesson of XML 1.1 is that the world doesn't need names of this kind to be written in Ethiopic script.  As I pointed out before, historical quoted symbols (which existed already in Lisp 1.5 in the form $$/symbol name/, where / could be anything) were used as a poor man's strings.  We have real strings now; let's stick to them.  (Standard ASN.1 has an enumeration type, but on the wire it's just an integer with a different type code; only the schema knows the semantics of those integers, so I left it out of LER, which is intended to be schemaless.)

The reason I'd like to see a character that can only be used as the first character (I picked $, but that's arbitrary; it's often used by convention in JSON) is that in addition to dictionaries with fixed keys, there are also dictionaries with application-specific keys.  Even in these, however, some fixed keys are often necessary, things like an identifier for the dictionary, and having a conventionally reserved space to draw such meta-identifiers from is a Good Thing.

The hyphen and underscore situation is a bit different.  Lisp programmers like hyphens in identifiers, as do Cobol programmers (even though Cobol has infix minus, which means you have to use spaces around it), because our languages date back to early punch card systems where underscore did not exist.  (In early Fortran, identifiers were limited to six characters and you didn't waste any of them on internal delimiters, or rather IDLTRS.)  The younger languages, if they allow internal delimiters at all, use underscore, thus clearly separating it from infix minus.

Nobody really needs _two_ internal delimiters, so I suggest that we either allow only "-" and leave it up to non-Lisp systems to change it to "_" if they are happier with that, or allow both but warn sternly against using both foo-bar and foo_bar for different purposes.
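Under the first option, the rewrite a non-Lisp system would apply is purely mechanical; this hypothetical sketch shows it, and also shows why it is only safe under the stern warning about not using foo-bar and foo_bar for different purposes.

```python
# Hypothetical normalization rule: a reader that prefers underscores
# rewrites every hyphen in a symbol.  This is lossless only if no
# document distinguishes foo-bar from foo_bar.
def normalize_symbol(name: str) -> str:
    return name.replace("-", "_")

print(normalize_symbol("current-time"))  # current_time
print(normalize_symbol("foo_bar"))       # unchanged -- collides with foo-bar
```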
 
Improper list   (exprs . expr)

*Nobody* outside the Lisp community knows what this is.  Even languages with linked-list support internally almost always allow only a pair or () in the cdr slot.  To make it interoperate would require non-Lisp implementations to wrap their native array, vector, or list (in the case of Python) type in an opaque record type wrapper, which would block the use of native operations on it.  It's nothing but a nuisance to them.  And for data interchange as opposed to serialization, who cares about the difference between (a b) and (a . b) anyway, even in Lisp?  The extra pair is essentially free.  (I have added an improper-list type to ASN.1 LER, primarily for serialization.)
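Here's a sketch of that nuisance in Python (the Pair wrapper is hypothetical, not any library's type): a proper list maps directly onto a native list, but a dotted pair needs an opaque record, and the wrapper blocks every native operation.

```python
from dataclasses import dataclass

# Opaque wrapper a non-Lisp reader would be forced to invent for (a . b).
@dataclass
class Pair:
    car: object
    cdr: object

proper = ["a", "b"]                   # (a b)     -> native list; len(),
                                      #              iteration, etc. all work
improper = Pair("a", Pair("b", "c"))  # (a b . c) -> wrapped; none of them do

print(len(proper))
# len(improper) raises TypeError: the wrapper hides the sequence protocol
```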

Vector          #{exprs}
Character       #\a #\newline #\x1234

These also, to a lesser degree, are distinctions without a difference.  Lists vs. vectors?  Characters vs. strings of length 1?  Who cares?  One type is enough for each: general-purpose sequence and sequence-of-Unicode-codepoints.
 
Special value   #!any-plain-symbol
Special type    #any-plain-symbol{exprs}

These are good on human readability and simplicity, but not so good for stability.  Are they really safe to ignore if you don't understand them?  (I wish ASN.1 had such a must-understand flag.)  In practice, people will use these to refer to concepts that other programmers don't get and will misunderstand or misuse.  Worse, one group may use #hash and another #dict, and who will know that they are the same concept?  As mentioned above, when a JSON reader or writer is finished, it's *finished*, and that's a great thing.  A list whose first element is a meta-symbol like $hash should be good enough; if you don't expect a list in a particular place, you blow up at the semantic level rather than in the parser.  A registry would then just be a matter of convenience ("is there a usual way to represent this?") rather than a necessity.
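A sketch of that convention ($hash is an illustrative meta-symbol, not a registered name): the reader hands back a plain list, and only the semantic layer decides whether it denotes a dictionary, so an unrecognized head fails at the application level rather than in the parser.

```python
# Interpret a parsed S-expression at the semantic level.  The parser
# itself never needs to know about $hash; a consumer that doesn't
# expect a list here simply rejects the value.
def interpret(expr):
    if isinstance(expr, list) and expr[:1] == ["$hash"]:
        items = expr[1:]
        if len(items) % 2 != 0:
            raise ValueError("$hash needs an even number of elements")
        return dict(zip(items[0::2], items[1::2]))
    return expr  # anything else passes through untouched

print(interpret(["$hash", "name", "LER", "schemaless", "yes"]))
```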

However, I think JSON has shown that dictionaries are important enough that they should be first-class.  I currently recommend {key value ...}. 

Line comments should be provided: they are one of the things often asked for in JSON.



John Cowan          http://vrici.lojban.org/~cowan        xxxxxx@ccil.org
Evolutionary psychology is the theory that men are nothing but horn-dogs,
and that women only want them for their money.  --Susan McCarthy (adapted)