> I think if we want to allow uninterested readers to skip, we need to
> decide how much to constrain the power of custom parsers for custom
> types, lest finding the end of the expression turn out to be arbitrarily
> complicated! This could either be:
>
> 1) Mandate that the syntax of the stuff in {} is actually the same as a
> normal sexpr list () - a space-delimited list of sexprs.
>
> 2) Make it an arbitrary string, but with rules about how embedded }s are
> represented - some combination of quoting mechanisms and saying that
> balanced pairs of {} are OK.
>
> I prefer the former.

Me too. IMNSHO, option 1 is a thousand times preferable to option 2 :)
(There's a sketch of why at the end of this message.)

> The latter might be better for really complex
> grammars, if they can be bent to work with the mandated quoting rules -
> I'm struggling to think of good examples, perhaps #xml{Hello
> <em>world</em>!} would be handy for people writing Web applications. But
> the former can do something nearly as good: #xml{"Hello <em>world</em>!"}.

Embedded XML is a good illustration of why embedding custom lexers quickly
turns things chaotic. (And XML is not the worst example by any means - it
can and will get more baffling. Mix PHP, HTML, CSS and/or JavaScript in
the same file and things get real :)

Not to judge, but there is a segment of people who absolutely love lexers
and parsers, and will jump at the chance to re-invent a poorly-thought-out
custom syntax for the simplest of jobs. Then the rest of us have to
scratch our heads over how to parse those things and how to implement
basic tools like code walkers and syntax highlighters.

So based on both principle (complexity budget) and repeated experience,
I'd be strongly in favor of having very simple core S-exps, building any
complex types by nesting those, and banhammering all custom lexer
extensions from the format :) Let people keep any lexer extensions in
their own Lisp code, where they are the only ones who will have to deal
with them.

> I dimly recall being impressed by Erlang's binary literal syntax, let me
> do some research and try to remember why...
>
> https://forfunand.wordpress.com/2011/10/10/why-erlangs-binary-syntax-is-awesome/

That is quite impressive as a pattern-matching language.

> I guess the options are:
>
> 1) Hexadecimal
> 2) Strings with quoting for unprintable/delimiter characters
> 3) base64 or similar
> 4) Some hybrid, where the contents of a bytevector literal are a series
> of lexically distinguishable segments that could use any of the above,
> and are concatenated together. Might even drop the quoting in strings
> and force any non-printable or delimiter characters to drop out of string
> mode and go in hex. #u8("This is a null-terminated string" 0). #u8("This
> is an embedded block of really random entropy: "
> :GSVGxo89Ab6QX4D8l9KWzQ== " - I hope you like it.") - where I have
> purely arbitrarily chosen ':' as a prefix to base64 values to
> distinguish them from hex values.
>
> (2) makes bits of a bytevector that happen to also be valid ascii or
> utf-8 text "readable", but is more complicated to generate/parse and
> ends up as a worse form of (1) for very unprintable stuff.

The thing is that character encoding is easily messed up by running
"iconv". (This problem concerns ordinary strings too.) Permitting
arbitrary text strings in bytevector literals may be desirable (I don't
really have an opinion), but if it is permitted, I'd suggest limiting the
character set to printable ASCII, which is hard to mess up via charset
conversion and text editing.
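For what it's worth, that constraint is also trivial to check. A little
Python sketch (helper name hypothetical, and assuming "printable ASCII"
means the range 0x20-0x7E):

    def printable_ascii_only(segment: str) -> bool:
        # Printable ASCII runs from 0x20 (space) to 0x7E (tilde);
        # everything in that range survives charset conversion and
        # careless text editors.
        return all(0x20 <= ord(ch) <= 0x7E for ch in segment)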
> (3) is dense and simple to process for machines, but is totally
> meaningless to humans.

Then again, hex is almost as meaningless unless you have ascended :)

> (1) is like a watered down form of (3), a little easier for humans to
> make some sense of (as a kid hacking around under MS-DOS, I quickly
> learnt to read some parts of x86 machine code in hex... the phrase "CD
> 21" is forever burned into my memory)

Yes. Hex is like decaffeinated base64.

> (4) is probably optimal from a human readability perspective, IF the
> encoder makes smart choices about what encodings to use where, but is
> the most complicated to implement.

The challenge is to keep the bit density of hex or base64 while also
permitting human-readable ASCII strings to be interspersed with them.

The obvious notation doesn't work, because a hex segment is
indistinguishable from a string segment:

#u8("hello" "0d0a" "world")

We could invent a prefix for hex strings, but that's a bit dodgy:

#u8("hello" #"0d0a" "world")
#u8("hello" #u8"0d0a" "world")

Or just give up on bit density and go back to Scheme hex numbers:

#u8("hello" #x0d #x0a "world")

The thing is that readability and parsing speed also suffer under the
denser encodings. The lost bit density of the last form is a lesser
problem than those, IMHO.
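To make that last form concrete: the reader for it stays close to
trivial. A Python sketch (names hypothetical; assumes only two segment
kinds, printable-ASCII strings without escapes and #x.. byte values):

    import re

    # One token per segment: a quoted string or a #x byte value.
    _SEGMENT = re.compile(r'"([^"]*)"|#x([0-9A-Fa-f]{1,2})')

    def read_u8_body(body: str) -> bytes:
        """Decode the contents of a #u8(...) literal into bytes."""
        out = bytearray()
        for string_part, hex_part in _SEGMENT.findall(body):
            if hex_part:
                out.append(int(hex_part, 16))
            else:
                out.extend(string_part.encode('ascii'))
        return bytes(out)

    # read_u8_body('"hello" #x0d #x0a "world"') == b'hello\r\nworld'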
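And the "smart choices" in (4) don't have to be rocket science either: a
greedy split into printable runs versus lone hex bytes already gets most
of the way there. Another hypothetical sketch, the inverse of the reader
above:

    def write_u8_literal(data: bytes, min_run: int = 4) -> str:
        """Encode bytes as a #u8(...) literal: runs of printable ASCII
        become strings, everything else becomes #x bytes. min_run stops
        short printable runs from flip-flopping between encodings."""
        parts, i = [], 0
        while i < len(data):
            j = i
            # Extend a run of printable ASCII; '"' and '\' are excluded
            # so the emitted strings never need escaping.
            while (j < len(data) and 0x20 <= data[j] <= 0x7E
                   and data[j] not in (0x22, 0x5C)):
                j += 1
            if j - i >= min_run:
                parts.append('"%s"' % data[i:j].decode('ascii'))
                i = j
            else:
                parts.append('#x%02x' % data[i])
                i += 1
        return '#u8(%s)' % ' '.join(parts)

    # write_u8_literal(b'hello\r\nworld')
    #   == '#u8("hello" #x0d #x0a "world")'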
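Finally, the sketch I promised about option 1. Under option 1, "find the
end of #foo{...}" needs no type-specific knowledge at all: the skipper is
just the ordinary sexpr reader with everything but delimiter and string
handling thrown away. A Python sketch (name hypothetical), assuming
"-delimited strings with \ escapes are the only tokens that may contain
stray braces:

    def skip_braced(text: str, i: int) -> int:
        """Return the index just past the '}' matching the '{' at
        text[i], without knowing anything about the custom type."""
        assert text[i] == '{'
        depth, in_string = 0, False
        while i < len(text):
            ch = text[i]
            if in_string:
                if ch == '\\':
                    i += 1           # skip the escaped character
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch in '({':
                depth += 1
            elif ch in ')}':
                depth -= 1
                if depth == 0:
                    return i + 1
            i += 1
        raise ValueError("unterminated {...} block")

Under option 2 this function can't exist: every custom type's quoting
rules would have to be taught to every tool that merely wants to skip
past it.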