I agree with the peril of feature bloat.
On the other hand, the problem with an S-expression declare-file form
is that it conflates meta-information with content.  Suppose you have a
config file whose format is a sequence of arbitrary S-expressions.  How do
you attach an encoding declaration to it?  Note that a config directive may
happen to begin with the symbol 'declare-file' under the application's own
semantics.
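To illustrate the ambiguity, here is a sketch of such a config file (the directive names are invented for the example):

```scheme
;; A hypothetical config file whose format is "a sequence of
;; arbitrary S-expressions".  The first form is an ordinary,
;; application-specific directive that merely *happens* to begin
;; with the symbol declare-file:
(declare-file "output.log" (rotate daily))   ; application semantics
(max-connections 64)
(log-level debug)
;; A reader that treats any leading (declare-file ...) form as file
;; metadata would silently misinterpret the first directive.
```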

READ should read arbitrary (syntactically valid) S-expressions without
interpreting their semantics; interpretation is up to the caller of READ.

The #!<identifier> syntax and the SRFI 10 #,(ctor ...) syntax exist
specifically to communicate with READ out-of-band from the S-expressions.
If we want to piggy-back on an existing mechanism, we can use either one of
them.  The encoding information, however, is trickier than #!fold-case or
#,(constructor arg ...), since you can't read it as text until you know
the encoding.  So even if we adopt one of the existing reader syntaxes,
encoding recognition will likely be implemented separately from the
existing reader syntax handling mechanism.
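For reference, the two existing out-of-band channels look like this; the #!coding=utf-8 line is the hypothetical directive under discussion, not an existing standard:

```scheme
#!fold-case            ; R7RS reader directive: fold symbol case from here on
#,(my-record 1 2)      ; SRFI 10: reader invokes a registered constructor
;; An encoding declaration in the same style would have to be decoded
;; before the file's encoding is known -- the chicken-and-egg problem
;; described above:
#!coding=utf-8         ; hypothetical, not part of any standard
```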

Well, if there's not much agreement on this, I don't see that it's worth
standardizing; we can probably just stick with UTF-8.



On Sun, May 12, 2019 at 9:44 AM Lassi Kortela <xxxxxx@lassi.io> wrote:
> - Probably we want something like #!coding=<value>, rather than just
> #!<identifier>.  E.g. #!coding=utf-8 instead of #!utf-8.
>    The latter can eat up #!-namespace quickly.

This is why I suggested using ordinary S-expressions. When we start with
a simple syntax like #!symbol, eventually we'll want #!symbol=value.
Then #!symbol1=value1&symbol2=value2 would be nice. Later we want nested
data and end up with a contrived alternative syntax for S-expressions,
when we could just have used them from the start :)
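The escalation might look like this; apart from #!r6rs, every directive below is invented for illustration:

```scheme
#!r6rs                                    ; step 1: bare identifier
#!coding=utf-8                            ; step 2: identifier=value
#!coding=utf-8&fold-case=no               ; step 3: multiple named fields
#!coding=utf-8&deps=(srfi-1,srfi-13)      ; step 4: ad-hoc nested data
;; ...versus just using the notation we already have:
(declare-file (coding utf-8) (fold-case no) (deps (srfi 1) (srfi 13)))
```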

This is like a syntax version of Zawinski's Law ("Every syntax expands
until it can encode nested lists and named fields") or Greenspun’s Tenth
Rule ("Any sufficiently complicated custom syntax contains an ad-hoc
notation for S-expressions").

Similar comments have been made about network protocols. "I know: I'll
encode only the fields I need. That way it will be simple." A decade
later, extensions are buggy and hairy. The notoriously complex RFC 822
could have just been an S-expression.

I have made this mistake as many times as anyone. "I know, this time I'll
get it _right_! I'll only specify what I need. It will be simple!" After
failing every time, I have now mostly learned my lesson, and no longer
use any format that doesn't have nested lists and named fields :)

An even more insidious thing is that when you have a limited syntax, you
don't think about useful new extensions, or subconsciously ignore them
because you feel how painful it would be to extend the syntax. So useful
tools do not get built for decades. For example, adding nested data to
the #!a=b&c=d syntax would probably make sense, but the syntax would
look so strained that people subconsciously dismiss any such thought.

> One issue of S-expression metadata is that S-expression reader is more
> involved than a simple finite automata

S-expressions are more code to implement up-front than a custom format
(if you already have a regexp library for the latter), but they are more
uniform and reliable and the syntax will never get more complex no
matter how much it's extended. So you pay the cost only once.

> - Recognizing coding[=:]<value> (without #!) can work with the editor.
> In Emacs, adding -*- coding: <coding> -*- immediately
>    switches the buffer encoding automatically.

It's true that this is a major advantage of the regexp-pattern approach.
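For example, a Scheme file carrying the Emacs-style magic comment; the -*- ... -*- cookie is an existing Emacs convention, and recognizing it from a Scheme implementation would be the new part:

```scheme
;;; -*- coding: utf-8 -*-
;; Emacs switches the buffer's coding system as soon as it sees the
;; cookie above.  A simple regexp-based detector in a Scheme
;; implementation could recognize the same line before decoding the
;; rest of the file.
(define (greet name)
  (string-append "héllo, " name))
```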

In Emacs one can use 'auto-coding-functions' and 'auto-coding-alist' to
add detectors. Conceivably a detector could come with the standard
scheme-mode.

A magic comment can also be shared with other languages. But there is no
principled standard syntax to which all/most magic comments conform, and
it's unlikely that such a standard will emerge.

> But I agree that it's getting moot as utf-8 dominates.   The only
> concern is that, since R7RS-small doesn't require full unicode
> support, the R7RS implementation that only supports ascii needs a way to
> reject source code using greek-lambda gracefully.

The spread of UTF-8 does indeed make our lives easier :)

The Unicode lambda could come as a macro from (import (srfi xyz)).
Non-Unicode implementations would not support that SRFI, so the import
would fail before parsing gets to the lambda.
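A sketch of that failure mode; (srfi xyz) is left unspecified as in the text above, and it is only assumed here that such a SRFI would export λ as an alias for lambda:

```scheme
;; Hypothetical program using the Unicode lambda via a SRFI import.
;; (srfi xyz) stands for whichever SRFI would export λ; no specific
;; SRFI number is being asserted.
(import (scheme base)
        (srfi xyz))              ; an ASCII-only implementation fails here...

(define square (λ (x) (* x x)))  ; ...before it ever parses this λ
```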