Re: Unicode lambda - Simplelists

Re: Unicode lambda Lassi Kortela 14 May 2019 08:24 UTC

> I agree with the peril of feature bloat.
> On the other hand, the problem of using S-expression declare-file form
> is that it conflates meta-information into contents.  Suppose you have a
> config file whose format is a sequence of arbitrary S-expressions.  How do
> you attach encoding declaration to it?  Note that the config directive
> may happen to begin with a symbol 'declare-file', in application-specific
> semantics.

You are absolutely correct. Starting a file with a particular Lisp form
means that that form becomes part of the "schema" of any file format it
touches and the schema parser has to take it into account (even if only
to ignore it). (In Scheme/Lisp code, the "schema" would be extended by
defining a macro.) That's the main drawback of this approach, and that's
why it should be optional (with the default coding being UTF-8 or
whatever the implementation default is).
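
To make the drawback concrete, here is a minimal Python sketch of a
consumer that has to account for the form. It assumes the forms have
already been read into nested lists of strings; the helper name
split_declaration and that list representation are hypothetical and not
something proposed in this thread.

```python
def split_declaration(forms):
    """Separate an optional leading declare-file form from the payload.

    `forms` is assumed to be a list of already-read forms, where a list
    form is a Python list and a symbol is a Python string.
    Returns (declaration_arguments, remaining_forms).
    """
    first = forms[0] if forms else None
    if isinstance(first, list) and first[:1] == ["declare-file"]:
        # The schema parser must know about this form, if only to skip it.
        return first[1:], forms[1:]
    # No declaration: everything is payload.
    return [], forms
```

A config format whose own vocabulary happens to include a declare-file
directive would be misread by this helper, which is exactly the objection
quoted above.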

> READ should read arbitrary (syntactically valid) S-expressions, without
> interpreting their semantics, because interpretation is up to the caller
> of READ.

Isn't reading an entire source file different from reading one
S-expression though? When you read a source file you have to support
shebang lines (#!) and encoding declarations. When you call READ, you
can assume that the textual port already has the correct encoding and
you don't need to worry about a shebang line.

So read a source file = parse shebang line and encoding from a binary
port + convert to textual port + call normal READ in a loop until EOF.
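
Those steps could be sketched roughly as follows, in Python for brevity.
Everything here is an assumption made for illustration: the
(declare-file (encoding NAME)) syntax, the shebang heuristic, the 256-byte
search window, and the toy read function, which only understands symbols
and nested lists.

```python
import io
import re

# Hypothetical declaration syntax, assumed for illustration only.
DECL = re.compile(rb"\(declare-file\s+\(encoding\s+([A-Za-z0-9_.-]+)\)\s*\)")

EOF = object()    # returned by read() at end of input
CLOSE = object()  # internal marker for a closing parenthesis


def open_source(raw, default="utf-8"):
    """Binary data in, textual port out."""
    # 1. Drop a shebang line. Treating only "#!/" and "#! /" as a shebang
    #    is a heuristic so that directives like #!fold-case are kept.
    if raw.startswith(b"#!/") or raw.startswith(b"#! /"):
        newline = raw.find(b"\n")
        raw = b"" if newline < 0 else raw[newline + 1:]
    # 2. Look for the declaration in the raw bytes, before any decoding.
    match = DECL.search(raw[:256])
    encoding = match.group(1).decode("ascii") if match else default
    # 3. Convert to a textual port with that encoding.
    return io.StringIO(raw.decode(encoding))


def read(port):
    """Toy stand-in for READ: one datum per call, EOF at end of input."""
    ch = port.read(1)
    while ch and ch.isspace():
        ch = port.read(1)
    if not ch:
        return EOF
    if ch == ")":
        return CLOSE
    if ch == "(":
        items = []
        while True:
            item = read(port)
            if item is EOF:
                raise SyntaxError("end of file inside a list")
            if item is CLOSE:
                return items
            items.append(item)
    atom = ch
    while True:
        pos = port.tell()
        ch = port.read(1)
        if not ch or ch.isspace() or ch in "()":
            if ch:
                port.seek(pos)  # put the delimiter back for the next call
            return atom
        atom += ch


def read_source(raw):
    """Shebang + encoding from bytes, then normal READ in a loop until EOF."""
    port = open_source(raw)
    forms = []
    while (datum := read(port)) is not EOF:
        if datum is CLOSE:
            raise SyntaxError("unbalanced closing parenthesis")
        forms.append(datum)
    return forms
```

Note that read itself never sees the shebang line and never deals with
bytes; only open_source does.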

> The encoding information, however, is trickier than
> #!case-fold or #,(constructor arg ...), since you can't read it as
> text until you know the encoding.  So even if we adopt one of the
> existing reader syntaxes, encoding recognition will likely be
> implemented separately from the existing reader syntax handling
> mechanism.

Exactly. It's a chicken-and-egg problem :) S-expressions look and feel
different from magic comments but in this sense they are not. We could
put an encoding tag in S-exprs, XML (as John had in that blog post - in
fact, XML already has a standard way to write an encoding attribute in
XML itself), JSON, CSV or anything else and we'd have this same basic
problem we have with parsing comments - no better, no worse.
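
In practice the way out is the same for every notation: the declaration
is kept to ASCII characters, so it can be found in the raw bytes of any
ASCII-compatible encoding before the real decoding starts. A small Python
check of that property, where the declaration text is only an assumed
example syntax:

```python
# Assumed example declaration; any ASCII-only tag behaves the same way.
DECLARATION = "(declare-file (encoding euc-jp))"


def ascii_transparent(encoding):
    """True if the declaration encodes to the same bytes as in plain ASCII.

    When this holds, a byte-level search can find the declaration without
    knowing the file's encoding in advance.
    """
    return DECLARATION.encode(encoding) == DECLARATION.encode("ascii")
```

This holds for encodings such as UTF-8, EUC-JP and Latin-1, but not for
UTF-16, where a byte-level search for the ASCII text fails regardless of
whether the tag is a comment, an S-expression or an XML attribute.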

The point of putting it in S-exprs is that since we use them for code
anyway, we might as well extend what we already have. Similarly, if we
were already using JSON we could declare the encoding in the JSON object
(in fact we would have to, since JSON doesn't support comments).

I think magic comments would be the better choice (for compatibility
reasons) if there was a push to standardize the format of all magic
comments across popular languages so they can be parsed robustly, but
alas there is not. With S-exprs we at least have something principled.

> The #!-identifier, and srfi-10 #,ctor syntax, specifically exist to
> communicate with READ out-of-band from the S-expressions.  If we want to
> piggy-back on an existing mechanism, we can use either one of them.

This is true. Given Marc's comment about float precision elsewhere in the
thread, it might be good if #! were allowed to take a list instead of a
symbol. In that case we could have #!(encoding euc-jp).

The main problem with #! is that it can occur anywhere in the file, so
if encoding comes from #! then it can change in the middle of the file.
(This probably doesn't make sense for any practical purpose, so the
reader could raise an error when it gets the second encoding tag in the
same file.) I would perhaps avoid putting the encoding (and other things
that are meant to affect an entire stream, never only one part of it) in
#! because it "gives the wrong signal" to users about what is possible
to do with it.
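
The "raise an error on the second tag" rule could look something like this
Python sketch. The #!(encoding NAME) syntax is the one floated above; the
function name and the byte-level regex scan are my own simplifications, and
the scan would wrongly match a tag that appears inside a string or comment.

```python
import re

# Matches the proposed #!(encoding NAME) tag in raw bytes.
ENCODING_TAG = re.compile(rb"#!\(encoding\s+([^\s()]+)\)")


def file_encoding(raw, default="utf-8"):
    """Return the one encoding declared for the whole file.

    Raises ValueError if the file carries more than one encoding tag,
    since an encoding is meant to apply to the entire stream.
    """
    names = ENCODING_TAG.findall(raw)
    if len(names) > 1:
        raise ValueError("more than one #!(encoding ...) tag in the same file")
    return names[0].decode("ascii") if names else default
```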

With #!(encoding ...) one might also change the encoding in the middle
of a REPL session, but I don't know if that makes sense either. A
terminal is supposed to have the same encoding constantly.

> Well, if there's not so much agreement on this, I don't see its worth to
> standardize; we'll probably be able to stick with utf-8.

Yes. With the spread of UTF-8, this flexibility seems vaguely like we
would be creating more problems and unnecessary complexity for future
implementors :) Maybe it's best to keep doing what we are doing now, not
write any encoding SRFI, and wait a few years until UTF-8 has completely
taken over and R8RS can mandate UTF-8 source code. The practical
situation already is that most code and terminals are UTF-8 (or plain
ASCII). I'll drop my suggestion.

> communicate with READ out-of-band from the S-expressions.

Just to clarify this point - I had thought of the declare-file form
mainly for other purposes; the encoding is just one little thing that it
could have. I probably presented my thoughts in a confusing way because
the emphasis has been on encodings in this discussion. I would not
specify a declare-file form if the _only_ thing it did was to give the
encoding. Rather, it would be a mechanism for specifying many different
kinds of things about a file (many of which we cannot yet anticipate,
and that's the point - it would be useful to have a standard place to
declare things at file scope with room for arbitrary extensions, many of
which could be implementation-specific or even project-specific).

Most such metadata is not really meant to be read out-of-band from the
normal reader (the encoding declaration would be the only such thing I
can think of). A declare-file form would be a valid S-expression in
Scheme's normal syntax so the reader could just read it as normal
(assuming the corresponding macro is defined). If #!(list ...) forms are
allowed then perhaps there could be an alternate version #!(declare-file
...) for cases where a version that doesn't add a form to the READ results
is wanted. In fact, it may be a good idea.

The more ideas we throw around about all this reader stuff, the more I
grow to like the proposed #!(list ...) read syntax :) We could specify
it so that #!foo is equivalent to #!(foo). I think we should permit an
arbitrary form inside it since it may be useful to have extensions and
we already have a full reader at our disposal. There are probably some
details that will cause problems if absolutely all Scheme syntax is
permitted inside it but we can map those out.
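
As a rough illustration of the "#!foo is equivalent to #!(foo)" rule, here
is a Python sketch that normalizes both spellings to the same list. The
function name is hypothetical, and the sketch deliberately handles only
flat lists of symbols; nested forms are rejected rather than parsed.

```python
def parse_directive(text):
    """Normalize a #! directive to a list of symbols (as strings).

    Both "#!foo" and "#!(foo)" give ["foo"]; "#!(encoding euc-jp)"
    gives ["encoding", "euc-jp"].
    """
    if not text.startswith("#!"):
        raise ValueError("not a #! directive")
    body = text[2:].strip()
    if body.startswith("(") and body.endswith(")"):
        inner = body[1:-1]
        if "(" in inner or ")" in inner:
            raise ValueError("nested forms are not handled in this sketch")
        return inner.split()
    if not body or any(c in body for c in "() \t\n"):
        raise ValueError("malformed directive")
    # Bare symbol: treat #!foo as shorthand for #!(foo).
    return [body]
```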