On Mon, Oct 21, 2019 at 1:19 PM Shiro Kawai <xxxxxx@gmail.com> wrote:

> So I suggest to keep encoding directive simple, parsable with a small finite automaton without lookahead.

Agreed. That makes detecting the encoding tantamount to a known-plaintext attack, and since there is no attempt at secrecy, that's easy. See <http://recycledknowledge.blogspot.com/2005/07/hello-i-am-xml-encoding-sniffer.html> for an algorithm for sniffing XML encodings, where a declaration of the form <?xml encoding="blahblah"?> (but with some additional bells and whistles possible) is not required if the encoding is UTF-8, any UTF-16 variant, or any UTF-32 variant.
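
To make the known-plaintext idea concrete, here is a minimal sketch in Python. It is not the algorithm from the page linked above, only a simplified illustration: it checks for a byte-order mark, then tests whether the file starts with an assumed known prefix ("<?xml" here) encoded in each candidate encoding. The function name, the candidate list, and the UTF-8 fallback are my own assumptions.

```python
import codecs

# Byte-order marks, UTF-32 first so that FF FE 00 00 is not taken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Candidate encodings tried when there is no BOM (an assumed, non-exhaustive list).
_CANDIDATES = ["utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le", "utf-8"]


def sniff_encoding(data: bytes, known: str = "<?xml", default: str = "utf-8") -> str:
    """Guess the encoding of `data` from a BOM or from a known leading string."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # Known plaintext: the file is assumed to begin with `known`, so compare
    # the leading bytes against `known` encoded in each candidate encoding.
    for name in _CANDIDATES:
        if data.startswith(known.encode(name)):
            return name
    return default
```

For example, sniff_encoding('<?xml version="1.0"?>'.encode("utf-16-le")) picks "utf-16-le" without any BOM, because the first bytes can only be "<?xml" in that encoding.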

> Does #!encoding count?  Or should it be ignored?

Ignored unless it is on a line by itself and as early as possible.  Except in EBCDIC files (Crom forbid it!), no non-ASCII characters should appear before it.
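
A rough sketch of that rule in Python, under loudly stated assumptions: the thread does not fix the directive's grammar, so the form "#!encoding NAME" is hypothetical, and "as early as possible" is read here as "within the first max_lines lines" (default 2, to allow a script line first). The sketch also assumes an ASCII-compatible encoding and does not handle EBCDIC or UTF-16.

```python
import re

# Hypothetical syntax: "#!encoding NAME" alone on a line (not specified in the thread).
_DIRECTIVE = re.compile(rb"#!encoding[ \t]+([A-Za-z][A-Za-z0-9._:-]*)[ \t]*")


def find_encoding_directive(data: bytes, max_lines: int = 2):
    """Return the declared encoding name, or None if no valid directive is found."""
    for line in data.split(b"\n")[:max_lines]:
        line = line.rstrip(b"\r")
        if not line.isascii():
            # A non-ASCII byte appears before any directive: ignore the rest.
            return None
        match = _DIRECTIVE.fullmatch(line)
        if match:
            # fullmatch enforces "on a line by itself".
            return match.group(1).decode("ascii")
    return None
```

With these assumptions, a directive that shares a line with code, or one that follows a line containing non-ASCII bytes, is ignored.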

That said, UTF-8 is all but universal now: it accounts for about 95% of all web documents, though rather less on particular local systems.



John Cowan          http://vrici.lojban.org/~cowan        xxxxxx@ccil.org
If I have seen farther than others, it is because I was looking through a
spyglass with my one good eye, with a parrot standing on my shoulder. --"Y"