So I suggest keeping the encoding directive simple, parsable by a small finite automaton without lookahead.
Agreed. That makes it tantamount to a known-plaintext attack, and since there is no attempt at secrecy, that's easy. See <http://recycledknowledge.blogspot.com/2005/07/hello-i-am-xml-encoding-sniffer.html> for an algorithm for sniffing XML encodings, where a declaration of the form <?xml encoding="blahblah"?> (but with some additional bells and whistles possible) is not required if the encoding is UTF-8, any UTF-16 variant, or any UTF-32 variant.
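To make the known-plaintext idea concrete, here is a rough sketch in Python. It is not the algorithm from the post linked above, only the general shape of such a sniffer: look for a byte-order mark, then for the known first characters "<" or "<?" as they would appear in UTF-16 and UTF-32, and only then read a declared name from an ASCII-compatible declaration. The function name sniff_xml_encoding and the 1024-byte window are my own choices, and EBCDIC is left out entirely.

```python
import re

# Byte-order marks, longest first so UTF-32LE is not mistaken for UTF-16LE.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

# The known plaintext "<" or "<?" as encoded without a BOM.
PREFIXES = [
    (b"\x00\x00\x00<", "UTF-32BE"),
    (b"<\x00\x00\x00", "UTF-32LE"),
    (b"\x00<\x00?", "UTF-16BE"),
    (b"<\x00?\x00", "UTF-16LE"),
]

# An ASCII-compatible declaration carrying an encoding name.
DECL = re.compile(
    rb"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    for prefix, name in PREFIXES:
        if data.startswith(prefix):
            return name
    m = DECL.match(data[:1024])
    if m:
        return m.group(1).decode("ascii")
    # No BOM, no wide-character pattern, no declared name: assume UTF-8.
    return "UTF-8"
```

Note that nothing here needs a declaration for UTF-8, UTF-16, or UTF-32 input; the declared name only matters for the other ASCII-compatible encodings.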
Does #!encoding count? Or should it be ignored?
Ignored unless it is on a line by itself and as early as possible. Except in EBCDIC files (Crom forbid it!), no non-ASCII characters should appear before it.
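As a sketch of that rule, again in Python and with loud assumptions: the spelling "#!encoding NAME" is hypothetical (nothing above fixes the exact syntax), and I read "as early as possible" as "within the first two lines", so that a script-interpreter line can come first. The names find_encoding_directive and max_lines are illustrative only.

```python
import re

# Hypothetical syntax: "#!encoding NAME" with nothing else on the line.
DIRECTIVE = re.compile(
    rb"#!encoding[ \t]+([A-Za-z][A-Za-z0-9._-]*)[ \t]*\r?"
)


def find_encoding_directive(data: bytes, max_lines: int = 2):
    """Return the declared encoding name, or None if no directive counts."""
    # Only the first max_lines lines are candidates (an assumption).
    lines = data.split(b"\n", max_lines)[:max_lines]
    for line in lines:
        # No non-ASCII bytes may appear at or before the directive.
        if not line.isascii():
            return None
        # fullmatch enforces "on a line by itself".
        m = DIRECTIVE.fullmatch(line)
        if m:
            return m.group(1).decode("ascii")
    return None
```

A directive sharing its line with other text, appearing after the second line, or following a non-ASCII byte is simply ignored, and the caller falls back to whatever default applies.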
That said, UTF-8 is as near as not universal now: it accounts for 95% of all web documents, though rather less on particular local systems.
John Cowan
http://vrici.lojban.org/~cowan xxxxxx@ccil.org
If I have seen farther than others, it is because I was looking through a
spyglass with my one good eye, with a parrot standing on my shoulder. --"Y"