On Sun, Dec 12, 2021 at 7:43 PM Ray Dillinger <xxxxxx@sonic.net> wrote:
 
> Unicode is IMO horrible when considered as a binary interface to anything.

Since it was never built for that, it's hardly surprising.
> Even its basic control characters are horrible if allowed in program text: for example

The only well-defined 8-bit control characters in Unicode are TAB, CR, LF, and NEL; everything else is mapped in from other character sets, and can be changed without affecting the Unicode status of the rest.
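
As a sketch of what that implies for filtering program text, assuming an R7RS Scheme whose characters span the full Unicode range (suspect-control? is an illustrative name, not anything standard): flag every C0 control, DEL, or C1 control except the four with fixed Unicode semantics.

    (define (suspect-control? ch)
      ;; C0 controls (U+0000..U+001F), DEL (U+007F), C1 controls (U+0080..U+009F)
      (let ((cp (char->integer ch)))
        (and (or (< cp #x20) (<= #x7F cp #x9F))
             ;; TAB, LF, CR, and NEL (U+0085) are the well-defined ones
             (not (memv cp '(#x09 #x0A #x0D #x85))))))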
> The only way to defend against this would be to require valid source code to have bidirectional nesting strictly isomorphic to the nesting of the program parse tree.

Or to stop paying attention to how code is rendered and attend solely to what codepoints it contains, which is what happens in practice.  On the other hand, identifiers are both linguistic and non-linguistic at the same time, and that's why they need Unicode.  (I read somewhere that one reason Java became popular in Japan before it did elsewhere was that identifiers could be written in actual orthography, not just romanization or katakana-ization.)
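
If you work solely at the codepoint level, one cheap defense is to reject the bidirectional control characters outright rather than police their nesting.  A minimal R7RS sketch along those lines (bidi-clean? and the listing are illustrative, not from any standard library):

    (define bidi-controls
      ;; Explicit bidirectional formatting characters: the embedding,
      ;; override, and isolate controls plus the directional marks.
      (map integer->char
           '(#x202A #x202B #x202C #x202D #x202E   ; LRE RLE PDF LRO RLO
             #x2066 #x2067 #x2068 #x2069          ; LRI RLI FSI PDI
             #x200E #x200F #x061C)))              ; LRM RLM ALM

    (define (bidi-clean? str)
      ;; #t iff the source text contains none of the controls above.
      (let loop ((i 0))
        (or (= i (string-length str))
            (and (not (memv (string-ref str i) bidi-controls))
                 (loop (+ i 1))))))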
> My opinion is that if you're going to have Unicode strings, you can't use them for any purpose other than representing language.  If you can, then what you have is, by definition, non-conforming to the Unicode standard.  It would allow low surrogates to appear by themselves or in the wrong order with high surrogates, or allow them to appear in UTF-32, or allow non-character values to occupy positions in the string, or allow strings to be divided in the middle of a grapheme cluster, or allow you to make non-normalized strings, or any of a thousand other things that result in a bogus, non-conforming string.

I agree with all points but the last.  There are multiple levels of normalization, and to make all but one impossible is a grave mistake.  The same is true of so-called grapheme clusters: which clusters are appropriate and which are not depends on the language being used.
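
For the normalization point, consider the same user-visible character as two different conforming codepoint sequences; a string type that silently forces one normalization form makes the other unwritable.  In R7RS syntax, again assuming full-Unicode characters:

    ;; U+00E9, the precomposed form (NFC)
    (define e-nfc (string #\x00E9))
    ;; U+0065 "e" followed by U+0301 COMBINING ACUTE ACCENT (NFD)
    (define e-nfd (string #\x0065 #\x0301))

    (string=? e-nfc e-nfd)   ; => #f: same text, distinct codepoint sequences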
 
> If you think you need a NUL in the middle of a string, then you are not using it to represent language.  Use a blob instead.
Emphatic +1, where blobs are typically bytevectors.
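
Concretely, in R7RS-small terms: a bytevector carries arbitrary octets, NULs included, with no pretension of being text, and string->utf8 / utf8->string mark the explicit border crossings (payload is just an illustrative name):

    (define payload (bytevector 2 0 255 0 10))  ; raw octets, including 0
    (bytevector-u8-ref payload 1)               ; => 0

    (string->utf8 "hi")                         ; => #u8(104 105)
    (utf8->string #u8(104 105))                 ; => "hi"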