Unicode is IMO horrible when considered as a binary interface to
anything.
Since it was never built for that, it's hardly surprising.
Even its basic control characters are horrible if allowed in program
text: for example, the bidirectional embedding and override controls
can make code render in an order that differs from the order in which
it is parsed.
The only well-defined 8-bit control characters in Unicode are TAB, CR, LF, and NEL; everything else is mapped in from other character sets, and can be changed without affecting the Unicode status of the rest.
The only way to defend against this would be to require valid source
code to have bidirectional nesting strictly isomorphic to the nesting
of the program's parse tree.
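For concreteness, a short Python sketch (the string here is purely
illustrative) of how a single bidirectional override makes the rendered
order diverge from the codepoint order a parser sees:

    # Assumed Python 3.  U+202E RIGHT-TO-LEFT OVERRIDE tells a
    # bidi-aware renderer to display the following characters
    # right-to-left; the codepoint order itself is unchanged.
    s = "abc" + "\u202e" + "def"
    print(list(s))   # codepoint order: a b c RLO d e f
    print(s)         # a bidi-aware terminal may display this as "abcfed"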
Or to stop paying attention to how code is rendered and look solely at
what codepoints it contains, which is what happens in practice. On the
other hand, identifiers are both linguistic and non-linguistic at the
same time, and that's why they need Unicode. (I read somewhere that one
reason Java became popular in Japan before it did elsewhere was that
identifiers could be written in actual orthography, not just
romanization or katakana-ization.)
My opinion is that if you're going to have Unicode strings, you
can't use them for any purpose other than representing language. If
you can, then you have something which will, by definition, be
non-conforming to the Unicode standard. It would allow low
surrogates to appear by themselves or in the wrong order relative to
high surrogates, or allow surrogates to appear in UTF-32, or allow
non-character values to occupy positions in the string, or allow
strings to be divided in the middle of a grapheme cluster, or allow
you to make non-normalized strings, or any of a thousand other
things that result in a bogus, non-conforming string.
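One of the failure modes listed here is easy to reproduce in Python (a
sketch, assuming CPython's default codec behaviour): a lone surrogate
can live in a str but cannot be encoded as valid UTF-8.

    lone = "\ud800"               # lone high surrogate: not a Unicode scalar value
    try:
        lone.encode("utf-8")      # CPython refuses: surrogates not allowed
    except UnicodeEncodeError as err:
        print("non-conforming string:", err)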
I agree with all points but the last. There are multiple normalization
forms, and to make all but one impossible is a grave mistake. The same
is true of so-called grapheme clusters: which clusters are appropriate
and which are not depends on the language being used.
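To make the normalization point concrete, a small Python sketch using
the standard unicodedata module: NFC and NFD are both conforming forms
of the same text, so a string type that can only ever hold one of them
is unable to represent the other.

    import unicodedata

    nfc = unicodedata.normalize("NFC", "\u00e9")  # U+00E9, precomposed
    nfd = unicodedata.normalize("NFD", "\u00e9")  # U+0065 U+0301, decomposed
    print(nfc == nfd)          # False: same text, different codepoints
    print(len(nfc), len(nfd))  # 1 2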
If you think you need a NUL
in the middle of a string, then you are not using it to represent
language. Use a blob instead.
Emphatic +1, where blobs are typically bytevectors.
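In Python terms (a sketch; bytes plays the role a bytevector plays in
Scheme), a binary payload with an embedded NUL is unproblematic once it
is kept out of the string type:

    payload = b"HDR\x00\x01\x02rest-of-record"
    print(len(payload))   # counts bytes, the NUL included
    print(payload[3])     # 0 -- the NUL is just another byte, no textual meaning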