>The other thing I didn't care for (in UTR 15) was the recommendation to
>use NFC for case-sensitive languages and NFKC for case-insensitive
>languages. NFC is designed for round-trip conversions, and it often uses
>different encodings for visually indistinguishable symbols. For example,
>the letters "ffi" and the "ffi" ligature are distinct under NFC (IIRC).
>That's a very bad property for programming language identifiers.
>Unfortunately, NFKC isn't perfect either. One thing I especially dislike
>is that it flattens the differences between the mathematical alphabets.
>Here you have a case where graphemes *are* visually distinguishable, and
>for good reason, but the normalization form treats them as identical. If
>you're working on a sublanguage for symbolic mathematics, you might be
>tempted to write "double-struck small letter j" for a unit vector and
>"italic small letter j" for the imaginary unit. But NFKC folds them
>together. You'll need to modify NFKC for mathematics, or track the
>semantic data separately (which amounts to the same thing).

I'm pretty sure that the hashcode of an identifier (its 'identity') needs
to be calculated from a normalized form of that identifier. I'm also
pretty sure that a system which didn't also literally convert
identical-hashing identifiers to an identical representation and screen
appearance would be worse instead of better, because the fundamental
property of identifiers is that we the programmers need to be able to
distinguish between them correctly.

Note: in the following I'm not going to talk about NFD or NFKD; they are
equivalent in expressive power to NFC and NFKC respectively, and we are
unconcerned with issues of strictly internal representation. The points
below apply to them as well.

As you correctly point out, there are problems with both NFC and NFKC.
However, the existence of NFKC makes the use of NFC identifiers risky in
several ways. Although I like NFC better technically, I'd have to
recommend NFKC as a standard because of an asymmetry of risk: text
created to remain distinguishable under NFKC is "safe" from bogus format
changes to NFC, but text that relies on NFC-only distinctions is not safe
from conversion to NFKC. Three scenarios follow.

If an implementation goes with NFC, and an editor (even an editor that
understands that these are different characters and doesn't mangle the
save) fails to render separate, distinct glyphs for all those
mathematical variants, then code that uses them to form identical-looking
identifiers that are actually distinct is going to appear damned
mysterious to users of that editor. Nothing breaks except the user's mind
as he attempts to understand the code.

However, if code is written for an NFC system and then read on a system
that folds all of those distinctions away by converting identifiers into
NFKC form, then we have the same problem, except that the compiler is now
confused rather than the user.

Now imagine an NFC system containing a source file with a function using
identifiers which would be indistinguishable under NFKC -- possibly a
math library. A user loads the source into an editor that treats it as
"text", meaning NFKC, and without necessarily even seeing the function
that uses the now-identical identifiers, he edits some unrelated function
and saves the file. Breakage follows even though he has not migrated
between compilers.
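
(To make the folding concrete, here is a rough sketch using the
R6RS-style (rnrs unicode) normalization procedures. Not every
implementation provides these, or full Unicode strings, and
identifier-key / same-identifier? are names I've invented just for the
example.)

  ;; Sketch only: assumes full Unicode strings and (rnrs unicode).
  (import (rnrs))

  ;; U+FB03, the "ffi" ligature, survives NFC but folds under NFKC.
  (define ligature "\xFB03;")
  (string-normalize-nfc ligature)    ; => "\xFB03;" (still one character)
  (string-normalize-nfkc ligature)   ; => "ffi"     (three letters)

  ;; The two mathematical j's stay distinct under NFC but both
  ;; become plain "j" under NFKC.
  (define unit-vector "\x1D55B;")    ; double-struck small j
  (define imaginary   "\x1D457;")    ; italic small j
  (string=? (string-normalize-nfc unit-vector)
            (string-normalize-nfc imaginary))    ; => #f
  (string=? (string-normalize-nfkc unit-vector)
            (string-normalize-nfkc imaginary))   ; => #t

  ;; Identifier identity computed from a normalized form, as above.
  ;; (Invented names, for illustration only.)
  (define (identifier-key id)        ; id is the identifier's text
    (string-normalize-nfkc id))      ; or string-normalize-nfc, per policy
  (define (same-identifier? a b)
    (string=? (identifier-key a) (identifier-key b)))

The asymmetry shows up here too: an NFKC-normalized string is already in
NFC form, so renormalizing NFKC-safe text to NFC changes nothing, while
NFKC-normalizing NFC text can collapse identifiers that were distinct.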
We can 'standardize' NFC, which is capable of finer distinctions, and
users will occasionally, until their personal learning curves are
climbed, be burned by editors, tools, and nonconforming systems that
treat program text as NFKC. Meanwhile, code written for those
nonconforming systems will not be burned by the tools, editors, and
compilers that conform.

Or we can 'standardize' NFKC, which is capable of fewer distinctions,
thereby limiting the range of identifiers available to programmers, and
develop a generation of editors and compilers that will make the later
adoption of NFC effectively impossible, because program text for an NFC
system cannot reasonably be kept out of 'standard' editors and tools.

Or we can let the standard not-speak on this issue and let a thousand
flowers bloom... with chaotic results: lots of source code that won't
move between systems, ad-hoc tools for converting from one form to
another, and general chaos and nonportable code ensuing for decades. The
disadvantages of this are obvious, but the advantage is that we don't
lose implementors by demanding that they 'go with' one or the other
before they've reached a real consensus about which is better.

Remember, it's not a standard unless people follow it, and we're
documenting which direction people have gone as much as we are telling
them which direction to go. We don't know which direction they're going
yet, and this bunch is as predictable as a herd of cats. If we try to
standardize it, we may be guessing wrong.

>It's especially bad if you're considering a language for typesetting
>mathematics! (Not that anybody would ever want to implement a TeX-like
>language in Scheme, right?) The Unicode character set is well-suited to
>that task, but the normalization forms aren't, IMO.

ah-heh... I assume there's a smiley in there somewhere...

Bear