In response to my rant about UTR 15, Annex 7, bear wrote:

> I'm pretty sure that the hashcode of an identifier (its 'identity')
> needs to be calculated from a normalized form of that identifier. I'm
> also pretty sure that a system which didn't also literally convert
> identical-hashing identifiers to identical representation and screen
> appearance would be worse instead of better, because the fundamental
> property of identifiers is that we the programmers need to be able to
> distinguish between them correctly.

Exactly. Unfortunately, it's a bit difficult to pin down "identical representation." Does your editor represent "math italic" and "blackboard bold" identically, or does it use two separate, distinguishable typefaces? Or more subtly: even if your editor does represent "ffi" differently as letters and as a ligature, can you see the difference?

Coding standards include guidelines like, "Don't use two identifiers that differ by only one character or otherwise look alike." It's hard to tell how much of this to handle in the language and how much to leave to coding standards. For example, what about the identifiers "resume" and "résumé"? The editor can probably distinguish the two, but what about the programming language? Should it fold them together? Should it fold them only in some locales (like English, where they collate identically)? Should it treat them as different, and leave it up to the coding-standards gurus?

In other words, should the compiler help in deciding when two identifiers are too much "alike"? Should it treat them as if they really are the same? Where do you draw the line? Is it locale-dependent? Domain-dependent? (For example, I'm not sure how to feel about the math alphabets. On the one hand, I think they're great, because they make it easier to put math in plain text or source code, without markup or binary format issues. On the other hand, they only complicate this problem.)

> As you correctly point out there are problems with both NFC and with
> NFKC. However, the existence of NFKC makes the use of NFC identifiers
> risky in several ways.

I think NFKC is necessary, though. I think it's far superior to NFC for most applications that don't require round-tripping, and I get the impression that it's the "preferred" way to normalize text.

The Unicode Consortium has chosen a very specific level of abstraction. I think it's easiest to describe it with a "keys on a typewriter" metaphor. On a typewriter, there's no "degrees Fahrenheit" key, because you can type it just as well as "degree sign, capital F." Likewise, a typewriter doesn't have different keys for "numeral 1" and "superscript numeral 1." Nor does it have a "universal radix sign"; you must choose a full stop, comma, apostrophe, thin space, etc. as appropriate for your audience.

The Consortium prefers that abstraction, and therefore I think they prefer NFKC, because it flattens all of the semantic and arbitrary details into pure "typewriter forms." They provide NFC because it's necessary for round-tripping and intermediate encodings -- for compatibility with other systems, basically. But I think they see that as a necessary evil.

And because of that, I'm a bit surprised that they added several complete math alphabets to the character set. They aren't necessary for interchange, and they go against the "one abstract font to rule them all" concept. What's really aggravating is that they're "compatibility forms," even though they weren't added for compatibility's sake!
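To make the "typewriter" abstraction concrete, here's what NFKC does (and NFC doesn't) to a few of these characters. I'm using Python's unicodedata module purely for illustration; any conformant Unicode library should give the same answers:

    import unicodedata

    samples = [
        '\u2109',      # DEGREE FAHRENHEIT
        '\u00b9',      # SUPERSCRIPT ONE
        '\ufb03',      # LATIN SMALL LIGATURE FFI
        '\u2102',      # DOUBLE-STRUCK CAPITAL C ("blackboard bold" C)
        '\U0001d49c',  # MATHEMATICAL SCRIPT CAPITAL A (math alphabet)
    ]
    for ch in samples:
        print(unicodedata.name(ch))
        print('  NFC: ', unicodedata.normalize('NFC', ch))
        print('  NFKC:', unicodedata.normalize('NFKC', ch))

NFC leaves all five of them alone; NFKC folds them to "°F", "1", "ffi", "C", and "A". The math alphabets get flattened right along with the genuine typewriter cases, which is exactly what I'm grumbling about.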
In my mind, there's a difference between folding an "Angstrom sign" and folding the math alphabets, but the character database treats them both the same. Perhaps this is a subtle hint to discourage their use. Oops, sorry, got off on a rant there.

> Although I like NFC better technically, I'd have to recommend NFKC as
> a standard ....

I partly agree. I don't like NFC better technically; I think it's the wrong normalization form to use unless you're doing some kind of interchange or round-tripping. It preserves many distinctions that are meaningless to anything but a text-conversion program. For example, preserving typesetter's ligatures is useless in most applications; with only a few exceptions, you can handle that in the presentation layer.

Nor do I really like NFKC as a standard. I think it's too naive in some ways and too incomplete in other ways. It's good enough for some applications, but it's no substitute for domain-specific normalization. Examples of both:

If you're reinventing TeX, it would be *very* handy to preserve the math alphabets, but you'd probably want to fold away most of the other compatibility characters. In this way, NFKC is too naive.

If you're implementing a general-purpose programming language, you might want to just fold away all of the compatibility characters, because they're too prone to cause confusion. But even after you do that, you still have "Latin capital letter B" and "Greek capital letter Beta" in the source code, and the programmer will never, ever figure out why the damn compiler is complaining about an undefined variable name. In this way, NFKC is naive and incomplete.

There really isn't any substitute for a domain-specific normalization form that keeps everything you care about and folds the rest. But if you don't have that, NFKC is better than the alternatives.

> We can 'standardize' NFC, which is capable of finer distinctions, and
> users will occasionally, until their personal learning curves are
> climbed, be burned by editors, tools, and nonconforming systems that
> treat program text as NFKC.

They'll also get burned even when all the tools work, because there are too many indistinguishable compatibility forms. I don't know what the author of UTR 15 was thinking when he recommended NFC for case-sensitive language identifiers. I don't think it's even remotely appropriate for any programming language (except maybe Intercal or APL).

> or we can 'standardize' NFKC, which is capable of less distinction,
> thereby limiting the range of identifiers available to programmers,
> and develop a generation of editors and compilers that will make the
> later adoption of NFC effectively impossible because program text for
> an NFC system cannot be reasonably kept out of 'standard' editors and
> tools.

Hey, that's not a bad thing! I'd rather not use NFKC, but I'd much rather use it than NFC.

> Or we can let the standard not-speak on this issue, and let a thousand
> flowers bloom... with chaotic results and lots of source code that
> won't go between systems and ad-hoc tools for converting from one form
> to another and etc etc etc with general chaos and nonportable code
> ensuing for decades. The disadvantages of this are obvious, but the
> advantage is that we don't lose implementors by demanding that they
> 'go with' one or the other before they've reached a real consensus
> about which is better.

In my very serious opinion, I think NFKC is insufficient for programming language syntax, but better than the alternatives.
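To illustrate the "naive and incomplete" part, and what a domain-specific form has to add on top of NFKC: NFKC doesn't touch the Latin B / Greek Beta problem at all. The sketch below is hypothetical -- fold_identifier is my own name, and the mixed-alphabet check (keying off the first word of each character's Unicode name) is a crude stand-in for a real confusables check -- but it shows where such a policy has to live: outside the normalization form.

    import unicodedata

    def fold_identifier(name):
        # Hypothetical domain-specific normalization for identifiers:
        # NFKC folding plus a crude mixed-alphabet check.
        folded = unicodedata.normalize('NFKC', name)
        alphabets = {unicodedata.name(c).split()[0]
                     for c in folded if c.isalpha()}
        if len(alphabets) > 1:
            raise ValueError('mixed-alphabet identifier: %r' % name)
        return folded

    latin = '\u0042eta'   # starts with LATIN CAPITAL LETTER B
    greek = '\u0392eta'   # starts with GREEK CAPITAL LETTER BETA

    # NFKC alone keeps them distinct (and indistinguishable on screen):
    print(unicodedata.normalize('NFKC', latin) ==
          unicodedata.normalize('NFKC', greek))   # False
    print(fold_identifier(latin))                 # Beta
    try:
        fold_identifier(greek)
    except ValueError as err:
        print(err)                                # mixed-alphabet identifier

A real implementation would use the Unicode script properties rather than character names, but the point stands: NFKC by itself won't save the programmer from that undefined-variable mystery.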
(And no, there is no smiley in there, not even an invisible one. I mean it.)

>> It's especially bad if you're considering a language for typesetting
>> mathematics! (Not that anybody would ever want to implement a
>> TeX-like language in Scheme, right?) The Unicode character set is
>> well-suited to that task, but the normalization forms aren't, IMO.

> ah-heh... I assume there's a smiley in there somewhere...

I don't use smileys, but you got the right idea!
--
Bradd W. Szonye
http://www.szonye.com/bradd