Re: Encodings. Bradd W. Szonye 12 Feb 2004 09:56 UTC

In response to my rant about UTR 15, Annex 7:

bear wrote:
> I'm pretty sure that the hashcode of an identifier (its 'identity')
> needs to be calculated from a normalized form of that identifier. I'm
> also pretty sure that a system which didn't also literally convert
> identical-hashing identifiers to identical representation and screen
> appearance would be worse instead of better, because the fundamental
> property of identifiers is that we the programmers need to be able to
> distinguish between them correctly.

Exactly. Unfortunately, it's a bit difficult to pin down "identical
representation." Does your editor represent "math italic" and
"blackboard bold" identically, or does it use two separate,
distinguishable typefaces? Or more subtly: Even if your editor does
represent "ffi" differently as letters and ligature, can you see the
difference? Coding standards include guidelines like, "Don't use two
identifiers that differ by only one character or otherwise look alike."
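To make the ligature case concrete (a Python sketch using the standard `unicodedata` module; the thread is about Scheme, so treat this purely as illustration): NFC preserves the typesetter's ligature as one character, while NFKC folds it into three letters:

```python
import unicodedata

ffi = "\uFB03"  # LATIN SMALL LIGATURE FFI

nfc = unicodedata.normalize("NFC", ffi)    # ligature survives
nfkc = unicodedata.normalize("NFKC", ffi)  # folded to "ffi"

print(len(nfc), len(nfkc))  # 1 3
```

So two identifiers that look identical on screen can hash differently under NFC and identically under NFKC.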

It's hard to tell how much of this to handle in the language and how
much to leave to coding standards. For example, what about the
identifiers "resume" and "résumé"? The editor can probably distinguish
the two, but what about the programming language? Should it fold them
together? Should it only fold them in some locales (like English, where
they collate identically)? Should it treat them as different, and leave
it up to the coding-standards gurus?
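Worth noting: neither normalization form answers that question, because the accents in "résumé" are canonical compositions, not compatibility forms; any accent folding would have to be a separate, locale-aware step. A Python illustration (the NFD-and-strip trick here is a crude demonstration of such a step, not a proposal):

```python
import unicodedata

plain = "resume"
accented = "r\u00e9sum\u00e9"  # résumé, precomposed

# Neither NFC nor NFKC folds the two identifiers together.
assert unicodedata.normalize("NFC", accented) != plain
assert unicodedata.normalize("NFKC", accented) != plain

# A crude accent fold: decompose (NFD), then drop combining marks.
stripped = "".join(
    c for c in unicodedata.normalize("NFD", accented)
    if unicodedata.category(c) != "Mn"  # Mn = nonspacing combining mark
)
print(stripped)  # resume
```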

In other words, should the compiler help in deciding when two
identifiers are too much "alike"? Should it treat them as if they really
are the same? Where do you draw the line? Is it locale-dependent?
Domain-dependent? (For example, I'm not sure how to feel about the math
alphabets. On the one hand, I think they're great, because they make it
easier to put math in plain text or source code, without markup or
binary format issues. On the other hand, they only complicate this
problem.)

> As you correctly point out there are problems with both NFC and with
> NFKC. However, the existence of NFKC makes the use of NFC identifiers
> risky in several ways.

I think NFKC is necessary, though. I think it's far superior to NFC for
most applications that don't require round-tripping, and I get the
impression that it's the "preferred" way to normalize texts. The Unicode
Consortium has chosen a very specific level of abstraction. I think it's
easiest to describe it with a "keys on a typewriter" metaphor. On a
typewriter, there's no "degrees Fahrenheit" key, because you can type it
just as well as "degree sign, capital F." Likewise, a typewriter doesn't
have different keys for "numeral 1" and "superscript numeral 1." Nor
does it have a "universal radix sign"; you must choose a full-stop,
comma, apostrophe, thin space, etc. as appropriate for your audience.
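The typewriter foldings are real mappings in NFKC: U+2109 DEGREE FAHRENHEIT carries a compatibility decomposition to degree sign + capital F, and U+00B9 SUPERSCRIPT ONE to a plain digit 1. Sketched in Python:

```python
import unicodedata

fahrenheit = "\u2109"  # DEGREE FAHRENHEIT
super_one = "\u00b9"   # SUPERSCRIPT ONE

f_folded = unicodedata.normalize("NFKC", fahrenheit)  # degree sign + F
one_folded = unicodedata.normalize("NFKC", super_one)  # plain "1"

# NFC, by contrast, leaves both compatibility characters alone.
assert unicodedata.normalize("NFC", fahrenheit) == fahrenheit
assert unicodedata.normalize("NFC", super_one) == super_one
```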

The Consortium prefers that abstraction, and therefore I think they
prefer NFKC, because it flattens all of the semantic and arbitrary
details into pure "typewriter forms." They provide NFC because it's
necessary for round-tripping and intermediate encodings -- for
compatibility with other systems, basically. But I think they see that
as a necessary evil.

And because of that, I'm a bit surprised that they added several
complete math alphabets to the character set. It isn't necessary for
interchange, and it goes against the "one abstract font to rule them
all" concept. What's really aggravating is that they're "compatibility
forms," even though they weren't added for compatibility's sake! In my
mind, there's a difference between folding an "Angstrom sign" and folding
math alphabets, but the character database treats them both the same.
Perhaps this is a subtle hint to discourage their use.
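For what it's worth, NFKC does fold both, though by different mechanisms: the angstrom sign folds even under plain NFC, because its decomposition to the ordinary Å is canonical (a singleton), while the math letters survive NFC and only fold under NFKC via their compatibility decompositions. A Python check:

```python
import unicodedata

angstrom_sign = "\u212B"  # ANGSTROM SIGN
a_ring = "\u00C5"         # LATIN CAPITAL LETTER A WITH RING ABOVE
dstruck_a = "\U0001D538"  # MATHEMATICAL DOUBLE-STRUCK CAPITAL A

# Canonical singleton: even NFC folds the angstrom sign away.
assert unicodedata.normalize("NFC", angstrom_sign) == a_ring

# Compatibility form: the math letter survives NFC but not NFKC.
assert unicodedata.normalize("NFC", dstruck_a) == dstruck_a
assert unicodedata.normalize("NFKC", dstruck_a) == "A"
```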

Oops, sorry, got off on a rant there.

> Although I like NFC better technically, I'd have to recommend NFKC as
> a standard ....

I partly agree. I don't like NFC better technically; I think it's the
wrong normalization form to use unless you're doing some kind of
interchange or round-tripping. It preserves many distinctions that are
meaningless to anything but a text-conversion program. For example,
preserving typesetter's ligatures is useless in most applications; with
only a few exceptions, you can handle that in the presentation layer.

Nor do I really like NFKC as a standard. I think it's too naive in some
ways and too incomplete in other ways. It's good enough for some
applications, but it's no substitute for domain-specific normalization.
Examples of both:

If you're reinventing TeX, it would be *very* handy to preserve the math
alphabets, but you'd probably want to fold away most of the other
compatibility characters. In this way, NFKC is too naive.

If you're implementing a general-purpose programming language, you might
want to just fold away all of the compatibility characters, because
they're too prone to cause confusion. But even after you do that, you
still have "Latin capital letter B" and "Greek capital letter Beta" in
the source code, and the programmer will never, ever figure out why the
damn compiler is complaining about an undefined variable name. In this
way, NFKC is naive and incomplete.
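That confusable pair really does survive NFKC untouched: the two capitals are distinct characters with no decomposition, and only their names give them away. A Python illustration:

```python
import unicodedata

latin_b = "B"          # U+0042
greek_beta = "\u0392"  # GREEK CAPITAL LETTER BETA

# NFKC does not fold the confusable pair together.
assert unicodedata.normalize("NFKC", greek_beta) != latin_b

print(unicodedata.name(latin_b))     # LATIN CAPITAL LETTER B
print(unicodedata.name(greek_beta))  # GREEK CAPITAL LETTER BETA
```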

There really isn't any substitute for a domain-specific normalization
form that keeps everything you care about and folds the rest. But if you
don't have that, NFKC is better than the alternatives.
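As a sketch of what a domain-specific form might look like (the policy and the folding table here are entirely hypothetical): run NFKC first, then apply whatever extra folds the domain demands, such as collapsing known cross-script confusables:

```python
import unicodedata

# Hypothetical extra folds a language might impose on identifiers,
# beyond NFKC. Rejecting these characters outright would be another
# reasonable policy; this table just illustrates the mechanism.
CONFUSABLE_FOLDS = {
    "\u0392": "B",  # GREEK CAPITAL LETTER BETA   -> LATIN B
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A   -> LATIN A
}

def normalize_identifier(name: str) -> str:
    """NFKC, then domain-specific confusable folding."""
    folded = unicodedata.normalize("NFKC", name)
    return "".join(CONFUSABLE_FOLDS.get(c, c) for c in folded)

print(normalize_identifier("\u0392eta"))  # Beta
```

Whether a compiler should silently fold such characters or reject them with an error is exactly the kind of judgment call that makes this domain-specific.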

> We can 'standardize' NFC, which is capable of finer distinctions, and
> users will occasionally, until their personal learning curves are
> climbed, be burned by editors, tools, and nonconforming systems that
> treat program text as NFKC.

They'll also get burned even when all the tools work, because there are
too many indistinguishable compatibility forms. I don't know what the
author of UTR 15 was thinking when he recommended it for case-sensitive
language identifiers. I don't think it's even remotely appropriate for
any programming language (except maybe Intercal or APL).

> or we can 'standardize' NFKC, which is capable of less distinction,
> thereby limiting the range of identifiers available to programmers,
> and develop a generation of editors and compilers that will make the
> later adoption of NFC effectively impossible because program text for
> an NFC system cannot be reasonably kept out of 'standard' editors and
> tools.

Hey, that's not a bad thing! I'd rather not use NFKC, but I'd much
rather use it than NFC.

> Or we can let the standard not-speak on this issue, and let a thousand
> flowers bloom... with chaotic results and lots of source code that
> won't go between systems and ad-hoc tools for converting from one form
> to another and etc etc etc with general chaos and nonportable code
> ensuing for decades. The disadvantages of this are obvious, but the
> advantage is that we don't lose implementors by demanding that they
> 'go with' one or the other before they've reached a real consensus
> about which is better.

In my very serious opinion, I think NFKC is insufficient for programming
language syntax, but better than the alternatives. (And no, there is no
smiley in there, not even an invisible one. I mean it.)

>> It's especially bad if you're considering a language for typesetting
>> mathematics! (Not that anybody would ever want to implement a
>> TeX-like language in Scheme, right?) The Unicode character set is
>> well-suited to that task, but the normalization forms aren't, IMO.

> ah-heh...  I assume there's a smiley in there somewhere...

I don't use smileys, but you got the right idea!
--
Bradd W. Szonye
http://www.szonye.com/bradd