Encodings. bear 12 Feb 2004 07:44 UTC


>The other thing I didn't care for (in UTR 15) was the recommendation to
>use NFC for case-sensitive languages and NFKC for case-insensitive
>languages. NFC is designed for round-trip conversions, and it often uses
>different encodings for visually indistinguishable symbols. For example,
>the letters "ffi" and the "ffi" ligature are distinct under NFC (IIRC).
>That's a very bad property for programming language identifiers.

>Unfortunately, NFKC isn't perfect either. One thing I especially dislike
>is that it flattens the differences between the mathematical alphabets.
>Here you have a case where graphemes *are* visually distinguishable, and
>for good reason, but the normalization form treats them as identical. If
>you're working on a sublanguage for symbolic mathematics, you might be
>tempted to write "double-struck small letter j" for a unit vector and
>"italic small letter j" for the imaginary unit. But NFKC folds them
>together. You'll need to modify NFKC for mathematics, or track the
>semantic data separately (which amounts to the same thing).

I'm pretty sure that the hashcode of an identifier (its 'identity')
needs to be calculated from a normalized form of that identifier.
I'm also pretty sure that a system which didn't also literally
convert identical-hashing identifiers to identical representation
and screen appearance would be worse instead of better, because the
fundamental property of identifiers is that we the programmers need
to be able to distinguish between them correctly.
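
As a concrete sketch of what I mean, here is roughly how an
implementation might intern identifiers keyed on a normalized form.
This is Python rather than Scheme, purely for illustration, and the
table and function names are my own invention:

    import unicodedata

    # Hypothetical intern table: identifiers that normalize to the
    # same string share one identity and one stored spelling.
    _interned = {}

    def intern_identifier(name, form="NFKC"):
        # Hash and compare on the normalized form, and keep that
        # normalized spelling as the canonical representation, so
        # identifiers that hash alike also look alike when printed.
        key = unicodedata.normalize(form, name)
        return _interned.setdefault(key, key)

Whether 'form' there should be NFC or NFKC is exactly the question
below.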

   Note: In the following, I'm not going to talk about NFD or NFKD; they
   are equivalent in expressive power to NFC and NFKC, respectively, and
   we are unconcerned with issues of strictly internal representation.
   The points below apply to them as well.

As you correctly point out there are problems with both NFC and with
NFKC. However, the existence of NFKC makes the use of NFC identifiers
risky in several ways.  Although I like NFC better technically, I'd
have to recommend NFKC as a standard because of an asymmetry: text
written to remain distinguishable under NFKC is "safe" from bogus
conversions to NFC, but the reverse is not true.  Three scenarios
follow.
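
First, though, a quick check of the asymmetry itself, using Python's
unicodedata module (the sample string is my own, not from anyone's
code):

    import unicodedata

    # 'e' + COMBINING ACUTE ACCENT, followed by the ffi ligature U+FB03
    s = "e\u0301\uFB03"

    nfkc = unicodedata.normalize("NFKC", s)
    nfc  = unicodedata.normalize("NFC", s)

    # Text already in NFKC survives a pass through NFC unchanged...
    assert unicodedata.normalize("NFC", nfkc) == nfkc

    # ...but NFC text is not safe from a pass through NFKC: the ffi
    # ligature, which NFC keeps distinct, gets folded to "ffi".
    assert unicodedata.normalize("NFKC", nfc) != nfc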

If an implementation goes with NFC, and an editor (even an editor that
understands that these are different characters and doesn't corrupt
them on save) fails to render separate, distinct glyphs for all those
mathematical variants, then code that uses them to form
indistinguishable-looking identifiers that are actually distinct is
going to appear damned mysterious to users of that editor.  Nothing
breaks except the user's mind as he attempts to understand the code.

However, if code is written for an NFC system and then read on a
system that folds all of those distinctions by converting identifiers
into NFKC form, then we will have the same problem, except that the
compiler is now confused rather than the user.
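
This is the quoted math-alphabet case in miniature: two identifiers
that NFC keeps distinct collapse into one under NFKC.  A small
illustration in Python (the identifier spellings are my own):

    import unicodedata

    j_unit      = "vec_\U0001D55B"   # MATHEMATICAL DOUBLE-STRUCK SMALL J
    j_imaginary = "vec_\U0001D457"   # MATHEMATICAL ITALIC SMALL J

    # An NFC compiler sees two distinct identifiers...
    assert (unicodedata.normalize("NFC", j_unit)
            != unicodedata.normalize("NFC", j_imaginary))

    # ...an NFKC compiler sees only one, spelled "vec_j".
    assert (unicodedata.normalize("NFKC", j_unit)
            == unicodedata.normalize("NFKC", j_imaginary)
            == "vec_j")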

Now imagine an NFC system containing a source file with a function
using identifiers which would be indistinguishable under NFKC --
possibly a math library.  A user loads the source into an editor that
treats it as "text", meaning NFKC, and without necessarily even seeing
the function that uses the now-identical identifiers, he edits some
unrelated function and saves the file.  Breakage follows even though
he has not migrated between compilers.

We can 'standardize' NFC, which is capable of finer distinctions, and
users will occasionally, until their personal learning curves are
climbed, be burned by editors, tools, and nonconforming systems that
treat program text as NFKC.  Meanwhile code written for those
nonconforming systems will not be burned by tools and editors and
compilers that conform.

Or we can 'standardize' NFKC, which is capable of fewer distinctions,
thereby limiting the range of identifiers available to programmers,
and develop a generation of editors and compilers that will make the
later adoption of NFC effectively impossible because program text for
an NFC system cannot be reasonably kept out of 'standard' editors and
tools.

Or we can let the standard not speak on this issue, and let a thousand
flowers bloom... with chaotic results: lots of source code that won't
move between systems, ad-hoc tools for converting from one form to
another, and general chaos and nonportable code ensuing for decades.
The disadvantages of this are obvious, but the
advantage is that we don't lose implementors by demanding that they
'go with' one or the other before they've reached a real consensus
about which is better.  Remember, it's not a standard unless people
follow it, and we're documenting which direction people have gone as
much as we are telling them which direction to go.  We don't know
which direction they're going yet, and this bunch is as predictable as
a herd of cats.  If we try to standardize it, we may be guessing
wrong.

>It's especially bad if you're considering a language for typesetting
>mathematics! (Not that anybody would ever want to implement a TeX-like
>language in Scheme, right?) The Unicode character set is well-suited to
>that task, but the normalization forms aren't, IMO.

ah-heh...  I assume there's a smiley in there somewhere...

				Bear