Re: Encodings. Bradd W. Szonye 12 Feb 2004 17:17 UTC

On Thu, Feb 12, 2004 at 07:41:07AM +0100, Ken Dickey wrote:
> Back to dumb questions.
>
> I assume that it is useful to distinguish the two goals of
> 	extending programming language identifiers
> and	processing Unicode data.
>
> For identifiers, either we have EQ? preserving literals, or
> "literalization of bits" (I.e. string preservation).
>
> So w.r.t. identifiers, why is normalization needed at all? To my mind,
> normalization is a library procedure (set of procedures) for dealing
> with Unicode data/codepoints.

Normalization is a way to eliminate "trivial" differences between
strings. There are often several ways to encode exactly the same
character (grapheme), and normalization is a procedure for folding all
of the variants down to a single, canonical encoding.
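
For instance, here's a quick sketch in Python; this thread is about
Scheme, of course, but Python's standard unicodedata module makes the
behavior easy to show:

    import unicodedata

    precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT

    # Same grapheme, different code-point sequences:
    print(precomposed == decomposed)    # False

    # NFC folds both onto the canonical (precomposed) form:
    print(unicodedata.normalize("NFC", precomposed)
          == unicodedata.normalize("NFC", decomposed))    # True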

If you're doing a simple test for exact string equality (string=, for
example, but not string-ci=), then normalization is both necessary and
sufficient preparation. It's necessary because, without it, trivial
encoding differences result in false negatives. It's also sufficient:
after it, a simple grapheme-by-grapheme (or binary) comparison is
enough.
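
In code, the normalize-then-compare discipline is nearly a one-liner
(Python again; nfc_equal is just a name I made up):

    import unicodedata

    def nfc_equal(a, b):
        """Exact equality, up to Unicode canonical equivalence."""
        return (unicodedata.normalize("NFC", a)
                == unicodedata.normalize("NFC", b))

    print("\u00e9" == "e\u0301")             # False: a false negative
    print(nfc_equal("\u00e9", "e\u0301"))    # True: normalization fixes it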

> Defining valid identifier syntax such that case folding of
> (unnormalized) identifier literals should be sufficient.
>
> What am I missing?

If you're already folding case or otherwise saying "these characters are
equivalent" (i.e., using string collation for equality testing), then I
suppose you don't *need* to normalize. I think it does simplify
processing a bit, though, because you deal with all the encoding quirks
first and only then with the language quirks.
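
In that normalize-first style, an identifier comparison might look like
this (a sketch; ident_equal is a hypothetical name):

    import unicodedata

    def ident_equal(a, b):
        """Encoding quirks first (NFC), then language quirks (case folding)."""
        na = unicodedata.normalize("NFC", a).casefold()
        nb = unicodedata.normalize("NFC", b).casefold()
        return na == nb

    print(ident_equal("Lambda", "LAMBDA"))          # True
    print(ident_equal("\u00c9tat", "e\u0301tat"))   # True: E-acute either way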

Or to put it another way, case folding is just a specific kind of
normalization, one that removes the "trivial encoding differences" among
the variants of the letter A.

By the way, regarding the issue I brought up about Latin B vs Greek B:
after posting, I realized that it might be better to handle that with
collation rules instead of normalization (folding). Then again, I
suppose that it doesn't make much of a difference. The two operations
are equivalent with regard to equality testing (although they do have
different side effects).
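
To make the folding version concrete (the table below is a made-up,
domain-specific mapping; no Unicode normalization form does this):

    # Hypothetical domain-specific fold; NFC/NFD/NFKC/NFKD all leave
    # Greek capital beta and Latin capital B distinct.
    CONFUSABLE_FOLD = {
        "\u0392": "B",   # GREEK CAPITAL LETTER BETA -> LATIN CAPITAL LETTER B
    }

    def fold_confusables(s):
        return "".join(CONFUSABLE_FOLD.get(ch, ch) for ch in s)

    print("\u0392" == "B")                     # False: distinct code points
    print(fold_confusables("\u0392") == "B")   # True under this folding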

Hm, I wonder whether the Unicode Consortium expects people with special
collation needs to use NFC for normalization, followed by a
domain-specific folding or collation step. That's kind of weird, though,
because the second step will often include some (but not all) of the
compatibility transformations.
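
Concretely, I imagine that two-step recipe looking something like this
(the selective compatibility table is invented for illustration):

    import unicodedata

    # Invented partial fold: pick up a few compatibility mappings,
    # but not everything NFKC would do (superscripts stay distinct).
    PARTIAL_COMPAT_FOLD = {
        "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
        "\uff21": "A",    # FULLWIDTH LATIN CAPITAL LETTER A
    }

    def domain_normalize(s):
        s = unicodedata.normalize("NFC", s)    # step 1: canonical form
        return "".join(PARTIAL_COMPAT_FOLD.get(ch, ch)   # step 2: fold
                       for ch in s)

    print(domain_normalize("\ufb01le"))    # "file"
    print(domain_normalize("x\u00b2"))     # "x²": superscript left alone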
--
Bradd W. Szonye
http://www.szonye.com/bradd