Re: the discussion so far John.Cowan 18 Jul 2005 14:09 UTC
Michael Sperber scripsit:

> US-ASCII, ISO 8859-1, and UCS-2-based [...]
> subsets are all closed with respect to the case folding in
> UnicodeData.txt.  I don't know offhand if that's also the case with
> full Unicode case folding.

It is not true of either simple or full case folding as specified in
CaseFolding.txt; in particular, the 8859-1 character MICRO SIGN (0xB5,
U+00B5) folds to a proper GREEK SMALL LETTER MU (U+03BC) as a consequence
of the compatibility equivalence between the two.

There are also encodings which are not closed even under lowercasing:
of the 123 encodings I have information for, 30 are not closed under
lowercasing, 54 are not closed under simple folding, and 60 are not
closed under full folding.  (Details on request.)
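
A rough way to reproduce this kind of check for a single-byte encoding,
as a sketch only: decode each byte, apply the case operation, and see
whether the result still encodes.  The helper name not_closed is
hypothetical.  It relies on Python's codecs and its current Unicode
tables, so it will not necessarily match the counts above, and
str.casefold implements full folding only (the standard library has no
simple folding).

```python
def not_closed(encoding, transform):
    """List the characters of a single-byte encoding whose image under
    `transform` cannot be encoded back into that encoding."""
    offenders = []
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this encoding
        try:
            transform(ch).encode(encoding)
        except UnicodeEncodeError:
            offenders.append(ch)  # result falls outside the encoding
    return offenders


# Lowercasing versus full case folding on ISO 8859-1.
print(not_closed("latin-1", str.lower))
print(not_closed("latin-1", str.casefold))
```

MICRO SIGN shows up in the second list because its folding, U+03BC,
has no 8859-1 code point.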

Jorgen Schaefer scripsit:

> Luckily, case folding is specified in such a way that a normalized
> sequence of code points remains normalized if case-folded.

This is exactly backwards.  Case folding does *not* preserve normalization,
but *does* work correctly even on unnormalized input.  For example,
the sequence <01F0> is in normalization form C, but folds to
<006A,030C>, which is not.
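
One way to test both halves of that claim, sketched with Python's
str.casefold (full folding) and unicodedata.normalize.  The helper name
fold_keeps_nfc and the choice of sample are illustrative: LATIN SMALL
LETTER J WITH CARON (U+01F0), whose full folding in CaseFolding.txt is
<006A,030C>.

```python
import unicodedata


def fold_keeps_nfc(text):
    """True if case-folding `text` yields a string that is already in NFC."""
    folded = text.casefold()
    return unicodedata.normalize("NFC", folded) == folded


composed = "\u01f0"                                   # in NFC
decomposed = unicodedata.normalize("NFD", composed)   # <006A,030C>

# Folding the NFC string produces output that is no longer in NFC.
print(fold_keeps_nfc(composed))

# Folding the composed and decomposed spellings gives the same result,
# so unnormalized input is still handled consistently.
print(composed.casefold() == decomposed.casefold())
```

Renormalizing after folding restores form C if a normalized result is
needed.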

I do agree that normalization functions are a Good Thing, though not
necessarily for the Scheme core.

--
Overhead, without any fuss, the stars were going out.
        --Arthur C. Clarke, "The Nine Billion Names of God"
                John Cowan <xxxxxx@reutershealth.com>