Re: the discussion so far
John.Cowan 18 Jul 2005 14:09 UTC
Michael Sperber scripsit:
> US-ASCII, ISO 8859-1, and UCS-2-based [...]
> subsets are all closed with respect to the case folding in
> UnicodeData.txt. I don't know offhand if that's also the case with
> full Unicode case folding.
It is not true of either simple or full case folding as specified in
CaseFolding.txt; in particular, the 8859-1 character MICRO SIGN (0xB5,
U+00B5) folds to a proper GREEK SMALL LETTER MU (U+03BC), which lies
outside 8859-1, as a consequence of the compatibility equivalence
between the two.
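For anyone who wants to check the MICRO SIGN case by hand, here is a minimal Python sketch. It assumes that Python's `str.casefold()` is an acceptable stand-in for full case folding per CaseFolding.txt; the helper name `fits_latin1` is made up for this illustration.

```python
import unicodedata

MICRO = "\u00b5"  # MICRO SIGN, byte 0xB5 in ISO 8859-1


def fits_latin1(text):
    """Return True if every character of text can be encoded in ISO 8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# Assumption: str.casefold() applies full case folding.
folded = MICRO.casefold()

# The compatibility mapping of MICRO SIGN, for comparison with the fold.
compat = unicodedata.normalize("NFKC", MICRO)

print(f"folded to U+{ord(folded):04X} {unicodedata.name(folded)}")
print("input fits 8859-1: ", fits_latin1(MICRO))
print("folded fits 8859-1:", fits_latin1(folded))
print("same as NFKC form: ", compat == folded)
```

The folded character cannot be encoded back into 8859-1, which is all that "not closed" means here.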
There are also encodings which are not closed even under lowercasing:
of the 123 encodings I have information for, 30 are not closed under
lowercasing, 54 are not closed under simple folding, and 60 are not
closed under full folding. (Details on request.)
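A survey of this kind can be approximated along the following lines. This is only a sketch under stated assumptions, not the procedure used for the figures above: it handles single-byte codecs only, it relies on whatever mapping tables the Python codecs ship with, and Python offers lowercasing (`str.lower`) and full folding (`str.casefold`) but no built-in simple folding. The function name `not_closed` is hypothetical.

```python
def not_closed(encoding, mapping):
    """List the characters of a single-byte encoding whose image under
    `mapping` can no longer be encoded in that same encoding."""
    offenders = []
    for byte in range(256):
        try:
            char = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            continue  # this byte value is unassigned in the encoding
        try:
            # The mapping may yield several characters; all must fit.
            mapping(char).encode(encoding)
        except UnicodeEncodeError:
            offenders.append(char)
    return offenders


# Codec names are Python's; the selection is arbitrary.
for name in ("ascii", "latin-1", "cp437"):
    lower = not_closed(name, str.lower)
    fold = not_closed(name, str.casefold)
    print(f"{name}: {len(lower)} escape under lowercasing, "
          f"{len(fold)} under full folding")
```

An empty list means the encoding is closed under that mapping. Multi-byte encodings would need a different way of enumerating their repertoire.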
Jorgen Schaefer scripsit:
> Luckily, case folding is specified in such a way that a normalized
> sequence of code points remains normalized if case-folded.
This is exactly backwards. Case folding does *not* preserve normalization,
but *does* work correctly even on unnormalized input. For example,
the sequence <01F0> is in normalization form C, but folds to
<006A,030C>, which is not.
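The loss of normalization under folding is easy to reproduce. The sketch below uses U+01F0 (LATIN SMALL LETTER J WITH CARON) as the input and again assumes that Python's `str.casefold()` implements full case folding; `is_nfc` is a helper defined only for this example.

```python
import unicodedata


def is_nfc(text):
    """Return True if text is already in normalization form C."""
    return unicodedata.normalize("NFC", text) == text


source = "\u01f0"            # LATIN SMALL LETTER J WITH CARON, precomposed
folded = source.casefold()   # assumption: full case folding

print("source is NFC:", is_nfc(source))
print("folded code points:", [f"{ord(c):04X}" for c in folded])
print("folded is NFC:", is_nfc(folded))

# Folding unnormalized input still gives a canonically equivalent result:
# J followed by COMBINING CARON folds to the same sequence.
other = "J\u030c".casefold()
print("same after NFC:",
      unicodedata.normalize("NFC", other)
      == unicodedata.normalize("NFC", folded))
```

So a caller who needs normalized output has to normalize again after folding rather than rely on the input having been normalized.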
I do agree that normalization functions are a Good Thing, though not
necessarily for the Scheme core.
--
Overhead, without any fuss, the stars were going out.
--Arthur C. Clarke, "The Nine Billion Names of God"
John Cowan <xxxxxx@reutershealth.com>