Re: the discussion so far

Re: the discussion so far Jorgen Schaefer 16 Jul 2005 13:58 UTC

Matthew Flatt <xxxxxx@cs.utah.edu> writes:

> So, the `char-ci' operations should use the "simple case folding" table
> from CaseFolding.txt, and the `string-ci' operations should use the
> "full case folding" table from CaseFolding.txt. After folding, the
> comparison result is determined character-by-character.

Codepoint-by-codepoint, yes. (That is what you meant; I just
wanted to clarify. The terminology is a bit confusing, as
"character" is defined differently in Unicode than it is in this
SRFI.)

> Meanwhile, `string-upcase' and `string-downcase' reflect the same
> improved handling at the string level (compared to the character level)
> by using SpecialCasing.txt in addition to UnicodeData.txt.
>
> Have I got that right?

Yes :-)
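
To make the string-level versus character-level difference concrete,
here is a small sketch in Python rather than Scheme. It assumes that
Python's built-in str.casefold() and str.upper() behave like the full
case folding and SpecialCasing.txt mappings discussed above; the
variable names are only illustrative.

```python
# LATIN SMALL LETTER SHARP S: one code point whose full case folding
# and full uppercase mapping are both two code points long.
sharp_s = "\u00DF"

folded = sharp_s.casefold()   # full case folding, as for `string-ci'
upcased = sharp_s.upper()     # full uppercase mapping, as for `string-upcase'

# The results are longer than the input, so a char -> char operation
# such as `char-ci' cannot express them and needs the simple table.
print(len(sharp_s), len(folded), len(upcased))   # prints: 1 2 2

# After folding, comparison is plain codepoint-by-codepoint equality.
same = "STRASSE".casefold() == "stra\u00DFe".casefold()
print(same)   # prints: True
```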

There's one last problem with this approach: It leaves out
normalization.

In Unicode, there are multiple sequences of code points that
represent the same character. For example, the code point
sequences (#\x00C4) and (#\x0041 #\x0308) are equivalent.

00C4  LATIN CAPITAL LETTER A WITH DIAERESIS
0041  LATIN CAPITAL LETTER A
0308  COMBINING DIAERESIS

Normalization maps those sequences to a common form (either to the
composed or the decomposed form) so that comparison can be done on
a codepoint-by-codepoint basis.
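
As an illustration of the example above, the following Python sketch
uses the standard unicodedata module (not anything from this SRFI) to
show that the two sequences differ codepoint-by-codepoint until they
are normalized to a common form.

```python
import unicodedata

composed = "\u00C4"           # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "\u0041\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# Raw comparison works on code points, so the sequences look different.
raw_equal = composed == decomposed

# Normalizing both sides to the composed form (NFC) makes them equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# Normalizing both sides to the decomposed form (NFD) works as well.
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))

print(raw_equal, nfc_equal, nfd_equal)   # prints: False True True
```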

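Luckily, case folding is specified in such a way that a normalized
sequence of code points remains normalized in nearly all cases if
case-folded. (The Unicode standard notes a few edge cases, and for
that reason its definition of canonical caseless matching normalizes
once more after folding.)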

So, to make STRING-CI=? or, indeed, STRING=? work, one option
would be for the SRFI to provide STRING-NORMALIZE-* procedures,
and require normalized strings to be passed to the comparison
procedures for them to work correctly.
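
A rough sketch of what that combination could look like, again in
Python: string_ci_equal is a hypothetical name, not a procedure from
this SRFI, and the normalize/fold/normalize order is borrowed from the
Unicode definition of canonical caseless matching rather than from
anything decided in this discussion.

```python
import unicodedata

def string_ci_equal(a: str, b: str) -> bool:
    """Hypothetical STRING-CI=? that normalizes before comparing."""
    def key(s: str) -> str:
        # Normalize, apply full case folding, then normalize again so
        # the folded result is guaranteed to be in a common form.
        nfd = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", nfd.casefold())
    # The actual comparison is codepoint-by-codepoint.
    return key(a) == key(b)

# Composed upper case versus decomposed lower case.
print(string_ci_equal("\u00C4", "a\u0308"))   # prints: True
# Unrelated strings still compare as different.
print(string_ci_equal("\u00C4", "a"))         # prints: False
```

If the caller is instead required to pass already-normalized strings,
as suggested above, the normalization inside key moves to the caller
and the comparison procedure only has to fold and compare.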

Greetings,
        -- Jorgen

--
((email . "xxxxxx@forcix.cx") (www . "http://www.forcix.cx/")
 (gpg   . "1024D/028AF63C")   (irc . "nick forcer on IRCnet"))