Re: Should SRFI-115 character sets match extended grapheme clusters?
John Cowan 11 May 2014 21:39 UTC
Mark H Weaver scripsit:
> Normalization is an important part of the solution, but it alone does
> not solve the problem where no precomposed character exists. Figure 5
> of TR15 gives some examples where NFC produces more than one codepoint
> per character.
Ah, I understand now. The trouble is that normalization of a char-set
pattern causes it to mean something completely different. Thus ("á")
(i.e. ("\xE1;")) matches the character #\xE1, whereas ("á")
(i.e. ("a\x301;")), although canonically equivalent to it, matches the
disjunction of #\x61 and #\x301. They will never match the same thing,
which is counterintuitive. Unfortunately, I don't see what can be done
about this other than to issue stern warnings in the documentation.
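
To make the mismatch concrete, here is a small sketch in Python rather than
Scheme, using only the standard unicodedata module. It treats a char-set
built from a string as the plain set of that string's code points, which
is an assumption made for illustration, not a statement about any SRFI-115
implementation.

```python
import unicodedata

precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE, one code point
decomposed = "a\u0301"   # "a" followed by COMBINING ACUTE ACCENT

# The two strings are canonically equivalent: NFC maps both to U+00E1.
equivalent = unicodedata.normalize("NFC", decomposed) == precomposed

# Assumption: a char-set made from a string is the set of its code points.
set_precomposed = set(precomposed)   # contains only U+00E1
set_decomposed = set(decomposed)     # contains U+0061 and U+0301

# The sets share no member, so no single character is matched by both.
disjoint = set_precomposed.isdisjoint(set_decomposed)
```

Both `equivalent` and `disjoint` come out true: the strings are equivalent
as text but unrelated as character sets.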
Alex, do you think you can make a w/norm or norm-char-set SRE pattern
work? It would mean transforming a charset pattern containing "a\x301;"
to one that contains "\xE1;", and also transforming a pattern containing
"f\x301;" (which has no precomposed form) into (seq #\f #\x301).
In the general case it would produce an alternation of sequences,
and would have to normalize the part of the text being matched as well
(unless it comes in two flavors, one for NFD and the other for NFC).
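
A rough sketch of that transformation, again in Python, may help pin down
what is being asked for. The names norm_char_set and match_at are
hypothetical, and the grouping rule (attach every character with a nonzero
canonical combining class to the preceding base) is a simplification, not
full grapheme-cluster segmentation. It covers only the NFC flavor and
assumes the text being matched has already been normalized to NFC.

```python
import unicodedata

def norm_char_set(members):
    """Hypothetical: split a char-set string into single characters and
    multi-code-point sequences, after NFC-normalizing each cluster."""
    clusters = []
    for ch in members:
        # Simplification: a nonzero combining class joins the previous base.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    singles, sequences = set(), []
    for cluster in clusters:
        composed = unicodedata.normalize("NFC", cluster)
        if len(composed) == 1:
            singles.add(composed)        # e.g. "a\u0301" becomes "\u00e1"
        elif composed not in sequences:
            sequences.append(composed)   # e.g. "f\u0301" stays two code points
    return singles, sequences

def match_at(text, i, singles, sequences):
    """Hypothetical: return the end index of a match at text[i:], or None.
    The text is assumed to be in NFC already."""
    for seq in sequences:                # try the longer alternatives first
        if text.startswith(seq, i):
            return i + len(seq)
    if i < len(text) and text[i] in singles:
        return i + 1
    return None

singles, sequences = norm_char_set("a\u0301f\u0301")
```

Here singles holds the one character U+00E1 and sequences holds the
two-code-point string "f\x301;", which corresponds to the alternation of
a plain char-set and (seq #\f #\x301) described above.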
--
John Cowan http://www.ccil.org/~cowan xxxxxx@ccil.org
We pledge allegiance to the penguin and to the intellectual property
regime for which he stands, one world under Linux, with free music
and open source software for all. --Julian Dibbell on Brazil, edited