Re: Should SRFI-115 character sets match extended grapheme clusters?
John Cowan 11 May 2014 18:08 UTC
Mark H Weaver scripsit:
> It occurs to me that users of languages that make heavy use of combining
> marks will likely find the behavior of "character sets" to be quite
> unintuitive if they operate on code points.
The way around that is normalization of the input, I think. I will be
proposing a normalization SRFI in future, presumably including the R6RS
normalization procedures and some version of the normalized-comparison
procedures that were rejected from R7RS-small.
> I realize that most languages (including Scheme) treat code points as
> characters, that SRFI-14 character sets are really sets of code points,
> and that most regexp libraries probably do the same thing. However, it
> also seems to me that these are most likely mistakes, with bad
> consequences for the usability of regexps in many languages.
Trying to hide normalization and grapheme clusters under the table runs
into the problem that the definition of grapheme clusters keeps growing.
First there were default GCs (now known as legacy GCs), then there were
extended GCs, and there is still the possibility of tailored GCs for
specific languages or locales. In addition, once we are normalized,
there is typically only one representation of a grapheme cluster, so
it doesn't affect RE processing.
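
To make that concrete, here is a rough sketch in Python rather than
Scheme, since no normalization SRFI exists yet. It assumes NFC as the
normalization form and uses Python's standard re and unicodedata
modules; the names nfc, decomposed and pattern are made up for the
illustration. A character class that works on single code points
fails on decomposed input but matches once the input is normalized:

```python
import re
import unicodedata

def nfc(s):
    """Return s in Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", s)

# "cafe" followed by U+0301 COMBINING ACUTE ACCENT: five code points.
decomposed = "cafe\u0301"

# A character class containing two precomposed code points
# (U+00E9 e-acute and U+00E8 e-grave).
pattern = re.compile("caf[\u00e9\u00e8]")

# Without normalization the class sees a bare "e" and then a
# combining mark, so the match fails.
raw_match = pattern.fullmatch(decomposed) is not None

# After NFC the "e" + U+0301 pair becomes the single code point
# U+00E9, and the code-point-based class matches.
normalized_match = pattern.fullmatch(nfc(decomposed)) is not None

# Caveat: "q" + U+0301 has no precomposed form, so it remains two
# code points even after NFC.
no_precomposed = nfc("q\u0301")

print(raw_match, normalized_match, len(no_precomposed))
```

The last case is why "typically" matters: normalization gives each
cluster a single representation, but not always a single code point.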
Better, I think, to keep REs working on codepoints, the lowest common
denominator, and outsource everything else.
--
John Cowan http://www.ccil.org/~cowan xxxxxx@ccil.org
You annoy me, Rattray! You disgust me! You irritate me unspeakably!
Thank Heaven, I am a man of equable temper, or I should scarcely be able
to contain myself before your mocking visage. --Stalky imitating Macrea