On Sun, May 11, 2014 at 10:49 PM, Mark H Weaver <xxxxxx@netris.org> wrote:
> Hello all,
>
> It occurs to me that users of languages that make heavy use of combining
> marks will likely find the behavior of "character sets" to be quite
> unintuitive if they operate on code points.  For example, they might
> reasonably expect ("éè") to match either of two graphemes, and never to
> match a bare 'e' or a bare combining mark.  They might also expect
> (~ ("aeiou")) to match "é", even when represented as multiple code
> points.

An excellent point.  I'm inclined to agree that we should handle
grapheme clusters naturally, though it's not completely clear that
this is always the right thing.  Users may want to create character
sets containing combining characters and have them matched
individually, not composed with any preceding character (though the
workaround here would be to include them as characters, not inside
string literals).
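The ambiguity here comes from Unicode itself: "é" can be a single precomposed code point or a base letter plus a combining mark, and a code-point-based char-set only ever sees the individual code points.  A small Python sketch (not the Scheme reference implementation) of the surprising match Mark describes:

```python
import unicodedata

# "é" as a single precomposed code point vs. base letter + combining mark.
composed = "\u00e9"      # é (U+00E9 LATIN SMALL LETTER E WITH ACUTE)
decomposed = "e\u0301"   # e + U+0301 COMBINING ACUTE ACCENT

# Visually identical graphemes, but different code-point sequences.
assert composed != decomposed
assert len(composed) == 1 and len(decomposed) == 2

# NFC normalization maps the decomposed form back to the precomposed one.
assert unicodedata.normalize("NFC", decomposed) == composed

# A code-point-based char-set like ("aeiou") sees the bare 'e' at the
# start of the decomposed form -- the unintuitive match in question.
assert decomposed[0] in "aeiou"
```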

Regardless, I think for practical reasons we're stuck with code points
for now.  To my knowledge no implementations currently handle
grapheme-cluster-aware char-sets, and the overhead is potentially
substantial.  Even simple mappings of large Unicode char-sets can be
expensive to compute: until the reference implementation optimized
known case-insensitive char-sets, I believe (w/nocase letter) took
over a minute to iterate over all 10k+ Letter code points, look up all
their case variants, and insert them into a new set.  So I'd consider
this an open area for improvement.
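To see where that cost comes from, here is a hypothetical Python sketch (not the Scheme reference implementation) of the naive construction: walk every code point, keep those in a Letter category, and insert each one's case variants one at a time.  The per-code-point category and case lookups are what dominate.

```python
import sys
import unicodedata

# Naive case-insensitive char-set construction: one category lookup
# and two case lookups per code point, with a set insertion for each.
# (Sketch only; str.lower()/upper() can also return multi-character
# strings for special casings, which a real implementation must handle.)
letters = set()
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    if unicodedata.category(ch).startswith("L"):
        letters.add(ch)
        letters.add(ch.lower())
        letters.add(ch.upper())

print(len(letters))
```

Doing this lazily, or precomputing the result for known sets like (w/nocase letter), avoids repeating tens of thousands of lookups at char-set construction time.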

We could make room for this, in the same way that we allow but
don't require w/nocase to handle non-1-to-1 mappings.  The concern,
again, is how we distinguish ("éè") written with decomposed characters
from the character set where the user wants to match 'e' and the two
combining characters separately, and how we translate that distinction
to PCRE.

-- 
Alex