Re: Should SRFI-115 character sets match extended grapheme clusters? Mark H Weaver (11 May 2014 19:55 UTC)

Re: Should SRFI-115 character sets match extended grapheme clusters? Mark H Weaver 11 May 2014 19:53 UTC

John Cowan <xxxxxx@mercury.ccil.org> writes:

> Mark H Weaver scripsit:
>
>> It occurs to me that users of languages that make heavy use of combining
>> marks will likely find the behavior of "character sets" to be quite
>> unintuitive if they operate on code points.
>
> The way around that is normalization of the input, I think.

Normalization is an important part of the solution, but by itself it
does not solve the problem when no precomposed character exists.
Figure 5 of TR15 gives some examples where NFC produces more than one
code point per character.
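
To make that concrete, here is a minimal sketch. It assumes an R6RS
implementation that provides string-normalize-nfc (from (rnrs unicode));
the helper name nfc-length and the variable names are only for
illustration. It counts the code points left after NFC for a "d" and a
"q", each followed by COMBINING DOT BELOW and COMBINING DOT ABOVE.

```scheme
(import (rnrs))

;; Hypothetical helper: number of code points left after NFC.
(define (nfc-length s)
  (string-length (string-normalize-nfc s)))

;; d + COMBINING DOT BELOW (U+0323) + COMBINING DOT ABOVE (U+0307)
(define d-dots
  (string #\d (integer->char #x0323) (integer->char #x0307)))

;; q + COMBINING DOT BELOW (U+0323) + COMBINING DOT ABOVE (U+0307)
(define q-dots
  (string #\q (integer->char #x0323) (integer->char #x0307)))

;; d and U+0323 compose to U+1E0D, but U+0307 stays separate.
(display (nfc-length d-dots)) (newline)   ; expected: 2

;; No precomposed form of q with a dot exists, so nothing composes.
(display (nfc-length q-dots)) (newline)   ; expected: 3
```

In both cases what a user sees as one character is still several code
points after normalization.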

The question then becomes: Do we want ("ḍ̇q̣̇") to mean (or "ḍ̇" "q̣̇") or
should it mean (or "ḍ" "\x0307;" "q" "\x0323;" "\x0307;")?  It's a
question of how the string is split into elements.

There's also the question of whether (regexp-extract '(~ ("-")) "q̣̇")
should return ("q̣̇") or ("q" "\x0323;" "\x0307;").
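
The two ways of splitting can be sketched as follows. This is only an
approximation, not the extended grapheme cluster algorithm of UAX #29:
it attaches each combining mark (general category Mn, Mc or Me) to the
character before it. It assumes R6RS char-general-category; the
procedure names split-clusters and split-code-points are made up for
this example.

```scheme
(import (rnrs))

;; True if c is a combining mark (general category Mn, Mc or Me).
(define (combining-mark? c)
  (memq (char-general-category c) '(Mn Mc Me)))

;; Split s into one string per code point.
(define (split-code-points s)
  (map string (string->list s)))

;; Split s into strings, each holding a base character followed by
;; its combining marks.  Approximation only, see the note above.
(define (split-clusters s)
  (let loop ((chars (string->list s)) (current '()) (acc '()))
    (cond
      ((null? chars)
       (reverse (if (null? current)
                    acc
                    (cons (list->string (reverse current)) acc))))
      ((and (pair? current) (combining-mark? (car chars)))
       ;; Attach the mark to the cluster being built.
       (loop (cdr chars) (cons (car chars) current) acc))
      (else
       ;; Start a new cluster, flushing the previous one if any.
       (loop (cdr chars)
             (list (car chars))
             (if (null? current)
                 acc
                 (cons (list->string (reverse current)) acc)))))))

;; q + COMBINING DOT BELOW (U+0323) + COMBINING DOT ABOVE (U+0307)
(define q-dots
  (string #\q (integer->char #x0323) (integer->char #x0307)))

(display (length (split-code-points q-dots))) (newline) ; expected: 3
(display (length (split-clusters q-dots))) (newline)    ; expected: 1
```

The first split corresponds to the ("q" "\x0323;" "\x0307;") result,
the second to the single-element result.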

> I will be proposing a normalization SRFI in future, presumably
> including the R6RS normalization procedures and some version of the
> normalized-comparison procedures that were rejected from R7RS-small.

Sounds useful.

     Mark