case mappings, character sets & character properties Alex Shinn (10 Feb 2004 03:21 UTC)
Re: case mappings, character sets & character properties bear (10 Feb 2004 04:56 UTC)


On Tue, 10 Feb 2004, Alex Shinn wrote:

>First, a few notes on terminology.
>
>  "Letter," by all standard definitions and consistent with Unicode
>  usage, specifically refers to an element of an alphabet.  It therefore
>  would not apply to syllabic or ideographic characters.
>
>  "Ideograph" applied to all Han characters is technically incorrect.
>  Linguists prefer the term "sinogram" which refers to Chinese-derived
>  characters.  "Sinogram" fits all uses being applied to the term
>  "ideograph" in these discussions (at least until Unicode adds
>  hieroglyphs).  Since the usage of ideograph is fairly ubiquitous,
>  however, it may not be worth fighting it.

Hm, okay.  Duly noted, I will try to stop misusing the term.

>The concept of case is orthogonal to being alphabetic.  There are
>alphabetic characters with no case, and (Unicode-classified) symbols
>which are given case mappings such as Circled-A (U+24B6).

Argh. Yes.  Thank you.
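
A small sketch of that orthogonality, assuming only the R5RS procedures
char-upcase and char-downcase.  The helper has-case? is a hypothetical
name, not something from R5RS or any SRFI; it asks whether a character
has a case mapping at all and never consults char-alphabetic?.

```scheme
;; has-case? is a hypothetical helper, not part of R5RS or any SRFI.
;; A character "has case" here if upcasing and downcasing it give
;; different results; char-alphabetic? is never consulted.
(define (has-case? c)
  (not (char=? (char-upcase c) (char-downcase c))))

(has-case? #\a)   ; a cased letter
(has-case? #\1)   ; a digit, which has no case mapping
```

On an implementation with Unicode-aware case mappings, the same test
applied to (integer->char #x24B6) should presumably succeed for
Circled-A as well, however that character happens to be classified.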

>Defining anything in terms of character level case procedures seems like
>a bad idea, since any individual character can map to 0-3 characters
>(German eszett, although the most famous, is not by any means the only
>exception here).

Characters which casemap to characters outside the single-codepoint
character set are not a problem for me since my characters aren't
limited to a single codepoint.  I'm mostly here trying to avoid
getting the Right Thing defined out of existence in favor of kluges
and hacks designed to accommodate the shortcomings of single-codepoint
character sets.

Eszett is unique (and the only case where I share this problem with
schemes having only a single-codepoint character set) in that it case
maps not just to a multi-codepoint character, but to multiple separate
characters!
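
To make the eszett case concrete, here is a minimal sketch of an
upcasing procedure whose result may be longer than its input.  The
names special-upcase, char-upcase->string and full-string-upcase are
hypothetical, and the table lists only eszett; a real implementation
would presumably draw its entries from the Unicode data files.

```scheme
;; Hypothetical table of characters whose uppercase form is a string
;; of several characters.  Only eszett (U+00DF, decimal 223) is listed.
(define special-upcase
  (list (cons (integer->char 223) "SS")))

;; Uppercase one character, returning a string so that the result
;; may be longer than one character.
(define (char-upcase->string c)
  (let ((entry (assv c special-upcase)))
    (if entry
        (cdr entry)
        (string (char-upcase c)))))

;; Uppercase a whole string by concatenating the per-character results.
(define (full-string-upcase s)
  (apply string-append (map char-upcase->string (string->list s))))

;; "strasse" spelled with an eszett, built without relying on how the
;; reader handles non-ASCII source text.
(define strasse (string #\s #\t #\r #\a (integer->char 223) #\e))

(full-string-upcase strasse)   ; => "STRASSE", 7 characters from 6
```

The point of returning a string from the per-character step is that no
character-to-character procedure can express this mapping at all.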

> For Schemes that wish to provide a full linguistic folding of
> identifiers, you definitely want some sort of locale-neutral
> folding.  I posted the general possible combinations on
> c.l.s. earlier.  Unicode does define locale-neutral case-foldings
> which are a subset of those combinations - they break it down into
> whether or not you unify Turkish i (and ignore other accent marks),
> and whether or not to allow folding to more than one character (as
> an optimization).  The "one-character" folding seems fairly
> arbitrary and undesirable if you're going the whole hog anyway.

True.  The fundamental relationship that must hold seems to be that
two symbols foo and bar will be read as the same identifier if and
only if:

(string=? (symbol->string foo)
          (symbol->string bar)) => #t

Looks so simple, doesn't it?  It turns out we've got a lot more going
on in terms of dependent properties and definitions.
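
As a minimal sketch of the relation itself, with same-identifier? as a
hypothetical name for it, and with the symbols built through
string->symbol so that the reader's own case handling stays out of
the picture:

```scheme
;; same-identifier? is a hypothetical name for the relation above.
(define (same-identifier? foo bar)
  (string=? (symbol->string foo)
            (symbol->string bar)))

;; string->symbol is used so that the reader's case handling
;; does not affect the example.
(define x (string->symbol "foo"))
(define y (string->symbol "foo"))
(define z (string->symbol "bar"))

(same-identifier? x y)   ; => #t
(same-identifier? x z)   ; => #f
(eq? x y)                ; => #t, symbols with the same name are identical
```

Everything hard is hidden in what the reader does to the text before
it ever calls string->symbol.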

>Core Unicode character properties can be provided as SRFI-14 char-sets.

Agreed.  I'd recommend adding one char-set to the list, charset:1code,
the set of characters that can be represented as single Unicode
codepoints.

>(Unicode Character Database).  The question remains how to handle R5RS
>character predicates related to these values:

>  * char-alphabetic? char
>  * char-numeric? char
>  * char-whitespace? char   ; rename to char-white-space? please!
>  * char-upper-case? char   ; rename to char-uppercase? please!
>  * char-lower-case? char   ; rename to char-lowercase? please!

:-)  So that bugs you, too, huh?  I agree, those predicates ought
to be renamed. I also agree that there's some question about how
to handle the predicates. I think the correct response is to simply
drop the requirement that char-alphabetic? have anything to do with
the case predicates.

				Bear