Re: Response to SRFI 75. bear 12 Jul 2005 09:14 UTC


On Tue, 12 Jul 2005, Michael Sperber wrote:

>
>bear <xxxxxx@sonic.net> writes:
>
>> Particularly, some characters, particularly accented characters,
>> have uppercase and lowercase versions which are different numbers of
>> codepoints.  Thus, in the "codepoint equals character" model, one
>> case is a character and the other case -- isn't.
>
>I don't quite understand what you're saying: the locale-independent
>case mappings in UnicodeData.txt always map a single scalar value to a
>single scalar value.  Sure it doesn't always do what your locale
>thinks (as you point out), but this case mapping doesn't require
>"multi-codepoint characters."

Okay, after a quick check: the characters that would require
multi-codepoint mappings simply have no case mappings specified
in UnicodeData.txt.  What I was thinking of were characters like
U+FB01 (small ligature fi), which has no corresponding
single-codepoint uppercase.  Finding the uppercase of such a
character is not going to work - but okay, that's the sacrifice
you make for the single codepoint/single character confusion.
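
To illustrate the U+FB01 case, here is a minimal sketch in Python
(not Scheme, and not anything SRFI 75 specifies).  It assumes that
Python's str.upper() applies the Unicode full case mappings; the
helper has_single_codepoint_upper is a hypothetical name invented
for this example.

```python
# U+FB01 LATIN SMALL LIGATURE FI: one code point, no one-code-point uppercase.
ligature = "\ufb01"

# Full case mapping expands the ligature into two code points.
full_upper = ligature.upper()


def has_single_codepoint_upper(ch):
    """Return True if uppercasing one code point gives a different single code point."""
    up = ch.upper()
    return len(up) == 1 and up != ch


print(len(ligature), len(full_upper))
print(has_single_codepoint_upper(ligature))
print(has_single_codepoint_upper("\u00e9"))  # e with acute has a one-to-one uppercase
```

A strictly one-to-one mapping has nothing to return for the
ligature, so it either leaves the character unchanged or needs a
multi-codepoint result.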

>> Sixth, is there any way for a Scheme implementation to support
>> characters and strings in additional encodings different from
>> Unicode and not necessarily subsets of it, and remain compliant?
>
> I don't think so, at least not in the way you envision.  I don't think
> that's necessary or even a good idea, either.  This SRFI effectively
> hijacks the char and string datatypes and says that the abstractions
> for accessing them deal in Unicode.  Any representation that allows
> you to do that---i.e. implement STRING-REF, CHAR->INTEGER, and
> INTEGER->CHAR and so on in a way compatible with the SRFI is fine,
> but I believe you're thinking about representations where that's not
> the case.

Hmmm.  I'm still of the opinion that making the programming
abstraction more closely match the end-user abstraction (i.e.,
with glyph=character rather than codepoint=character) is just
plain better, in many ways.  It gives me the screaming willies
that under Unicode, strings which look identical to the eye
can have different lengths, have no codepoint in common at any
particular index, and sort relative to each other such that
there are an infinite number of unrelated strings that go
between them.  To me, it is the codepoint=character model that
is introducing representation artifacts, and the glyph=character
model comes a lot closer to avoiding them.
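
A minimal sketch of that complaint, again in Python rather than
Scheme, using the standard unicodedata module.  The sample strings
are chosen only for illustration, and the sort shown is plain
codepoint order, not any locale-aware collation.

```python
import unicodedata

composed = "\u00e9"                                  # e-acute as one code point
decomposed = unicodedata.normalize("NFD", composed)  # "e" followed by U+0301

# Both render as the same glyph, yet the code point sequences differ.
lengths = (len(composed), len(decomposed))
same_first_codepoint = composed[0] == decomposed[0]

# Under codepoint ordering, unrelated strings sort between the two spellings.
ordered = sorted([composed, "f", "zebra", decomposed])

# Normalizing both to one form makes them compare equal again.
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(lengths, same_first_codepoint, equal_after_nfc)
```

Normalizing both strings to the same form before comparing removes
the mismatch, but the programmer has to remember to do it.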

But we've been there, and I've talked about that, at length.
People seem determined to do it this way, and people with
other languages seem to be doing it mostly this way too. I'm
convinced that requiring the "wrong" approach in a way that
outlaws a better one is a mistake, but I'm realistic enough by
now to know that nobody else is going to be convinced.

Also, I'm not entirely happy about banning characters and
character sets that aren't subsets of Unicode.  In the first
place, there are a lot of characters that aren't in Unicode
and are likely never to be - ask a Chinese person to write
his own address without using one and you'll begin to see
the problem.  And in the second place, characters have
traditionally been used to describe a lot of non-character
entities - and while some of these come through as control codes,
others, including the very useful keystroke-description codes
from, e.g., MIT Scheme, simply don't.

				Bear