

Re: character strings versus byte strings  (bear, 23 Dec 2003 01:09 UTC)


On Mon, 22 Dec 2003, Thomas Bushnell, BSG wrote:

>Matthew Flatt <xxxxxx@cs.utah.edu> writes:
>
>>  * For Scheme characters, pick a specific encoding, probably one of
>>    UTF-16, UTF-32, UCS-2, or UCS-4 (but I don't know which is the right
>>    choice).
>
>Wrong.  A Scheme character should be a codepoint.  The representation
>of code points as sequences of bytes should be under the hood.

I'm using a homebrewed Scheme system where the character set is infinite;
char->integer may return a bignum.

Each character is a Unicode codepoint plus a non-defective sequence of
Unicode combining codepoints. The Unicode documentation refers to these
entities as "graphemes."

MIT Scheme uses a 13-bit character set: 8-bit ASCII plus 5 buckybits.
They have characters running around in their set that have nothing to
do with Unicode.

I figure I'm going to wind up doing translation no matter what, because
C just isn't capable of hiding the differences between character sizes
correctly.  But I'm not going to give up grapheme-characters, because I
strongly feel that they are the "Right Thing."  And at some point I may
add buckybits just for the hell of it.

My point is that it does no good to assume anything about a Scheme's
internal representation of characters. Some Schemes are going to deal with
an infinite character set, not limited at all to Unicode codepoints.

So maybe you should pay some attention to cases where there's no
corresponding character in Unicode (MIT Scheme character "super-meta-J") or
where the Unicode correspondence to a Scheme character is multiple
codepoints (grapheme-character "Latin Capital Letter A/Ring Above/Accent
Grave").

				Bear