Re: character strings versus byte strings

Tom Lord 22 Dec 2003 22:30 UTC
    > From: xxxxxx@becket.net (Thomas Bushnell, BSG)

    > Matthew Flatt <xxxxxx@cs.utah.edu> writes:

    > >  * For Scheme characters, pick a specific encoding, probably one of
    > >    UTF-16, UTF-32, UCS-2, or UCS-4 (but I don't know which is the right
    > >    choice).

    > Wrong.  A Scheme character should be a codepoint.  The representation
    > of code points as sequences of bytes should be under the hood.

Misleading.

It isn't obvious that Scheme characters should be _Unicode_
codepoints.  Under a (much) more inclusive definition of "codepoint",
the claim that characters should be codepoints is tautologically true.

There's a serious problem with Scheme and Unicode: under any sane
definition of "character" in Unicode, the character type in R5RS is
not sanely isomorphic to it.

I think that the best way to handle that in an FFI is to try to remain
agnostic about the range of the Scheme CHAR? type when mapped into C.
I _guess_ that the error-signalling-on-range-error property of
SCHEME_EXTRACT_CHARACTER satisfies this, but it could certainly be
rounded out and made more useful.
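
To make "agnostic about the range" concrete, here is a rough sketch in
C.  It is _not_ the draft's SCHEME_EXTRACT_CHARACTER.  The names
extract_as_c_char and extract_status are hypothetical, and so is the
assumption that a CHAR? value reaches C as some unsigned integer.  The
only point is that the FFI promises nothing about how large that
integer can be, and reports a range error when the value does not fit
the C type the caller asked for.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; not taken from any SRFI or real FFI. */
enum extract_status { EXTRACT_OK = 0, EXTRACT_RANGE_ERROR = -1 };

/* Hypothetical extraction routine.  `value` stands for whatever integer
   the implementation uses for a CHAR? value; no upper bound is assumed.
   On success the value is stored in *out.  If it does not fit in a C
   `char`, *out is left untouched and a range error is reported. */
enum extract_status extract_as_c_char(uint32_t value, char *out)
{
    assert(out != NULL);
    if (value > (uint32_t)CHAR_MAX)
        return EXTRACT_RANGE_ERROR;
    *out = (char)value;
    return EXTRACT_OK;
}
```

A caller that wants a wider C type would use a sibling routine with a
larger bound.  Either way the C code never has to know what the
implementation's CHAR? range actually is, only whether this particular
value fits.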

-t