Re: character strings versus byte strings
Tom Lord 22 Dec 2003 22:30 UTC
> From: xxxxxx@becket.net (Thomas Bushnell, BSG)
> Matthew Flatt <xxxxxx@cs.utah.edu> writes:
> > * For Scheme characters, pick a specific encoding, probably one of
> > UTF-16, UTF-32, UCS-2, or UCS-4 (but I don't know which is the right
> > choice).
> Wrong. A Scheme character should be a codepoint. The representation
> of code points as sequences of bytes should be under the hood.
Misleading.
It isn't obvious that Scheme characters should be _Unicode_
codepoints. Under a (much) more inclusive definition of "codepoint",
the claim that characters should be codepoints is tautologically true.
There's a serious problem regarding Scheme and Unicode: for any sane
definition of "character" in Unicode, the character type in R5RS is
not sanely isomorphic to it.
I think that the best way to handle that in an FFI is to try to remain
agnostic about the range of the Scheme CHAR? type when mapped into C.
I _guess_ that the error-signalling-on-range-error property of
SCHEME_EXTRACT_CHARACTER satisfies this, but it could certainly be
rounded out and made more useful.
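A rough sketch of what range-agnostic extraction might look like on the
C side. Every name below is hypothetical and invented for illustration;
none of it is taken from the draft, and the real
SCHEME_EXTRACT_CHARACTER interface may well differ. The one assumption
is that the implementation can hand C some unsigned integer for a
character, with no promise about how large it gets.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; not part of any SRFI or draft. */
enum extract_status { EXTRACT_OK = 0, EXTRACT_RANGE_ERROR = 1 };

/*
 * Hypothetical helper: narrow a Scheme character, given here as an
 * implementation-defined unsigned integer, to a C unsigned char.
 * It assumes nothing about the range of scheme_char; it only checks
 * whether this particular value fits in the requested C type.
 */
enum extract_status
extract_char_checked(uint_least32_t scheme_char, unsigned char *out)
{
    assert(out != NULL);
    if (scheme_char > UCHAR_MAX)
        return EXTRACT_RANGE_ERROR;  /* caller decides how to signal it */
    *out = (unsigned char)scheme_char;
    return EXTRACT_OK;
}
```

On failure the output is left untouched, so the caller can raise
whatever error the FFI prescribes. "Rounding it out" could then mean
offering the same check for wider C types rather than fixing one
encoding.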
-t