Re: character strings versus byte strings

Re: character strings versus byte strings tb@xxxxxx 22 Dec 2003 22:21 UTC
Tom Lord <xxxxxx@emf.net> writes:

>     > Wrong.  A Scheme character should be a codepoint.  The representation
>     > of code points as sequences of bytes should be under the hood.
>
> Misleading.
>
> It isn't obvious that Scheme characters should be _Unicode_
> codepoints.  For (much) more inclusive definitions of "codepoint",
> that characters should be codepoints is tautologically true.

Fair enough, though I think Unicode is the best choice at present.  It
might be perfectly fine to leave that unspecified too.  (If you don't
want to specify even Unicode, then you certainly can't specify UTF-8!)

> There's a serious problem regarding Scheme and Unicode in that, for
> any sane definition of "character" in Unicode, the character type in
> R5RS is not sanely isomorphic.

I think the problem is that the R5RS character functions are too
simplistic, most notably the case-mapping functions.

Case-mapping is a locale-dependent task; however difficult that may
make the world, it's a fact of the world.  Many, many computer
systems used to get away with ignoring the locale-dependency of
case-mapping, but they can no longer plead ignorance.  (And the
problems are hardly obscure; even German causes problems.)
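
To make the German case concrete, here is a small sketch.  It is in
Python rather than Scheme, purely as an assumption of convenience:
Python strings are sequences of codepoints, and its str.upper() and
str.lower() follow the default (locale-independent) Unicode case
mappings.  The variable names are illustrative only.

```python
# German sharp s: the default Unicode uppercase mapping is the two
# letters "SS", so no char -> char function can express it.
word = "stra\u00dfe"                 # "straße"
upper = word.upper()                 # uppercasing changes the length
sharp_s_upper = "\u00df".upper()     # one character in, two out

# Turkish distinguishes dotted and dotless i.  Python's str.lower()
# ignores the locale, so it always gives the non-Turkish answer.
default_lower_I = "I".lower()
turkish_lower_I = "\u0131"           # dotless i, what a Turkish locale wants

print(word, "->", upper)
print(len(word), "codepoints ->", len(upper), "codepoints")
print(repr(default_lower_I), "vs Turkish", repr(turkish_lower_I))
```

The point is only that a case-mapping function from one character to
one character, with no locale argument, cannot give the right answer
in either case.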

I would like to see Scheme DTRT, which means not creating a foolish
oversimplification.  We have finally gotten away from oversimplifying
numbers; it's time to stop oversimplifying characters too.

We are stuck with R5RS at present, but we should at least not make
things worse.

Ok, off that soapbox:

I am happy to let others hash out the actual topic of this SRFI.  My
concern is that the SRFI not start constraining Scheme in a bad way,
and if you start saying things like "Scheme strings are UTF-8", I
start to get *really* nervous that someone is going to start making a
single codepoint take up multiple elements in a Scheme string.
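
A sketch of the distinction I am worried about, again in Python as an
assumed stand-in (its str is indexed by codepoint, its bytes by
octet).  The sample string and names are made up for illustration.

```python
s = "na\u00efve \u03bb"          # "naïve λ": two non-ASCII codepoints

# Character-string view: one element per codepoint.
codepoints = [ord(c) for c in s]

# Byte-string view: the UTF-8 encoding, one element per octet.
utf8 = s.encode("utf-8")

print(len(s), "codepoints,", len(utf8), "bytes")

# Index 2 names a whole character in the first view ...
third_char = s[2]
# ... but only the first octet of that character in the second.
third_byte = utf8[2]

print(repr(third_char), hex(third_byte))
```

If string-ref behaved like the second view, index 2 would hand back
half of a character, which is exactly the outcome to avoid.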

Thomas