Re: character strings versus byte strings Tom Lord (22 Dec 2003 22:36 UTC)
> From: xxxxxx@becket.net (Thomas Bushnell, BSG)

> Tom Lord <xxxxxx@emf.net> writes:

> > > Wrong. A Scheme character should be a codepoint. The representation
> > > of code points as sequences of bytes should be under the hood.

> > Misleading.

> > It isn't obvious that Scheme characters should be _Unicode_
> > codepoints. For (much) more inclusive definitions of "codepoint",
> > that characters should be codepoints is tautologically true.

> Fair enough, though I think Unicode is the best choice at present. It
> might be perfectly fine to leave that agnostic too. (If you don't
> want specify even Unicode, then you certainly can't specify UTF-8!)

You slightly misunderstand.

First of all, I agree that encoding schemes have no relation to the
char type. There should be nothing, say, UTF-8- or UTF-16-specific
about the char type.

Second of all: I agree that Unicode is the best choice. I'd say it is
the only realistic choice. I'd even say that it is a pleasant choice
since Unicode is basically very well designed (excuse me a second
while I duck the rotten tomatoes).

The problem is that _given_unicode_, there is _still_ no definition of
"character" that simultaneously makes sense for both the Scheme CHAR?
type and from a Unicode perspective. It's a dainty task, at best, to
avoid reflecting that bogosity in the FFI.

> > There's a serious problem regarding Scheme and Unicode in that, for
> > any sane definition of "character" in Unicode, the character type in
> > R5RS is not sanely isomorphic.

> I think there is a problem in that the R5RS character functions are
> simply too simplistic, most notably in the case-mapping functions.

Right. CHAR? necessarily has to come out as a very low-level type. A
high-level interface is going to wind up being all about strings,
where some strings are kind of "character-like" in some way or other.

One problem I see is that implementations with different purposes will
want to make the CHAR? type quite different from one another.

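To illustrate why a character-to-character case-mapping function is too simple, here is a minimal sketch in Python (used purely for illustration; it is not Scheme and not part of any SRFI proposal). It assumes Python 3's built-in `str.upper`, which applies the locale-independent full Unicode case mapping. The helper name `upcase_per_char` and the sample word are hypothetical.

```python
def upcase_per_char(s):
    """Uppercase each code point on its own, keeping the results separate."""
    return [ch.upper() for ch in s]


word = "stra\u00dfe"  # "strasse" spelled with U+00DF LATIN SMALL LETTER SHARP S

pieces = upcase_per_char(word)

# Collect the code points whose uppercase form is not exactly one code point.
# Any such entry cannot be expressed by a function of type char -> char.
expanding = [(ch, up) for ch, up in zip(word, pieces) if len(up) != 1]

# Uppercasing the whole string is well defined, but the result is longer
# than the input, so it is naturally a string -> string operation.
whole = word.upper()
```

The one expanding entry here is the sharp s, which is why the mapping fits more comfortably at the string level than at the level of a single CHAR? value.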
For reasons I'm not yet getting into detail about here, I think that
ultimately Scheme's CHAR? and STRING? types are doomed and that we're
going to have to leave them underspecified and eventually unimportant
(in favor of a new TEXT? type).

> Case-mapping is a locale-dependent task;

Yes and no. There is a locale-independent definition for it that is
useful.

> however difficult that may make the world, it's a fact of the
> world.

If I detach that sentence fragment from its context, I think it would
serve well as an informal axiom for any discussion regarding Unicode.

> Many many many computer systems could get away with
> ignoring the locale-dependency of case-mapping, but now they can
> no longer plead ignorance. (Though the problems are hardly
> obscure; even German causes problems.)

(I think that, being a culturally unbiased person, you mean that
German causes one _unique_ problem regarding case mapping.)

> I would like to see Scheme DTRT, which means not creating a
> foolish oversimplification. We have finally gotten away from
> oversimplifying numbers; it's time to stop oversimplifying
> characters too.

Hear, hear, cheers, and happy holidays. Now, to what extent do we want
the SRFI-50 process to become that battleground vs. to what extent do
we want it to step lightly around the issue :-)

> We are stuck with R5RS at present, but we should at least not make
> things worse.

!

> I am happy to let others hash out the actual topic of this SRFI. My
> concern is that the SRFI not start constraining Scheme in a bad
> way,

!!

> and if you start saying things like "Scheme strings are UTF-8", I
> start to get *really* nervous that someone is going to start making a
> single codepoint take up multiple elements in a Scheme string.

!!!

-t
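The worry about a single codepoint occupying multiple string elements can be made concrete with a small Python sketch (again only an illustration, not a statement about any Scheme implementation). It assumes Python 3, where a `str` is indexed by code point and `bytes` is indexed by byte. The sample string is hypothetical.

```python
# Three code points: 'a', U+03BB GREEK SMALL LETTER LAMDA,
# and U+1D11E MUSICAL SYMBOL G CLEF (outside the Basic Multilingual Plane).
s = "a\u03bb\U0001D11E"

code_points = [ord(ch) for ch in s]

# The same text under two encoding schemes.
utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-le")

n_code_points = len(s)           # elements when indexing by code point
n_utf8_units = len(utf8)         # elements when indexing by UTF-8 byte
n_utf16_units = len(utf16) // 2  # elements when indexing by 16-bit unit

# Indexing by code point yields a whole character; indexing the UTF-8 bytes
# at the same position yields only the lead byte of a multi-byte sequence.
second_char = s[1]
second_byte = utf8[1]
```

If string elements were UTF-8 bytes, element 1 of this string would be a fragment of the lambda, not a character, which is the situation the quoted paragraph is nervous about.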