
Re: Surrogates and character representation Shiro Kawai 24 Jul 2005 08:14 UTC

From: "John.Cowan" <xxxxxx@reutershealth.com>
Subject: Re: Surrogates and character representation
Date: Sun, 24 Jul 2005 01:37:13 -0400

> but language/library designers (whose job it is to make corner cases
> unsurprising) do have to think about them.

Yes, but such a library works across different domains.
Suppose the library has a function ucs->utf8.  It accepts a character,
and returns a sequence of octets, e.g.
  (ucs->utf8 #\u3042) => (#xe3 #x81 #x82)
If it returned (#\u00e3 #\u0081 #\u0082) instead, I'd say something is
wrong with it: it mixes up the domain and the range.
The same is true of ucs->utf16: its type should be Char -> [Int16],
and unpaired surrogates appear as Int16 values, not as characters.
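
To make the domain/range distinction concrete, here is a minimal sketch in Python rather than Scheme. The names ucs_to_utf8 and ucs_to_utf16 are hypothetical stand-ins for the ucs->utf8 and ucs->utf16 functions imagined above, not part of any library. Both take a character and return plain integers, never characters.

```python
from typing import List


def ucs_to_utf8(ch: str) -> List[int]:
    """Encode one character as a list of UTF-8 octets (integers 0..255)."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # bytes -> list of ints: the result lives in the integer domain.
    return list(ch.encode("utf-8"))


def ucs_to_utf16(ch: str) -> List[int]:
    """Encode one character as a list of 16-bit UTF-16 code units."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    cp = ord(ch)
    if cp < 0x10000:
        # One code unit. Assumption of this sketch: a lone surrogate given
        # as input is passed through unchanged; whether that should be
        # allowed at all is the open question in this thread.
        return [cp]
    # Supplementary plane: split the 20-bit offset into a surrogate pair.
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
```

With these definitions, ucs_to_utf8("\u3042") gives [0xE3, 0x81, 0x82], and the surrogate values 0xD800 and 0xDC00 show up only as integers in the output of ucs_to_utf16("\U00010000").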

An implementation can have #\ud800, as long as it defines the
behavior of expressions such as (ucs->utf16 #\ud800) or
(string-append "\ud800" "\udc00"), as well as of I/O.  If we put
it in the standard, the standard should define the results of those
expressions.  Do you think there is an agreeable and consistent
definition for handling these "characters"?  If not, it's better
to leave it unspecified.
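
As one data point on how many decisions are involved (this is CPython 3's behavior, shown only as an illustration and not as a proposal for Scheme), a language that admits surrogate code points in strings has to answer each of these questions separately. The helper try_encode is a hypothetical name used only in this sketch.

```python
hi, lo = "\ud800", "\udc00"

# Concatenation: the two lone surrogates stay two separate code points;
# they are not merged into U+10000.
joined = hi + lo


def try_encode(s, errors="strict"):
    """Return the UTF-16-LE bytes for s, or None if the codec rejects it."""
    try:
        return s.encode("utf-16-le", errors)
    except UnicodeEncodeError:
        return None


# Strict encoding refuses a lone surrogate ...
strict_result = try_encode(hi)
# ... but the "surrogatepass" error handler lets it through as a code unit.
lenient_result = try_encode(hi, "surrogatepass")
# Round-tripping the two-code-point string through UTF-16 yields ONE character.
round_trip = try_encode(joined, "surrogatepass").decode("utf-16-le")
```

Here joined has length 2 while round_trip has length 1, so string-append, encoding, and I/O each need their own rule, and the rules interact.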

(BTW, I am using a weird Scheme system that allows such invalid
"characters" in a string; sometimes it is handy, but it is ugly.)

--shiro