Re: Surrogates and character representation Shiro Kawai 24 Jul 2005 08:14 UTC
From: "John.Cowan" <xxxxxx@reutershealth.com>
Subject: Re: Surrogates and character representation
Date: Sun, 24 Jul 2005 01:37:13 -0400

> but language/library designers (whose job it is to make corner cases
> unsurprising) do have to think about them.

Yes, but such a library works across two different domains. Suppose the library has a function ucs->utf8. It accepts a character and returns a sequence of octets, e.g.

  (ucs->utf8 #\u3042) => (#xe3 #x81 #x82)

If it returned (#\u00e3 #\u0081 #\u0082), I'd say something is wrong with it: it mixes up the domain and the range.

The same is true of ucs->utf16: its type should be Char -> [Int16], and unpaired surrogates appear as Int16 values.

An implementation can have #\ud800, as long as it defines the behavior of expressions such as (ucs->utf16 #\ud800) or (string-append "\ud800" "\udc00"), as well as I/O. If we have it in the standard, the standard should give definitions for those expressions. Do you think there's an agreeable and consistent definition for handling these "characters"? If not, it's better to leave it unspecified.

(BTW, I am using a weird Scheme system that allows such invalid "characters" in a string; sometimes it is handy, but it is ugly.)

--shiro
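
The Char -> [Int16] typing of ucs->utf16 described above can be made concrete with a minimal sketch. It assumes an R7RS-style `error` procedure (or SRFI 23). The helper name code-point->utf16 is hypothetical and used only for illustration. The range (code units) is represented as plain exact integers, so a surrogate value only ever appears on the integer side. What (char->integer #\ud800) does, or whether that character exists at all, is left to the implementation, which is exactly the open question.

```scheme
;; Sketch only: code-point->utf16 is a hypothetical helper name,
;; not a procedure from any SRFI or Scheme implementation.

;; Integer code point -> list of 16-bit code units (exact integers).
(define (code-point->utf16 cp)
  (cond ((or (< cp 0) (> cp #x10FFFF))
         (error "not a Unicode code point" cp))
        ((< cp #x10000)
         ;; BMP: a single code unit. A value such as #xD800 passes
         ;; through here as an integer, never as a character.
         (list cp))
        (else
         ;; Supplementary planes: split the 20-bit offset into a
         ;; high (leading) and low (trailing) surrogate code unit.
         (let ((v (- cp #x10000)))
           (list (+ #xD800 (quotient v #x400))
                 (+ #xDC00 (remainder v #x400)))))))

;; Char -> [Int16]: the domain is characters, the range is integers.
(define (ucs->utf16 c)
  (code-point->utf16 (char->integer c)))
```

With these definitions, (code-point->utf16 #x3042) evaluates to (#x3042), and (code-point->utf16 #x1D11E) evaluates to (#xd834 #xdd1e). The result is a list of integers in both cases, so the domain and the range stay separate.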