Re: the "Unicode Background" section Thomas Lord 22 Jul 2005 03:28 UTC

At Thu, 21 Jul 2005 15:45:34 -0700, Thomas Lord wrote:
>> If CHARs are codepoints, more basic Unicode algorithms translate
>> into Scheme cleanly.

> I don't see what you mean. Can you provide an example?

How about: Emitting a UTF-16 encoded stream of the contents
of a string?   Doesn't that sound like an application for
WRITE-CHAR?   Or is that the kind of thing one shouldn't
be able to do in portable Scheme?
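
To make that example concrete, here is a minimal sketch of the
arithmetic involved.  The names UTF16-CODE-UNITS and
STRING->UTF16-CODE-UNITS are hypothetical helpers, not taken from any
report or SRFI, and the units are returned as exact integers so that
the code does not depend on which CHAR definition is adopted.

```scheme
;; Sketch only: hypothetical helpers, not part of any SRFI or report.
;; Maps one code point (an exact integer) to a list of one or two
;; 16-bit UTF-16 code units.
(define (utf16-code-units cp)
  (if (< cp #x10000)
      (list cp)                                  ; BMP: a single code unit
      (let ((v (- cp #x10000)))                  ; 20 bits remain
        (list (+ #xD800 (quotient v #x400))      ; high (lead) surrogate
              (+ #xDC00 (remainder v #x400)))))) ; low (trail) surrogate

;; All code units for the characters of a string, in order.
(define (string->utf16-code-units s)
  (apply append
         (map (lambda (c) (utf16-code-units (char->integer c)))
              (string->list s))))

(utf16-code-units #x1D11E)   ; => (55348 56606), i.e. #xD834 #xDD1E
```

If CHARs are codepoints, each unit could be handed to WRITE-CHAR by
way of INTEGER->CHAR.  If CHARs are restricted to scalar values, the
two surrogate units have no CHAR to be converted to, so the output
step would need some other procedure.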

>> What is gained by forcing surrogates to be unrepresentable as CHAR?

> Every string is representable in UTF-8, UTF-16, etc.

You are concerned about sequences containing isolated (unpaired)
surrogates and their implications for string algebra.  Your
concerns are entirely reducible to a concern with UTF-16 --
in all other encodings, there is no ambiguity.

So... how can we represent a string containing an isolated
surrogate in UTF-16?   One idea is for an implementation
to privately allocate a range of characters for that purpose.
Stuffing an isolated surrogate into a string in such an
implementation may result in storing 32 bits (a surrogate
pair encoding an isolated surrogate) but so what?  There
are other techniques available too.
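
A minimal sketch of the private-range idea, assuming the
implementation sets aside the last 2048 code points of plane 16 for
this purpose.  ESCAPE-BASE and the procedure names are hypothetical,
and the choice of range is only an illustration; an implementation
doing this gives up the ordinary private-use meaning of that range.

```scheme
;; Sketch only: ESCAPE-BASE is a hypothetical, implementation-private
;; choice.  Any 2048 supplementary code points that the implementation
;; reserves for itself would serve equally well.
(define escape-base #x10F800)

(define (surrogate? cp)
  (and (>= cp #xD800) (<= cp #xDFFF)))

;; Code point actually stored in the UTF-16 buffer for CP.  An
;; isolated surrogate is replaced by a reserved supplementary code
;; point, which UTF-16 then stores as two 16-bit code units.
(define (stored-code-point cp)
  (if (surrogate? cp)
      (+ escape-base (- cp #xD800))
      cp))

;; Inverse mapping, applied when reading a code point back out.
(define (loaded-code-point cp)
  (if (and (>= cp escape-base) (< cp (+ escape-base #x800)))
      (+ #xD800 (- cp escape-base))
      cp))
```

With this mapping (LOADED-CODE-POINT (STORED-CODE-POINT cp)) returns
cp for every code point outside the reserved range, including the
isolated surrogates.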

In fact, it would be a MINOR arbitrary limitation of a
conforming implementation (according to your own standards
of what's important, evidenced by the draft) if that implementation
simply aborted when an attempt to read or form an isolated
surrogate happened.  Why, then, would the standard bother
to forbid it?

>> What kind of code will I wind up with if I want to iterate over
>> a large range of CHAR values?

> Two loops: one from 0 to #xD7FF, and one from #xE000 to #x10FFFF.

I'm not sure what to say other than that I don't see why you
are comfortable with that.  Surely people will want to paper
that over and the net result will be what I suggested in the part you
did not quote: we'll wind up with a separate set of APIs to
cope with character arithmetic -- odd since arithmetic is just
arithmetic no matter how you spell it.
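
For illustration, this is roughly the kind of wrapper people would
write to paper over the gap.  FOR-EACH-SCALAR-VALUE is a hypothetical
name, and the sketch assumes an implementation in which INTEGER->CHAR
accepts every scalar value up to #x10FFFF.

```scheme
;; Sketch only: FOR-EACH-SCALAR-VALUE is a hypothetical name.
;; Calls PROC on every CHAR whose scalar value lies in [LO, HI],
;; skipping the surrogate range #xD800..#xDFFF.
(define (for-each-scalar-value proc lo hi)
  (let loop ((n lo))
    (cond ((> n hi) 'done)
          ((and (>= n #xD800) (<= n #xDFFF))
           (loop #xE000))                 ; jump over the gap
          (else
           (proc (integer->char n))
           (loop (+ n 1))))))

;; Usage: visit every CHAR in the two ranges with a single call.
;; (for-each-scalar-value display-char-info 0 #x10FFFF)
;; where DISPLAY-CHAR-INFO stands for any one-argument procedure.
```

The range test is exactly the piece of character arithmetic that every
such loop has to repeat or hide behind an extra API.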

>> It's not as if by excluding surrogates we arrive at a CHAR definition
>> that is significantly more "linguistic" than if we don't.

> True, but we arrive at a definition that is more standards-friendly,

I don't know what you mean by "standards-friendly" here.

> FWIW: MzScheme originally supported a larger set of characters, mainly
> because extra bits are available in my implementation. The resulting bad
> experience convinced me to define characters in terms of scalar
> values, instead.

I don't see your point.  I don't see what "extra bits" have to do with
surrogates.  You also don't explain why a set of characters larger
than "Unicode scalar values" caused a bad experience and I don't take
your word for it (maybe you guys made some other mistake that was
the *real* cause of the problems you encountered -- maybe you
misidentified the issues -- I can't tell from your account).

-t

Re: the "Unicode Background" section (unknown) 21 Jul 2005 23:52 UTC

At Thu, 21 Jul 2005 15:45:34 -0700, Thomas Lord wrote:
> If CHARs are codepoints, more basic Unicode algorithms translate
> into Scheme cleanly.

I don't see what you mean. Can you provide an example?

> What is gained by forcing surrogates to be unrepresentable as CHAR?

Every string is representable in UTF-8, UTF-16, etc.

> What kind of code will I wind up with if I want to iterate over
> a large range of CHAR values?

Two loops: one from 0 to #xD7FF, and one from #xE000 to #x10FFFF.

> It's not as if by excluding surrogates we arrive at a CHAR definition
> that is significantly more "linguistic" than if we don't.

True, but we arrive at a definition that is more standards-friendly,
and that's part of the overall compromise.

FWIW: MzScheme originally supported a larger set of characters, mainly
because extra bits are available in my implementation. The resulting bad
experience convinced me to define characters in terms of scalar values,
instead.

Matthew