Email list hosting service & mailing list manager

Re: the "Unicode Background" section Thomas Lord (22 Jul 2005 03:28 UTC)
Surrogates and character representation Tom Emerson (22 Jul 2005 03:55 UTC)
Re: Surrogates and character representation John.Cowan (22 Jul 2005 04:09 UTC)
Re: Surrogates and character representation Tom Emerson (22 Jul 2005 04:26 UTC)
Re: Surrogates and character representation Thomas Bushnell BSG (23 Jul 2005 07:19 UTC)
Re: Surrogates and character representation Tom Emerson (23 Jul 2005 17:38 UTC)
Re: Surrogates and character representation John.Cowan (24 Jul 2005 05:37 UTC)
Re: Surrogates and character representation Shiro Kawai (24 Jul 2005 08:15 UTC)
Re: Surrogates and character representation Tom Emerson (24 Jul 2005 13:25 UTC)
Re: Surrogates and character representation Alan Watson (24 Jul 2005 17:32 UTC)
Re: Surrogates and character representation Tom Emerson (24 Jul 2005 17:54 UTC)
Re: Surrogates and character representation Alan Watson (24 Jul 2005 18:15 UTC)
Re: Surrogates and character representation Tom Emerson (24 Jul 2005 20:18 UTC)
Re: Surrogates and character representation Per Bothner (24 Jul 2005 18:25 UTC)
Re: Surrogates and character representation John.Cowan (24 Jul 2005 23:02 UTC)
Re: Surrogates and character representation Per Bothner (24 Jul 2005 23:26 UTC)
Re: Surrogates and character representation Alan Watson (25 Jul 2005 17:24 UTC)
Re: Surrogates and character representation bear (27 Jul 2005 16:16 UTC)
Re: Surrogates and character representation John.Cowan (24 Jul 2005 22:12 UTC)
Re: Surrogates and character representation Ken Dickey (24 Jul 2005 09:35 UTC)
Re: Surrogates and character representation Michael Sperber (24 Jul 2005 11:47 UTC)
Re: the "Unicode Background" section Matthew Flatt (22 Jul 2005 04:30 UTC)
Re: the "Unicode Background" section Alex Shinn (22 Jul 2005 05:42 UTC)
Re: the "Unicode Background" section bear (22 Jul 2005 15:45 UTC)
Re: the "Unicode Background" section Tom Emerson (22 Jul 2005 15:56 UTC)

Re: Surrogates and character representation Tom Emerson 23 Jul 2005 17:10 UTC

Thomas Bushnell BSG writes:
> This is exactly part of the reason why char=codepoint is such a lose.
> Most code doesn't *want* to see this kind of garbage; it's an encoding
> issue.  I want chars where the *computer* takes care of the coding.  I
> want chars that are fully-understood characters, not little pieces of
> a character.

Surrogates are a side-effect of UTF-16. Period. Application-level code
just doesn't see them. This entire discussion about whether or not a
CHAR should include surrogate code points is, IMHO, a waste of
everyones talents here. It's much ado about nothing.

The only time you should see a surrogate value is if the input text is
malformed. Otherwise the lower-level transcoders should have converted
to the appropriate astral plan codepoint. If the text is malformed,
big deal. It is not difficult to handle this case.

FWIW, I've been working in Unicode since before UTF-16 was
developed. Most of my work is in Asian languages, where I would expect
to see characters outside the BMP. The reality is that they are just
not that commmon. You don't see them. The only time I do see them is
once in a while when dealing with texts from Hong Kong that are
encoded in UTF-16. But the transcoding layers makes these go away, and
I just have the full codepoint. If you are a developer and you lose
sleep over surrogates, I envy you.

    -tree

--
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"