Surrogates and character representation

Re: the "Unicode Background" section Thomas Lord (22 Jul 2005 03:28 UTC)

Surrogates and character representation Tom Emerson (22 Jul 2005 03:55 UTC)

Re: Surrogates and character representation John.Cowan (22 Jul 2005 04:09 UTC)

Re: Surrogates and character representation Tom Emerson (22 Jul 2005 04:26 UTC)

Re: Surrogates and character representation Thomas Bushnell BSG (23 Jul 2005 07:19 UTC)

Re: Surrogates and character representation Tom Emerson (23 Jul 2005 17:38 UTC)

Re: Surrogates and character representation John.Cowan (24 Jul 2005 05:37 UTC)

Re: Surrogates and character representation Shiro Kawai (24 Jul 2005 08:15 UTC)

Re: Surrogates and character representation Tom Emerson (24 Jul 2005 13:25 UTC)

Re: Surrogates and character representation Alan Watson (24 Jul 2005 17:32 UTC)

Re: Surrogates and character representation Tom Emerson (24 Jul 2005 17:54 UTC)

Re: Surrogates and character representation Alan Watson (24 Jul 2005 18:15 UTC)

Re: Surrogates and character representation Tom Emerson (24 Jul 2005 20:18 UTC)

Re: Surrogates and character representation Per Bothner (24 Jul 2005 18:25 UTC)

Re: Surrogates and character representation John.Cowan (24 Jul 2005 23:02 UTC)

Re: Surrogates and character representation Per Bothner (24 Jul 2005 23:26 UTC)

Re: Surrogates and character representation Alan Watson (25 Jul 2005 17:24 UTC)

Re: Surrogates and character representation bear (27 Jul 2005 16:16 UTC)

Re: Surrogates and character representation John.Cowan (24 Jul 2005 22:12 UTC)

Re: Surrogates and character representation Ken Dickey (24 Jul 2005 09:35 UTC)

Re: Surrogates and character representation Michael Sperber (24 Jul 2005 11:47 UTC)

Re: the "Unicode Background" section Matthew Flatt (22 Jul 2005 04:30 UTC)

Re: the "Unicode Background" section Alex Shinn (22 Jul 2005 05:42 UTC)

Re: the "Unicode Background" section bear (22 Jul 2005 15:45 UTC)

Re: the "Unicode Background" section Tom Emerson (22 Jul 2005 15:56 UTC)

Surrogates and character representation Tom Emerson 22 Jul 2005 03:54 UTC

Just US$0.02 worth from the lurking depths.

Surrogates are no more than an elegant hack to extend the original
16-bit codespace beyond 0xFFFF, up to 0x10FFFF. This talk of blocking
out the surrogate blocks from the range of character values is silly,
IMHO.

The implementation should be concerned with codepoints, in the range
0x000000 to 0x10FFFF. How these get mapped to bytes or words is an
issue with whatever transcoder you have in place to generate a
printable form of the abstract character.
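
To make the codepoint/transcoder split concrete, here is a minimal
Python sketch (Python is my choice for illustration, not something the
SRFI prescribes). It takes one codepoint above 0xFFFF and runs it
through three of Python's standard codecs; the particular codepoint,
U+1D11E, is just an arbitrary example.

```python
# One abstract codepoint, three different byte-level representations.
cp = 0x1D11E          # arbitrary example codepoint outside the 16-bit range
ch = chr(cp)          # the program sees a single character

# The transcoder, not the character type, decides how it maps to bytes.
utf8 = ch.encode("utf-8")       # four bytes
utf16 = ch.encode("utf-16-be")  # a surrogate pair: two 16-bit code units
utf32 = ch.encode("utf-32-be")  # one 32-bit code unit

print(len(ch), utf8.hex(), utf16.hex(), utf32.hex())
```

The surrogate pair only appears in the UTF-16 output; the string itself
still has length one.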

Looking at characters this way, any codepoint in the range 0xD800
through 0xDFFF is considered an invalid character. This conforms with
section 3.8 of TUS, D26a and D27. These characters only show up when
dealing with UTF-16. UCS-4, UTF-32, UTF-8, etc. don't use them.
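
A rough sketch of that rule, again in Python: a validity check on
codepoints, plus the one place where surrogate values are actually
produced, namely a UTF-16 encoder. The function names here are
hypothetical, made up for this example.

```python
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF
MAX_CODEPOINT = 0x10FFFF

def is_valid_character(cp: int) -> bool:
    """True if cp lies in 0..0x10FFFF and is not a surrogate codepoint."""
    if not 0 <= cp <= MAX_CODEPOINT:
        return False
    return not (SURROGATE_FIRST <= cp <= SURROGATE_LAST)

def utf16_code_units(cp: int) -> list:
    """Return the 16-bit code units that encode cp in UTF-16."""
    if not is_valid_character(cp):
        raise ValueError("not a valid character: 0x%X" % cp)
    if cp < 0x10000:
        return [cp]                      # fits in a single code unit
    v = cp - 0x10000                     # 20 bits left to split
    return [0xD800 + (v >> 10),          # high (leading) surrogate
            0xDC00 + (v & 0x3FF)]        # low (trailing) surrogate

print(is_valid_character(0xD800))                # False
print([hex(u) for u in utf16_code_units(0x1D11E)])
```

Surrogate values turn up only as output of the encoder, never as
characters in their own right.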

If you treat the surrogates as undefined within the character range,
then you must (for consistency) treat all of the other undefined
abstract characters as holes. This just complicates processing.

From the programmer's perspective, I just want to deal with characters
as single entities (combining forms aside for the moment). It is up to
me to know whether my string has been normalized or not, and deal with
that situation. For most uses it doesn't matter.
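
As an illustration of what "knowing whether my string has been
normalized" looks like in practice, here is a small Python sketch using
the standard unicodedata module. The sample strings are my own.

```python
import unicodedata

decomposed = "u\u0308"   # LATIN SMALL LETTER U + COMBINING DIAERESIS
composed = "\u00fc"      # LATIN SMALL LETTER U WITH DIAERESIS

# As sequences of codepoints these are different strings...
print(decomposed == composed)   # False
print(len(decomposed), len(composed))

# ...and it is up to the program to normalize before comparing.
same_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_nfd = unicodedata.normalize("NFD", composed) == decomposed
print(same_nfc, same_nfd)
```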

Using Unicode as the underlying character rep while using glyph
semantics at the program level is, to me, a recipe for complete
confusion. Then iteration over strings, and random string access,
become difficult: <0054 0073 0068 0075 0308 00DF> would then have
physical character indices at 0, 1, 2, 3, 5.
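
The indices can be checked with a short Python sketch. It treats any
codepoint with a nonzero canonical combining class as attaching to the
preceding base character. That is a deliberate simplification for this
example and not a full grapheme-cluster algorithm; the function name is
hypothetical.

```python
import unicodedata

def glyph_start_indices(codepoints):
    """Indices at which a new base character (rough 'glyph') begins."""
    starts = []
    for i, cp in enumerate(codepoints):
        # A combining mark extends the previous glyph instead of starting one.
        if i == 0 or unicodedata.combining(chr(cp)) == 0:
            starts.append(i)
    return starts

seq = [0x0054, 0x0073, 0x0068, 0x0075, 0x0308, 0x00DF]
print(glyph_start_indices(seq))   # [0, 1, 2, 3, 5]
```

Finding the n-th glyph means scanning from the start of the string, so
random access stops being a constant-time operation.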

One question I've had: how are 8-bit (i.e., byte) strings handled
here? Is there no distinction between operations on raw bytes and
operations on characters?

    -tree

--
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"