Re: Surrogates and character representation

Re: Surrogates and character representation Tom Emerson 22 Jul 2005 04:26 UTC

John.Cowan writes:
> All other undefined codepoints are potentially definable: they correspond
> to Unicode scalar values.  Surrogate codepoints are not definable and
> don't correspond to any Unicode scalar value.  The difference is
> architectural.

U+FFFE is never going to be defined either, and that too is by
architectural design.

Surrogate codepoints have character properties (their general category
is Cs). They should be usable in a string, and individually each can be
considered a character. Most implementations won't see them: only
library code that reads or writes UTF-16 needs to worry about them in
any significant way. Application code should not see them. It will see
U+20069 as having the value 0x20069, not the surrogate pair
0xD840 0xDC69.
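
To make the U+20069 arithmetic concrete, here is a minimal Python sketch of
the kind of conversion a UTF-16 library layer performs. The function names
are hypothetical and chosen only for illustration; they are not taken from
any SRFI or library discussed in this thread.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a high/low surrogate pair back into one scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x20069)
print(hex(high), hex(low))                   # 0xd840 0xdc69
print(hex(from_surrogate_pair(high, low)))   # 0x20069
```

Application code only ever handles the value 0x20069; the pair exists solely
inside the UTF-16 encoder and decoder.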

In other words, I guess I'm saying that surrogates don't need to be
special-cased: the existing Unicode property model already accounts
for them, and generating and interpreting them should be handled at a
lower level. Special-casing them just complicates everything for
everyone.
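
As a rough illustration of this argument, the Python sketch below asks the
Unicode character database about a lone surrogate, then shows the codec
layer rejecting it. It uses the standard `unicodedata` module; the helper
names are hypothetical, and Python's behaviour is only one possible design,
not what the proposal under discussion specifies.

```python
import unicodedata


def is_surrogate(ch):
    """True if the property database classifies ch as a surrogate (Cs)."""
    return unicodedata.category(ch) == "Cs"


def can_encode_utf8(s):
    """True if the codec layer accepts s; lone surrogates are rejected there."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


lone = chr(0xD800)                        # a lone high surrogate as a 1-char string
print(is_surrogate(lone))                 # True
print(unicodedata.category("\ufffe"))     # Cn (noncharacter, never assigned)
print(can_encode_utf8("a" + lone))        # False
```

The string type itself needs no special case here: the property lookup
classifies the surrogate, and only the encoder decides what to do with it.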

> > One question I've had: how are 8-bit (i.e., byte) strings handled
> > here? Is there no distinction between operations on raw bytes and
> > operations on characters?
>
> Those things are not strings: they are vectors of unsigned 8-bit integers.

Of course. My Python hat is still on: there, 8-bit strings and Unicode
strings are different beasts, and 8-bit strings are used as
byte strings.
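
For readers who don't know the Python distinction, here is a small sketch.
It uses present-day Python 3, where the two types are called `str` and
`bytes`; in the Python of 2005 they were `unicode` and the 8-bit `str`. The
variable names and helper are illustrative only.

```python
def describe(value):
    """Return (type name, length, elements as integers) for a str or bytes value."""
    if isinstance(value, bytes):
        return "bytes", len(value), list(value)          # elements are 0..255
    return "str", len(value), [ord(c) for c in value]    # elements are code points


text = "\U00020069"            # one character, U+20069
raw = text.encode("utf-8")     # the same character as raw UTF-8 bytes

print(describe(text))          # ('str', 1, [131177])
print(describe(raw))           # ('bytes', 4, [240, 160, 129, 169])
```

Indexing the byte string yields small integers, which matches the "vectors
of unsigned 8-bit integers" description quoted above.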

--
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"