Re: Surrogates and character representation Tom Emerson 24 Jul 2005 17:54 UTC
Alan Watson writes:
> Hmm. That would seem to prevent an implementation representing strings
> internally using UTF-8. This is convenient in some contexts as Scheme
> strings can be trivially converted to UTF-8 C strings.

You can encode surrogate values in UTF-8; the result is just ill-formed.
A conformant (Unicode) implementation shouldn't generate these, though
one could argue that garbage in entitles you to garbage out.

Scenario 1: You have a text stream encoded in UTF-16, and it contains the
valid surrogate pair <D840,DD9B>. This is converted to the USV
#x0002019B. If you represent Unicode strings internally as UTF-8, that
USV becomes the byte sequence #xF0 #xA0 #x86 #x9B. When writing the text
stream out you pick the encoding and the USV gets written appropriately.
(A small C sketch of this conversion appears at the end of this message.)

Scenario 2: You have a text stream encoded in UTF-16, and it contains a
lone surrogate, <D840>. This is an invalid string. You have a few
options:

2a: Reject the input as invalid.

2b: Replace the surrogate with the replacement character U+FFFD
    (#xEF #xBF #xBD in the UTF-8 representation). A sketch of this
    option also appears below.

2c: Keep the value, encoding it internally in UTF-8 as #xED #xA1 #x80;
    on output it gets converted back.

2d: Ignore the value completely, not preserving it on input.

Of these, 2c is non-conforming and not recommended, but it avoids data
loss in cases where that matters.

Representing strings internally in UTF-8 has a cost, though: you lose
constant-time random access by character index. For some applications
this isn't a big deal, but in general UTF-8 is a bad idea as an internal
representation.
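For concreteness, here is a minimal C sketch of scenario 1: combining a
surrogate pair into a USV and encoding it as four UTF-8 bytes. The
function names are illustrative, and range checks on the inputs are left
to the caller.

    #include <stdio.h>
    #include <stdint.h>

    /* Combine a UTF-16 surrogate pair into a Unicode scalar value.
       Assumes hi is in D800..DBFF and lo is in DC00..DFFF. */
    static uint32_t surrogate_pair_to_usv(uint16_t hi, uint16_t lo)
    {
        return 0x10000u + (((uint32_t)hi - 0xD800u) << 10)
                        + ((uint32_t)lo - 0xDC00u);
    }

    /* Encode a USV in the range 10000..10FFFF as four UTF-8 bytes. */
    static void usv_to_utf8_4(uint32_t usv, unsigned char out[4])
    {
        out[0] = (unsigned char)(0xF0 |  (usv >> 18));
        out[1] = (unsigned char)(0x80 | ((usv >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((usv >>  6) & 0x3F));
        out[3] = (unsigned char)(0x80 |  (usv        & 0x3F));
    }

    int main(void)
    {
        unsigned char buf[4];
        uint32_t usv = surrogate_pair_to_usv(0xD840, 0xDD9B);
        usv_to_utf8_4(usv, buf);
        printf("U+%04X -> %02X %02X %02X %02X\n",
               (unsigned)usv, buf[0], buf[1], buf[2], buf[3]);
        /* Prints: U+2019B -> F0 A0 86 9B */
        return 0;
    }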
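And a sketch of option 2b: walk the UTF-16 code units, pair surrogates
where possible, and substitute U+FFFD for any unpaired surrogate. The
emit callback here is a stand-in for whatever your implementation does
with each scalar value (e.g. appending its UTF-8 bytes to the internal
string).

    #include <stddef.h>
    #include <stdint.h>

    static void utf16_to_usvs(const uint16_t *in, size_t n,
                              void (*emit)(uint32_t usv, void *ctx),
                              void *ctx)
    {
        for (size_t i = 0; i < n; i++) {
            uint16_t u = in[i];
            if (u >= 0xD800 && u <= 0xDBFF) {        /* high surrogate */
                if (i + 1 < n &&
                    in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                    emit(0x10000u + ((uint32_t)(u - 0xD800) << 10)
                                  + (uint32_t)(in[i + 1] - 0xDC00), ctx);
                    i++;                             /* consume the pair */
                } else {
                    emit(0xFFFD, ctx);               /* lone high surrogate */
                }
            } else if (u >= 0xDC00 && u <= 0xDFFF) {
                emit(0xFFFD, ctx);                   /* lone low surrogate */
            } else {
                emit(u, ctx);                        /* BMP scalar value */
            }
        }
    }

-tree

--
Tom Emerson
Basis Technology Corp.
Software Architect
http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"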