Re: Surrogates and character representation
Tom Emerson 24 Jul 2005 17:54 UTC
Alan Watson writes:
> Hmm. That would seem to prevent an implementation representing strings
> internally using UTF-8. This is convenient in some contexts as Scheme
> strings can be trivially converted to UTF-8 C strings.
You can encode surrogate values in UTF-8; the result is just
ill-formed. A conformant (Unicode) implementation shouldn't generate
these, though one could argue that if you get garbage in, you get
garbage out.
Scenario 1: You have a text stream encoded in UTF-16. It contains a
valid surrogate pair <D840,DD9B>. This is converted to the Unicode
scalar value (USV) #x0002019B. If you represent Unicode strings
internally as UTF-8, this gets converted to the byte sequence
#xF0 #xA0 #x86 #x9B. When writing the text stream, you pick the
encoding and the USV gets written out appropriately.
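To make that concrete, here's a rough C sketch of the scenario 1
conversion (the function names are mine, nothing standard):

    #include <stdio.h>
    #include <stdint.h>

    /* Combine a UTF-16 surrogate pair into a Unicode scalar value,
       then emit that USV as UTF-8.  Sketch only -- no validation
       of the inputs. */
    static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
    {
        return 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                          | (uint32_t)(lo - 0xDC00));
    }

    static int usv_to_utf8(uint32_t usv, unsigned char out[4])
    {
        if (usv < 0x80)    { out[0] = (unsigned char)usv; return 1; }
        if (usv < 0x800)   { out[0] = 0xC0 | (usv >> 6);
                             out[1] = 0x80 | (usv & 0x3F); return 2; }
        if (usv < 0x10000) { out[0] = 0xE0 | (usv >> 12);
                             out[1] = 0x80 | ((usv >> 6) & 0x3F);
                             out[2] = 0x80 | (usv & 0x3F); return 3; }
        out[0] = 0xF0 | (usv >> 18);
        out[1] = 0x80 | ((usv >> 12) & 0x3F);
        out[2] = 0x80 | ((usv >> 6) & 0x3F);
        out[3] = 0x80 | (usv & 0x3F);
        return 4;
    }

    int main(void)
    {
        unsigned char buf[4];
        uint32_t usv = combine_surrogates(0xD840, 0xDD9B); /* #x2019B */
        int i, n = usv_to_utf8(usv, buf);
        printf("U+%04lX ->", (unsigned long)usv);
        for (i = 0; i < n; i++) printf(" %02X", buf[i]);
        printf("\n");                             /* F0 A0 86 9B */
        return 0;
    }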
Scenario 2: You have a text stream encoded in UTF-16. It contains a
lone surrogate, <D840>. This is an invalid string. You have a couple
of options:
2a: reject the input as invalid.
2b: replace the surrogate value with the replacement character
U+FFFD (#xEF #xBF #xBD in the UTF-8 representation)
2c: keep the code unit, encoding it internally in UTF-8 (#xED #xA1
#x80). On output this gets converted back.
2d: ignore that value completely, not preserving it on input.
Of these, 2c is non-conforming and not recommended, but avoids data
loss in cases where that is important.
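A rough sketch of what 2b and 2c look like at the byte level
(illustrative C again, not a real API):

    #include <stdint.h>

    /* 2b: substitute U+FFFD -- well-formed UTF-8, but the original
       code unit is gone. */
    static int lone_surrogate_2b(unsigned char out[3])
    {
        out[0] = 0xEF; out[1] = 0xBF; out[2] = 0xBD;  /* U+FFFD */
        return 3;
    }

    /* 2c: encode the surrogate itself -- ill-formed, non-conforming
       UTF-8, but the code unit can be recovered on output. */
    static int lone_surrogate_2c(uint16_t cu, unsigned char out[3])
    {
        out[0] = 0xE0 | (cu >> 12);           /* #xED for #xD840 */
        out[1] = 0x80 | ((cu >> 6) & 0x3F);   /* #xA1 */
        out[2] = 0x80 | (cu & 0x3F);          /* #x80 */
        return 3;
    }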
Representing strings internally in UTF-8 is a loss, though, since you
lose constant-time random access to the string: string-ref has to scan
from the beginning. For some applications this isn't a big deal, but
in general using UTF-8 as an internal representation is a bad idea.
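To illustrate the random-access point (hypothetical C; the function
is made up): finding the i-th character in a UTF-8 buffer means
walking the bytes from the start, whereas a fixed-width representation
can just index.

    #include <stddef.h>

    /* Byte offset of character index i in a UTF-8 buffer.  Getting
       there requires scanning every byte before it. */
    static size_t utf8_byte_offset(const unsigned char *s,
                                   size_t len, size_t i)
    {
        size_t byte = 0;
        while (i > 0 && byte < len) {
            byte++;                             /* skip lead byte */
            while (byte < len && (s[byte] & 0xC0) == 0x80)
                byte++;                         /* continuation bytes */
            i--;
        }
        return byte;   /* O(i), versus O(1) for a fixed-width encoding */
    }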
-tree
--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"