Re: Surrogates and character representation Tom Emerson 24 Jul 2005 17:54 UTC
Alan Watson writes:
> Hmm. That would seem to prevent an implementation representing strings
> internally using UTF-8. This is convenient in some contexts as Scheme
> strings can be trivially converted to UTF-8 C strings.

You can encode surrogate values in UTF-8; the result is just ill-formed.
A conformant (Unicode) implementation shouldn't generate these, though
one could argue that garbage in entitles you to garbage out.

Scenario 1: You have a text stream encoded in UTF-16, and it contains the
valid surrogate pair <D840,DD9B>. This is converted to the USV
#x0002019B. If you represent Unicode strings internally as UTF-8, that
USV becomes the byte sequence #xF0 #xA0 #x86 #x9B. When writing the text
stream out you pick the encoding and the USV gets written appropriately.
(A small C sketch of this conversion appears at the end of this message.)

Scenario 2: You have a text stream encoded in UTF-16, and it contains a
lone surrogate, <D840>. This is an invalid string. You have a few
options:

2a: Reject the input as invalid.

2b: Replace the surrogate with the replacement character U+FFFD
    (#xEF #xBF #xBD in the UTF-8 representation). A sketch of this
    option also appears below.

2c: Keep the value, encoding it internally in UTF-8 as #xED #xA1 #x80;
    on output it gets converted back.

2d: Ignore the value completely, not preserving it on input.

Of these, 2c is non-conforming and not recommended, but it avoids data
loss in cases where that matters.

Representing strings internally in UTF-8 has a cost, though: you lose
constant-time random access by character index. For some applications
this isn't a big deal, but in general UTF-8 is a bad idea as an internal
representation.
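For concreteness, here is a minimal C sketch of scenario 1: combining a
surrogate pair into a USV and encoding it as four UTF-8 bytes. The
function names are illustrative, and range checks on the inputs are left
to the caller.

    #include <stdio.h>
    #include <stdint.h>

    /* Combine a UTF-16 surrogate pair into a Unicode scalar value.
       Assumes hi is in D800..DBFF and lo is in DC00..DFFF. */
    static uint32_t surrogate_pair_to_usv(uint16_t hi, uint16_t lo)
    {
        return 0x10000u + (((uint32_t)hi - 0xD800u) << 10)
                        + ((uint32_t)lo - 0xDC00u);
    }

    /* Encode a USV in the range 10000..10FFFF as four UTF-8 bytes. */
    static void usv_to_utf8_4(uint32_t usv, unsigned char out[4])
    {
        out[0] = (unsigned char)(0xF0 |  (usv >> 18));
        out[1] = (unsigned char)(0x80 | ((usv >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((usv >>  6) & 0x3F));
        out[3] = (unsigned char)(0x80 |  (usv        & 0x3F));
    }

    int main(void)
    {
        unsigned char buf[4];
        uint32_t usv = surrogate_pair_to_usv(0xD840, 0xDD9B);
        usv_to_utf8_4(usv, buf);
        printf("U+%04X -> %02X %02X %02X %02X\n",
               (unsigned)usv, buf[0], buf[1], buf[2], buf[3]);
        /* Prints: U+2019B -> F0 A0 86 9B */
        return 0;
    }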
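And a sketch of option 2b: walk the UTF-16 code units, pair surrogates
where possible, and substitute U+FFFD for any unpaired surrogate. The
emit callback here is a stand-in for whatever your implementation does
with each scalar value (e.g. appending its UTF-8 bytes to the internal
string).

    #include <stddef.h>
    #include <stdint.h>

    static void utf16_to_usvs(const uint16_t *in, size_t n,
                              void (*emit)(uint32_t usv, void *ctx),
                              void *ctx)
    {
        for (size_t i = 0; i < n; i++) {
            uint16_t u = in[i];
            if (u >= 0xD800 && u <= 0xDBFF) {        /* high surrogate */
                if (i + 1 < n &&
                    in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                    emit(0x10000u + ((uint32_t)(u - 0xD800) << 10)
                                  + (uint32_t)(in[i + 1] - 0xDC00), ctx);
                    i++;                             /* consume the pair */
                } else {
                    emit(0xFFFD, ctx);               /* lone high surrogate */
                }
            } else if (u >= 0xDC00 && u <= 0xDFFF) {
                emit(0xFFFD, ctx);                   /* lone low surrogate */
            } else {
                emit(u, ctx);                        /* BMP scalar value */
            }
        }
    }

-tree

--
Tom Emerson
Basis Technology Corp.
Software Architect
http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"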