Re: Surrogates and character representation Tom Emerson 22 Jul 2005 04:26 UTC
John.Cowan writes:
> All other undefined codepoints are potentially definable: they correspond
> to Unicode scalar values. Surrogate codepoints are not definable and
> don't correspond to any Unicode scalar value. The difference is
> architectural.

U+FFFE is never (by architectural design) going to be defined either.

Surrogate codepoints have a character property. They should be usable
in a string, and individually each can be considered a character. Most
implementations won't see them: only library code that is
reading/writing UTF-16 needs to worry about them in any significant
way. Application code should not see them. It will see U+20069 as
having the value 0x20069, not the surrogate pair 0xD840 0xDC69.

In other words, I guess I'm saying that surrogates don't need to be
special-cased, because the existing Unicode property model accounts
for them, and generating and interpreting them should be handled at a
lower level. Special-casing them just complicates everything for
everyone.

> > One question I've had: how are 8-bit (i.e., byte) strings handled
> > here? Is there no distinction between operations on raw bytes and
> > operations on characters?
>
> Those things are not strings: they are vectors of unsigned 8-bit integers.

Of course. My Python hat is still on: there, 8-bit strings and Unicode
strings are different beasts, and 8-bit strings are used for
byte-strings.

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"
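
For illustration only, here is a minimal sketch of the U+20069 example from the message above. It assumes Python 3, which postdates this thread. The helper names `to_surrogates` and `from_surrogates` are hypothetical and not from any library. It shows the arithmetic a UTF-16 codec would do at the library level, and that a lone surrogate code point already carries a general category (Cs) under the ordinary Unicode property model.

```python
import unicodedata


def to_surrogates(cp):
    """Split a supplementary code point into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                      # 20-bit offset into the supplementary planes
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(hi, lo):
    """Combine a (high, low) surrogate pair back into one scalar value."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


hi, lo = to_surrogates(0x20069)
print(hex(hi), hex(lo))                   # 0xd840 0xdc69

# The built-in codec produces the same two code units.
print("\U00020069".encode("utf-16-be"))   # b'\xd8@\xdci'

# A surrogate code point has a general category like any other code point.
print(unicodedata.category(chr(0xD840)))  # Cs
```

Application code working with the decoded string only ever sees the single value 0x20069. The pair appears only when encoding to or decoding from UTF-16.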