Re: Surrogates and character representation
Tom Emerson 23 Jul 2005 17:10 UTC
Thomas Bushnell BSG writes:
> This is exactly part of the reason why char=codepoint is such a lose.
> Most code doesn't *want* to see this kind of garbage; it's an encoding
> issue. I want chars where the *computer* takes care of the coding. I
> want chars that are fully-understood characters, not little pieces of
> a character.
Surrogates are a side-effect of UTF-16. Period. Application-level code
just doesn't see them. This entire discussion about whether or not a
CHAR should include surrogate code points is, IMHO, a waste of
everyone's talents here. It's much ado about nothing.
The only time you should see a surrogate value is if the input text is
malformed. Otherwise the lower-level transcoders should have converted
to the appropriate astral-plane codepoint. If the text is malformed,
big deal. It is not difficult to handle this case.
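To make that concrete, here is a minimal sketch of what such a
transcoding layer does with UTF-16 input. It is an illustration, not
anyone's actual implementation: the function name decode_utf16_units
is made up, and substituting U+FFFD for a lone surrogate is just one
possible policy (raising an error would be another).

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER (assumed error policy)


def decode_utf16_units(units):
    """Turn a sequence of 16-bit code units into a list of code points.

    Hypothetical helper for illustration. Well-formed surrogate pairs
    become a single supplementary-plane code point; lone surrogates
    (malformed input) are replaced with U+FFFD.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: only valid when a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                lo = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00))
                i += 2
                continue
            out.append(REPLACEMENT)  # unpaired high surrogate
        elif 0xDC00 <= u <= 0xDFFF:
            out.append(REPLACEMENT)  # low surrogate with no preceding high
        else:
            out.append(u)  # ordinary BMP code point
        i += 1
    return out


# 'A', U+20000 (a CJK Extension B ideograph) as a surrogate pair, 'B'
print([hex(cp) for cp in decode_utf16_units([0x0041, 0xD840, 0xDC00, 0x0042])])
```

Everything above this layer sees three code points here, none of
them a surrogate.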
FWIW, I've been working in Unicode since before UTF-16 was
developed. Most of my work is in Asian languages, where I would expect
to see characters outside the BMP. The reality is that they are just
not that common. You don't see them. The only time I do see them is
once in a while when dealing with texts from Hong Kong that are
encoded in UTF-16. But the transcoding layers make these go away, and
I just have the full codepoint. If you are a developer and you lose
sleep over surrogates, I envy you.
-tree
--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"