Re: Surrogates and character representation
John.Cowan 22 Jul 2005 04:09 UTC
Tom Emerson scripsit:
> If you treat the surrogates as undefined within the character range,
> then you must (for consistency) treat all of the other undefined
> abstract characters as holes. This just complicates processing.
All other undefined code points are potentially definable: they correspond
to Unicode scalar values. Surrogate code points (U+D800 through U+DFFF)
can never be assigned and don't correspond to any Unicode scalar value.
The difference is architectural.
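
To make the distinction concrete, here is a minimal sketch in Python (not
the language under discussion; the function names are hypothetical, chosen
only for illustration). It assumes U+0378 as an example of an unassigned
code point, which may change in a later Unicode version.

```python
# Surrogate range and the upper limit of the Unicode code space.
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF
MAX_CODE_POINT = 0x10FFFF


def is_surrogate(cp: int) -> bool:
    """True if cp lies in the surrogate range U+D800..U+DFFF."""
    return SURROGATE_FIRST <= cp <= SURROGATE_LAST


def is_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value: any code point that is
    not a surrogate, whether or not a character is assigned to it."""
    return 0 <= cp <= MAX_CODE_POINT and not is_surrogate(cp)


def utf8_or_none(cp: int):
    """Encode a single code point as UTF-8, or return None if the
    codec refuses it (as it does for surrogates)."""
    try:
        return chr(cp).encode("utf-8")
    except UnicodeEncodeError:
        return None


# U+0378 is assumed unassigned here, yet it is still a scalar value
# and still has a well-defined UTF-8 encoding.
unassigned_ok = is_scalar_value(0x0378)
unassigned_bytes = utf8_or_none(0x0378)

# U+D800 is a surrogate: not a scalar value, and not encodable.
surrogate_ok = is_scalar_value(0xD800)
surrogate_bytes = utf8_or_none(0xD800)
```

An unassigned code point passes through the encoder untouched, while a
surrogate is rejected outright. That is the architectural difference: the
hole at U+D800..U+DFFF is permanent, the others are not.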
> One question I've had: how are 8-bit (i.e., byte) strings handled
> here? Is there no distinction between operations on raw bytes and
> operations on characters?
Those things are not strings: they are vectors of unsigned 8-bit integers.
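
As a rough illustration of the same separation in another language, the
sketch below uses Python's bytes and str types (an analogy only, not the
API being discussed). The sample octets are an assumption: the UTF-8
encoding of U+20AC EURO SIGN.

```python
# A vector of unsigned 8-bit integers: three octets, nothing more.
raw = bytes([0xE2, 0x82, 0xAC])

# Indexing yields an integer in 0..255, not a character.
first_octet = raw[0]

# Characters only appear once an encoding is applied explicitly.
text = raw.decode("utf-8")

octet_count = len(raw)       # counts octets
char_count = len(text)       # counts characters
code_point = ord(text)       # the single character's code point
```

The octet vector has length 3 and the string has length 1; no operation on
the former says anything about characters until a decoding is chosen.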
--
John Cowan xxxxxx@reutershealth.com http://www.ccil.org/~cowan
Is it not written, "That which is written, is written"?