Re: Surrogates and character representation
Tom Emerson 28 Jul 2005 03:21 UTC
Alex Shinn writes:
> Do either of those actually supply UTF-32 files along with data
> files holding codepoint offsets? UTF-8 is by far the most common
> storage format for Unicode, and required by most network protocols.
Character offsets, irrespective of encoding. Generally the files are
UTF-8 encoded. If I have a Chinese file, the first three characters will
have character offsets 0, 1, 2, but when encoded in UTF-8 they will start
at byte offsets 0, 3, 6. If, as is often the case, ASCII-range characters
exist too, I cannot assume any fixed character width. I don't have byte
offsets. The standoff markup works regardless of the character
encoding of the original file.
> Regardless, this has nothing to do with strings. This involves
> seeking to a byte position in a file, and extracting (and optionally
> converting to the internal encoding) a chunk of text.
I'm not sure how you can say that.
Let's look at how I handle these in Python right now: the UTF-8 data
is read and transcoded to the internal Unicode string format. From
there I can use the offsets read from the standoff markup to access
the characters directly. Very simple. All the ugly transcoding is done
at the library level: I don't worry about it. If the original file
isn't in UTF-8 but in, say, CP936, and I have the appropriate
transcoder to convert it to the internal Unicode string, the offsets
continue to work.
-tree
--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"