Re: Surrogates and character representation Tom Emerson 28 Jul 2005 03:21 UTC
Alex Shinn writes:
> Do either of those actually supply UTF-32 files along with data
> files holding codepoint offsets?  UTF-8 is by far the most common
> storage format for Unicode, and required by most network protocols.

Character offsets, irrespective of encoding. Generally these are
UTF-8 encoded. If I have a Chinese file the first three characters
will have character offsets 0, 1, 2, but when encoded in UTF-8 these
will be at 0, 3, 6. If, as is often the case, ASCII-range characters
exist too, I cannot assume any given underlying character width. I
don't have byte offsets. The standoff markup will work regardless of
the character encoding of the original file.

> Regardless, this has nothing to do with strings.  This involves
> seeking to a byte position in a file, and extracting (and optionally
> converting to the internal encoding) a chunk of text.

I'm not sure how you can say that. Let's look at how I handle these in
Python right now: the UTF-8 data is read and transcoded to the
internal Unicode string format. From there I can use the offsets read
from the standoff markup to access the characters directly. Very
simple. All the ugly transcoding is done at the library level: I don't
worry about it. If the original file isn't in UTF-8, but in say CP936,
and I have the appropriate transcoder to convert to the internal
Unicode string, the offsets continue to work.

    -tree

--
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"