Re: Surrogates and character representation Tom Emerson 28 Jul 2005 03:21 UTC

Alex Shinn writes:
> Do either of those actually supply UTF-32 files along with data
> files holding codepoint offsets?  UTF-8 is by far the most common
> storage format for Unicode, and required by most network protocols.

The offsets are character offsets, irrespective of encoding. The files
themselves are generally UTF-8 encoded. If I have a Chinese file, the
first three characters will have character offsets 0, 1, 2, but when
encoded in UTF-8 they will start at bytes 0, 3, 6. If, as is often the
case, ASCII-range characters exist too, I cannot assume any given
underlying character width. I don't have byte offsets. The standoff
markup will work regardless of the character encoding of the original
file.

> Regardless, this has nothing to do with strings.  This involves
> seeking to a byte position in a file, and extracting (and optionally
> converting to the internal encoding) a chunk of text.

I'm not sure how you can say that.

Let's look at how I handle these in Python right now: the UTF-8 data
is read and transcoded to the internal Unicode string format. From
there I can use the offsets read from the standoff markup to access
the characters directly. Very simple. All the ugly transcoding is done
at the library level, so I don't have to worry about it. If the
original file isn't in UTF-8 but in, say, CP936, and I have the
appropriate transcoder to convert to the internal Unicode string, the
offsets continue to work.
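
Roughly, the workflow looks like the sketch below. This is not my
actual code: the function name apply_standoff, the sample sentence, and
the (start, end, label) annotation layout are all assumptions made for
the example, and it assumes Python 3, where str is indexed by code
point.

```python
def apply_standoff(raw, encoding, annotations):
    """Decode raw bytes, then slice the text by character offsets.

    annotations is a list of (start, end, label) tuples whose offsets
    count characters, not bytes (a hypothetical standoff layout).
    """
    text = raw.decode(encoding)  # transcoding happens in the library
    return [(label, text[start:end]) for start, end, label in annotations]

# Hypothetical document and standoff markup.
source = "北京 is the capital of 中国."
annotations = [(0, 2, "LOC"), (21, 23, "LOC")]

# The same character offsets work for either on-disk encoding.
from_utf8 = apply_standoff(source.encode("utf-8"), "utf-8", annotations)
from_cp936 = apply_standoff(source.encode("cp936"), "cp936", annotations)

print(from_utf8)
print(from_cp936)
```

The two byte strings have different lengths and different byte
positions for every annotated span, yet the annotations never have to
change.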

    -tree

--
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"