Re: Surrogates and character representation Per Bothner 28 Jul 2005 03:43 UTC

Tom Emerson wrote:
> Let's look at how I handle these in Python right now: the UTF-8 data
> is read and transcoded to the internal Unicode string format.

Ah, so you're not doing random access on "multimegabyte text files", as
we assumed from your initial message.

If you have the luxury of reading your entire file into memory (and in
the process expanding its size by a good bit) you can of course do all
kinds of processing and index-building.

It appears (from http://www.jorendorff.com/articles/unicode/python.html)
that Python unicode strings are UTF-16 strings, so character offsets
will break as soon as you go beyond the Basic Multilingual Plane.
Scheme implementations can of course fix this, though it means using
4 bytes per character.  Hence the discussion.
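
To make the offset problem concrete, here is a minimal sketch in Python.
It assumes a current Python 3, where str is indexed by code point, so the
UTF-16 view has to be simulated by encoding explicitly. The helper names
utf16_units and utf16_offset are made up for this illustration and are
not part of any library.

```python
# Sketch: compare code-point offsets with UTF-16 code-unit offsets.

def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to store s as UTF-16."""
    # utf-16-le avoids the byte-order mark; each code unit is 2 bytes.
    return len(s.encode("utf-16-le")) // 2

def utf16_offset(s: str, index: int) -> int:
    """UTF-16 code-unit offset of the character at code-point index `index`."""
    return utf16_units(s[:index])

bmp = "abc"               # every character is inside the BMP
astral = "a\U00010400c"   # U+10400 lies outside the BMP (surrogate pair in UTF-16)

print(len(bmp), utf16_units(bmp))        # 3 3
print(len(astral), utf16_units(astral))  # 3 4
print(utf16_offset(astral, 2))           # 3, although 'c' is character number 2
print(len(astral.encode("utf-32-le")))   # 12 bytes: 4 bytes per character
```

The two offsets agree as long as the text stays inside the BMP and drift
apart by one for every character beyond it, which is why a fixed-width
4-byte representation keeps character indexing simple.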
--
	--Per Bothner
xxxxxx@bothner.com   http://per.bothner.com/