Re: Surrogates and character representation
Per Bothner 28 Jul 2005 03:43 UTC
Tom Emerson wrote:
> Let's look at how I handle these in Python right now: the UTF-8 data
> is read and transcoded to the internal Unicode string format.
Ah, so you're not doing random access on "multimegabyte text files" as
we assumed from your initial message.
If you have the luxury of reading your entire file into memory (and in
the process expanding its size by a good bit) you can of course do all
kinds of processing and index-building.
It appears (from http://www.jorendorff.com/articles/unicode/python.html)
that Python Unicode strings are UTF-16 strings, so string offsets count
16-bit code units rather than characters, and character offsets will
break as soon as you go beyond the Basic Multilingual Plane.
Scheme implementations can of course fix this, though it means using
4 bytes per character. Hence the discussion.
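
To make the offset problem concrete, here is a minimal sketch. It assumes
a Python whose strings index by code point, and exposes the UTF-16 view
by encoding explicitly; utf16_units is a hypothetical helper name, not
part of any library.

```python
def utf16_units(s):
    """Return s as a list of 16-bit UTF-16 code units (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

# 'a', U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP), 'b'
text = "a\U0001D11Eb"

units = utf16_units(text)             # UTF-16 view: one slot per code unit
code_points = [ord(c) for c in text]  # fixed-width view: one slot per character

# The non-BMP character occupies two code units (a surrogate pair),
# so every offset after it is shifted by one in the UTF-16 view.
print(len(units), len(code_points))
print([hex(u) for u in units])

# Storage cost of the two representations, in bytes.
utf16_bytes = len(text.encode("utf-16-le"))
utf32_bytes = len(text.encode("utf-32-le"))
print(utf16_bytes, utf32_bytes)
```

In the UTF-16 view, offset 2 lands on the second half of a surrogate pair
instead of on 'b'; the fixed-width view gets the offset right at the cost
of 4 bytes for every character.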
--
--Per Bothner
xxxxxx@bothner.com http://per.bothner.com/