Re: Surrogates and character representation
Tom Emerson 28 Jul 2005 03:59 UTC
Per Bothner writes:
> If you have the luxury of reading your entire file into memory (and in
> the process expanding its size by a good bit) you can of course do all
> kinds of processing and index-building.
I have text files containing 100 MB of UTF-8 encoded text, with
character offsets into that text stored in supplemental files. This
arrangement is common in corpus linguistics.
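
To make the offset problem concrete, here is a minimal Python sketch of
how a character offset maps to a byte position in UTF-8 data. It assumes
the offsets count Unicode code points (not bytes or UTF-16 code units);
the function name utf8_byte_offset and the sample text are made up for
illustration and are not taken from any real corpus tool.

```python
def utf8_byte_offset(data: bytes, char_offset: int) -> int:
    """Return the byte index where code point number char_offset starts.

    Assumes data is well-formed UTF-8 and that offsets count code points.
    """
    seen = 0
    for i, b in enumerate(data):
        # Bytes of the form 10xxxxxx continue a code point;
        # every other byte starts a new one.
        if (b & 0xC0) != 0x80:
            if seen == char_offset:
                return i
            seen += 1
    if seen == char_offset:
        # Offset just past the last character.
        return len(data)
    raise IndexError("character offset past end of text")


# Hypothetical sample: one 2-byte character and one 4-byte (astral) character.
text = "na\u00efve \U0001D11E clef"
data = text.encode("utf-8")
start = utf8_byte_offset(data, 8)   # code point 8 is the "c" of "clef"
word = data[start:].decode("utf-8")
```

The scan is linear in the size of the data, so for files of this size
one would probably keep a sparse table of (character offset, byte
offset) checkpoints rather than rescanning from the start each time.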
> It appears (from http://www.jorendorff.com/articles/unicode/python.html)
> that Python unicode strings are UTF-16 strings, so character offsets
> will break as soon as you go beyond the Basic Multilingual Plane.
> Scheme implementations can of course fix this, though it means using
> 4 bytes per character. Hence the discussion.
Yes, it falls apart with astral-plane characters (those outside the
BMP), but these are fortunately rare. When you build the Python
interpreter you can set the size of its internal Unicode characters to
either 2 bytes or 4 bytes. I use a 4-byte Unicode build of the
interpreter when I deal with astral-plane text.
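
As a rough illustration of why the offsets diverge, the sketch below
compares code-point offsets with UTF-16 code-unit offsets. It assumes an
interpreter whose strings are indexed by code point (a 4-byte build);
the helper names utf16_units and utf16_offset are hypothetical.
sys.maxunicode is the standard way to tell which kind of build is
running.

```python
import sys

# True on a 4-byte build (and always true on Python 3.3 and later).
WIDE_BUILD = sys.maxunicode > 0xFFFF


def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def utf16_offset(s: str, char_offset: int) -> int:
    """Convert a code-point offset into a UTF-16 code-unit offset.

    Assumes s is indexed by code point. Each character above U+FFFF
    occupies a surrogate pair, i.e. two code units.
    """
    return sum(2 if ord(c) > 0xFFFF else 1 for c in s[:char_offset])


# Hypothetical sample: U+1D11E (musical G clef) lies outside the BMP.
s = "a\U0001D11Eb"
code_points = len(s)               # 3 when strings index by code point
code_units = utf16_units(s)        # 4: the clef needs a surrogate pair
b_in_utf16 = utf16_offset(s, 2)    # "b" is code point 2 but code unit 3
```

Every astral-plane character before a given position shifts the UTF-16
offset by one, which is why offsets counted in code points stop lining
up on a 2-byte build.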
--
Tom Emerson                Basis Technology Corp.
Software Architect         http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"