Re: Surrogates and character representation Shiro Kawai 28 Jul 2005 10:06 UTC

From: bear <xxxxxx@sonic.net>
Subject: Re: Surrogates and character representation
Date: Thu, 28 Jul 2005 01:24:15 -0700 (PDT)

> On Wed, 27 Jul 2005, Tom Emerson wrote:
>
> >Per Bothner writes:
> >> If you have the luxury of reading your entire file into memory (and in
> >> the process expanding its size by a good bit) you can of course do all
> >> kinds of processing and index-building.
> >
> >I have text files containing 100MB worth of UTF-8 encoded text with
> >character offsets in supplemental files. This happens regularly in
> >corpus linguistics.
>
> Uh, seconded.  Same reason (corpus linguistics).  There is no
> practical way to keep track of "marks" for hundreds of thousands
> (or millions) of interlinear annotations, and be able to serialize
> the string and read it back with marks intact. Numeric offsets do
> a better, more natural job.

OK, I'm convinced that character indexing has an advantage here.

I see that it's a matter of where you pay the cost.
Assuming the text files are in UTF-8: if your internal string
is a UTF-32 (UCS-4) array, you pay the cost while reading the text;
if you represent strings as ropes internally, you spread the cost
between reading time and access time; if your internal string is
UTF-8, you can just mmap the text file, but you pay the cost when
loading the index data, since you must then convert the character
offsets into byte offsets.
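
To make the last case concrete, here is a minimal sketch in Python of
the index conversion over a UTF-8 buffer. The function name
char_to_byte_offsets is hypothetical, and the sketch assumes that the
offsets in the supplemental file count code points, that they are
sorted in ascending order, and that the text is valid UTF-8.

```python
def char_to_byte_offsets(data: bytes, char_offsets):
    """Map sorted code-point offsets to byte offsets in UTF-8 data.

    Hypothetical helper: one linear pass over the bytes, no decoding.
    Assumes `data` is valid UTF-8 and `char_offsets` is ascending.
    """
    result = []
    wanted = iter(char_offsets)
    target = next(wanted, None)
    char_pos = 0  # number of code points seen so far

    for byte_pos, b in enumerate(data):
        if target is None:
            break  # every requested offset has been resolved
        # A byte starts a code point unless it is a continuation
        # byte, i.e. unless its top two bits are 10.
        if (b & 0xC0) != 0x80:
            while target is not None and target == char_pos:
                result.append(byte_pos)
                target = next(wanted, None)
            char_pos += 1

    # An offset equal to the character count points at the end of the text.
    while target is not None and target == char_pos:
        result.append(len(data))
        target = next(wanted, None)

    if target is not None:
        raise ValueError("offsets must be ascending and within the text")
    return result


# "a" is 1 byte, "é" is 2 bytes, "𝄞" is 4 bytes, "z" is 1 byte.
sample = "aé𝄞z".encode("utf-8")
print(char_to_byte_offsets(sample, [0, 1, 2, 3, 4]))  # [0, 1, 3, 7, 8]
```

The pass is linear in the size of the text for each index set, which
is the cost paid at index loading time. With a UTF-32 array the same
offsets could be used directly, at the price of decoding the whole
file once up front.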

When you apply lots of different index sets to a single text file,
the first choice (paying the cost at text reading time) wins.

--shiro