
Re: Surrogates and character representation bear 28 Jul 2005 22:35 UTC


On Thu, 28 Jul 2005, Alan Watson wrote:

>So, two questions:
>
>(1) Are your "random" accesses into your corpus linguistics strings
>really random, do they have significant locality, or could they be
>arranged to have significant locality?

Speaking for myself, I would say they are as close to random as
makes no difference.  I typically suck the large string into
memory, pull in its indexes from another file, and then consult
my indexes for members of a particular synonym group and go to
fifty or five hundred locations in the string to gather details
about the context in which those words were used.

Now I could sort the accesses and do them from lowest to highest
offset, thus simulating locality.  But, particularly with relatively
rare words, the gaps between occurrences follow a Poisson
distribution, with typical gaps measured in megabytes.
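To make the access pattern concrete, here is a hypothetical sketch
(in Python rather than Scheme, purely for illustration; all names are
invented): a corpus held as one big in-memory string, plus an index
mapping each word to the numeric offsets where it occurs.  Sorting
the offsets, as suggested above, only orders the jumps; it does not
bring far-apart occurrences any closer together.

```python
def contexts(corpus, index, word, width=30):
    """Gather the text surrounding every occurrence of `word`.

    `index` maps a word to a list of numeric offsets into `corpus`.
    The offsets can be visited in any order; sorting them merely
    simulates locality when the gaps span megabytes.
    """
    hits = []
    for off in sorted(index.get(word, [])):
        lo = max(0, off - width)
        hi = min(len(corpus), off + len(word) + width)
        hits.append(corpus[lo:hi])
    return hits

# Toy corpus and index standing in for a multi-megabyte file.
corpus = "the cat sat on the mat while the dog slept on the rug"
index = {"the": [0, 15, 29, 46], "on": [12, 43]}
hits = contexts(corpus, index, "on", width=5)
```

Because the index holds plain integers, nothing about it depends on
which in-memory instance of the string it was built against.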

The problem with doing this in terms of something other than
numeric offsets isn't locality though, not really; the problem
is serialization.  The corpus is a multi-megabyte object which
lives on the disk.  And none of the implementations of "marks"
I've seen offers marks that persist across different instances
of the string, or marks that are serializable.  There's a big upfront
investment in reading the corpus, recognizing words, parsing
sentences, and building indexes.  That's work I don't want to
repeat every time I pull the thing into memory, so having
done that, I want to be able to write the string (and the
indexes) and read the string and indexes back in when I'm
getting ready to do more work, and still have the indexes refer
to the correct places in the string.
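The point about serialization can be sketched as follows (again in
Python for concreteness; the file layout and names are invented).
Because the index stores plain numeric offsets, writing the corpus
and its index to disk and reading them back leaves every entry
pointing at the same place in the string.  An index built from
opaque, per-instance marks would carry no such guarantee.

```python
import json
import os
import tempfile

def save(path, corpus, index):
    # Write the corpus text and its word->offsets index side by side.
    with open(path + ".txt", "w") as f:
        f.write(corpus)
    with open(path + ".idx", "w") as f:
        json.dump(index, f)

def load(path):
    # Read both back; the numeric offsets remain valid as-is.
    with open(path + ".txt") as f:
        corpus = f.read()
    with open(path + ".idx") as f:
        index = json.load(f)
    return corpus, index

corpus = "colourless green ideas sleep furiously"
index = {"green": [11]}
base = os.path.join(tempfile.mkdtemp(), "corpus")
save(base, corpus, index)
corpus2, index2 = load(base)
off = index2["green"][0]
assert corpus2[off:off + 5] == "green"  # offset survived the round trip
```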

>(2) Could you live with linear complexity to extract classes of substrings?

It would be a serious problem.  "Linear" becomes really onerous
for long strings, which is one of the reasons I implemented
ropes for my string representation.
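For readers unfamiliar with ropes, here is a minimal sketch (not
Bear's implementation, and in Python for illustration): a binary
tree of string fragments in which each internal node caches the
length of its left subtree, so indexing a character costs O(depth)
instead of scanning the whole string.

```python
class Leaf:
    """A flat fragment of the string."""
    def __init__(self, s):
        self.s = s
        self.length = len(s)
    def char_at(self, i):
        return self.s[i]

class Node:
    """Concatenation node; `weight` is the length of the left subtree."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.weight = left.length
        self.length = left.length + right.length
    def char_at(self, i):
        # Descend left or right, adjusting the index as we go.
        if i < self.weight:
            return self.left.char_at(i)
        return self.right.char_at(i - self.weight)

rope = Node(Node(Leaf("hello "), Leaf("rope ")), Leaf("world"))
assert rope.char_at(0) == "h"
assert rope.char_at(6) == "r"
assert rope.char_at(11) == "w"
```

With balanced trees this gives logarithmic indexing, which is why
ropes are attractive for multi-megabyte strings where any linear
pass is onerous.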

				Bear