Re: freshman-level Boyer-Moore fast string search William D Clinger 01 Aug 2005 14:32 UTC

Dr Alan Watson quoting me:
> > Suppose each string s is represented by a vector of 2^21 elements,
> > where element i consists of a list of numbers, in IEEE double
> > precision, that represent the indexes within s at which the
> > character c appears, where c is the Unicode scalar value f(i),
> > where f is represented by a global association list that maps
> > scalar values to indexes (i.e. f-inverse).  SRFI-75 allows that
> > representation, yet penalizes it.  I also consider it to be a
> > poor representation.
>
> You consider this a poor representation because it does not
> allow constant-time random access or because it does not allow
> linear-time traversal? Or simply because of the space cost?

Yes.  And don't forget the implementation complexity, cost of
mutation, potential for race conditions in multithreaded systems,
and incompatibility with Java, C#, and C++ cultural expectations.

> Consider an implementation that internally represents strings
> as if they were doubly-linked lists and cached the position of
> the last index. This allows linear-time traversal but not
> constant-time random access. Would you consider this a poor
> implementation?

For some purposes, perhaps, but not for others.  IMO it is far
more reasonable than the representation I described above.

BTW, I once used a doubly-linked-list-of-lines representation to
represent the text of editor buffers in an Emacs-like editor I
wrote in Scheme.  Some of the editor's commands treated the text
as a single long linear array of characters; other commands
treated the text as an array of strings, with double indexing
(line+column) to obtain a character.  Aided by a couple of cached
index translations, this representation ran acceptably fast on an
8 MHz 68000 with 1 Mby of RAM, and ran very nicely on the faster
and larger Macintoshes that came later.  It was never released as
a product, but I used it at home for more than a decade.

That experience is one of the reasons I disregard some of the
criticisms I have read here concerning SRFI-75's alleged bias
against interesting representations.

I have expressed my opinion that implementations should provide
multiple representations for strings, transparently and adaptively,
using the API described in SRFI-75 and to be described in future
SRFIs.  Not every one of those representations has to do all things
well.  Indeed, the impossibility of doing all things well with a
single representation is the reason I favor a multiplicity.

Will