Re: freshman-level Boyer-Moore fast string search William D Clinger 29 Jul 2005 05:32 UTC

Alex Shinn wrote:
 > What is being done stupidly?  Any freshman textbook will include Sigma
 > in the analysis of BM (at least mine, "Algorithms" by CL&R does, and
 > all of the web analysis I Googled for did as well).

You were arguing that UTF-8 "wins hands down for string searching".
Had it been clear that you meant to say nothing more than that
UTF-8 wins hands down, provided the implementor doesn't know any
better than to use the naive representations that are often found
in freshman textbooks, then I would not have responded to you.

 > Ah, so this uses a hash table instead of a binary tree.  For
 > asymptotic notation this will remove the lg(m) factor, though
 > it does incur some not insignificant constant overhead.

I will assume you are smart enough to realize that a potentially
three-stage lookup based on Unicode scalar values, which usually
terminates in the first stage of lookup, is essentially similar
to the potentially six-step lookup required for UTF-8, which
usually involves only one step.

 > It seems you are suggesting non-fixed-length arrays as being
 > poor choices of representation.

It seems you are not reading very carefully.

 > Indeed, without some sort of
 > string pointer or cursor object SRFI-75 does indeed penalize
 > UTF-8 and ropes.

I disagree.  SRFI-75 is neutral with respect to UTF-8 and ropes.
The scope of SRFI-75 is limited to support for Unicode in three
of Scheme's existing data types; it does not add new datatypes.
So far as I can tell, however, SRFI-75 is entirely compatible
with "some sort of string pointer or cursor object" proposal,
and I think it would be a good idea for someone to propose such
things in some other SRFI.

 > Thus Bothner's statement still stands without a counter-example.

What you mean, of course, is that Bothner's statement still stands
without a counterexample you are willing to admit.

To deny that the FLBM is a counterexample, you must deny that the
FLBM is a useful algorithm.  To deny that the real Boyer-Moore
algorithm is a counterexample, you must deny that it is a useful
algorithm.  You can't get out of that by insisting that the
Boyer-Moore algorithm is useful only when the strings being
searched are represented in UTF-8, as silly as that claim would
be, because it is easy to construct examples for which the
algorithm must access all of the bytes that encode every
character that it examines in the UTF-8 strings being searched.

In case you have forgotten Bothner's statement, here it is:
"Random accesses to a position in a string that has not been
previously accessed is not in itself useful."

Will