Re: freshman-level Boyer-Moore fast string search
Per Bothner 29 Jul 2005 06:42 UTC
William D Clinger wrote:
> Alex Shinn wrote:
> > Thus Bothner's statement still stands without a counter-example.
>
> What you mean, of course, is that Bothner's statement still stands
> without a counterexample you are willing to admit.
>
> To deny that the FLBM is a counterexample, you must deny that the
> FLBM is a useful algorithm. To deny that the real Boyer-Moore
> algorithm is a counterexample, you must deny that it is a useful
> algorithm. You can't get out of that by insisting that the
> Boyer-Moore algorithm is useful only when the strings being
> searched are represented in UTF-8, as silly as that claim would
> be, because it is easy to construct examples for which the
> algorithm must access all of the bytes that encode every
> character that it examines in the UTF-8 strings being searched.
>
> In case you have forgotten Bothner's statement, here it is:
> "Random accesses to a position in a string that has not been
> previously accessed is not in itself useful."
I might quibble that BM isn't really random access: It's really
sequential scanning, but sometimes you're able to skip ahead more
than a single character at a time. Also, note that character
indexes aren't inputs to the algorthm, - i.e. they aren't semenatically
meaningful. The discussion is whether the performance of one
particular algorithm (and its variations) is better when using a
UTF-32 representation or a UTF-8 representation. It's similar to
asking what kind of indexes would work best, which I'm also not
in a position to judge.
I won't deny that BM is useful, but I suspect it is *less* useful
now that people search XML files and other complex text data structures
with embedded controls and tree structure. Or they use special
indexes.
Also, remember that with today's hardware what matters performace-wise
are cache misses, and UTF-8 uses less memory that UTF-32. Any
theoretical performance advantage of UTF-32 for BM has to make up
for that fact. Even if UTF-8 has to access (say) twice as many bytes,
that doesn't matter if it ends up accessing fewer cache "words'.
--
--Per Bothner
xxxxxx@bothner.com http://per.bothner.com/