Re: freshman-level Boyer-Moore fast string search Alex Shinn 29 Jul 2005 06:05 UTC

On 7/29/05, William D Clinger <xxxxxx@verizon.net> wrote:
> Alex Shinn wrote:
>  > What is being done stupidly?  Any freshman textbook will include Sigma
>  > in the analysis of BM (at least mine, "Algorithms" by CL&R does, and
>  > all of the web analysis I Googled for did as well).
>
> You were arguing that UTF-8 "wins hands down for string searching".

I retract that, it was too strong a statement.  I would still personally
prefer UTF-8 for string searching any day, but acknowledge that this
is debatable.

> Had it been clear that you meant to say nothing more than that
> UTF-8 wins hands down, provided the implementor doesn't know any
> better than to use the naive representations that are often found
> in freshman textbooks, then I would not have responded to you.

I never suggested there weren't alternate algorithms.  But when you
make a change to an algorithm significant enough to alter the
asymptotic time, it is typical to refer to that as a variant of the
algorithm.  Regardless, I did acknowledge the variants, and I quoted
your own algorithms and your own running time analysis accurately,
and don't believe I deserve this hostility.

>  > Ah, so this uses a hash table instead of a binary tree.  For
>  > asymptotic notation this will remove the lg(m) factor, though
>  > it does incur some not insignificant constant overhead.
>
> I will assume you are smart enough to realize that a potentially
> three-stage lookup based on Unicode scalar values, which usually
> terminates in the first stage of lookup, is essentially similar
> to the potentially six-step lookup required for UTF-8, which
> usually involves only one step.

Well, UTF-8 is in theory 1-4 bytes, and in practice you almost never
see 4-byte codepoints, and 1 byte codepoints are extremely common.

Is the three-stage assumption for binary trees?  That would suggest
the search strings are never longer than 8 characters.  In practice I
frequently search for much longer strings - "call-with-current-continuation"
at 30 characters (5 stage lookup) is fairly common, and mult-line
searches are not unheard of.  You can't just dismiss a lg(m) factor.

If you're interested, I'll gladly benchmark a UTF-8 string search
against any UTF-32 string search you want to provide.

> > It seems you are suggesting non-fixed-length arrays as being
> > poor choices of representation.
>
> It seems you are not reading very carefully.

Out of curiosity, what string representations does SRFI-75 penalize
which you consider to be poor?

>  > Thus Bothner's statement still stands without a counter-example.
>
> What you mean, of course, is that Bothner's statement still stands
> without a counterexample you are willing to admit.

[...]

> In case you have forgotten Bothner's statement, here it is:
> "Random accesses to a position in a string that has not been
> previously accessed is not in itself useful."

You are correct, and I was mistaken, I was mentally translating this
to "Random accesses to a position in a string that has not been
previously accessed is not *necessary*," and was arguing strongly
on this point because of the previous draft of the SRFI which restricted
strings to character arrays.

"Usefulness" is fairly meaningless to argue about - people can find
a use for just about anything.  I will stick to the claim that random
access is not necessary, for performance or other reasons.

--
Alex