Re: freshman-level Boyer-Moore fast string search Alex Shinn 29 Jul 2005 03:12 UTC

On 7/29/05, William D Clinger <xxxxxx@verizon.net> wrote:
>
> Alex Shinn wrote:
>  > What you describe is not the Boyer-Moore algorithm.  It is a
>  > (freshman level variant of) a variant of the Boyer-Moore algorithm
>
> My message was perfectly clear about that.

It was perfectly clear that you were describing a "freshman level variant."
It was not perfectly clear that it was a variant of something that was
not the true BM.

>  > which trades a time and space cost of Sigma in the preprocessing
>  > step for a constant factor of lg(m) in both steps.
>
> Why do you continue to speak as though the Boyer-Moore algorithm's
> preprocessing must be done so stupidly?

What is being done stupidly?  Any freshman textbook will include Sigma
in the analysis of BM (at least mine, "Algorithms" by CL&R does, and
all of the web analysis I Googled for did as well).

That's one option - the textbook definition of the BM algorithm.  You
need to modify the algorithm if you want to get rid of Sigma.  You're
proposal used a binary tree to construct the lookup, and your own
analysis thus correctly factored in a lg(m) to the overall cost.  I
represented both of these approaches accurately in my summary.

> My description of the FLBM algorithm illustrates the basic idea of
> a less stupid representation, and you can find it worked out in
> more detail at http://www.codeproject.com/csharp/bmsearch.asp
> for the real Boyer-Moore algorithm.

Ah, so this uses a hash table instead of a binary tree.  For
asymptotic notation this will remove the lg(m) factor, though
it does incur some not insignificant constant overhead.

> If, as you say, the real Boyer-Moore algorithm can be made to run
> just as fast with UTF-8

Can be made to run?  Perhaps you're not grasping the fact that
no changes need to be made to an octet string search algorithm
to work properly with UTF-8.  Here is a concrete example of
a UTF-8 string search example in Chicken:

(define string-search
  (foreign-lambda* int ((c-string haystack) (c-string needle))
    "char *result = strstr(haystack, needle);
    return result ? (result - haystack) : -1;"))

(display (string-search "HERE IS A SIMPLE EXAMPLE" "EXAMPLE"))
(newline)

I used the C FFI to emphasize we're just calling out to a C
function that searches at the octet level.  This function could
easily use the standard BM algorithm, with no need to specially
handle UTF-8.  Do you not understand how the nature of UTF-8
encoding is such that this is a viable approach?

>  > However, the point of the discussion is that people are objecting to
>  > explicitly forcing strings to be character arrays.
>
> That may be *your* point, but it isn't *my* point.  So far as I can
> tell, absolutely no one has proposed, whether explicitly or implicitly,
> that strings be forced to be character arrays.  In particular, SRFI-75
> does not propose that.

The previous draft did in fact propose that, and was objected to:

  http://srfi.schemers.org/srfi-75/mail-archive/msg00055.html

I see the current draft has changed "fixed-length array" to "sequence".
I'm sorry I did not catch this change - perhaps a changelog would be
helpful.

However, from your previous comments:

  http://srfi.schemers.org/srfi-75/mail-archive/msg00250.html

    SRFI-75 doesn't dictate any particular representation,
    and that's good.  SRFI-75 does penalize certain poor choices of
    representation, and I think that's good too.

It seems you are suggesting non-fixed-length arrays as being
poor choices of representation.  Indeed, without some sort of
string pointer or cursor object SRFI-75 does indeed penalize
UTF-8 and ropes.

> I got into this because Per Bothner stated that "Random accesses to a
> position in a string that has not been previously accessed is not in
> itself useful."  I cited the Boyer-Moore string search to show that
> Bothner's statement is untrue.

The points are inherently related.  The reason random access is being
disputed is because it is the only advantage of fixed with string
representations.  I then pointed out that BM is in fact simpler in UTF-8,
and can use the exact same algorithm, therefore can be no slower.
We can dispute whether it is in fact faster till the cows come home,
because the real cost is in the average case time which depends on
the actual data used and hidden constants.

Thus Bothner's statement still stands without a counter-example.

--
Alex