Re: freshman-level Boyer-Moore fast string search Alex Shinn 29 Jul 2005 06:05 UTC
On 7/29/05, William D Clinger <xxxxxx@verizon.net> wrote: > Alex Shinn wrote: > > What is being done stupidly? Any freshman textbook will include Sigma > > in the analysis of BM (at least mine, "Algorithms" by CL&R does, and > > all of the web analysis I Googled for did as well). > > You were arguing that UTF-8 "wins hands down for string searching". I retract that, it was too strong a statement. I would still personally prefer UTF-8 for string searching any day, but acknowledge that this is debatable. > Had it been clear that you meant to say nothing more than that > UTF-8 wins hands down, provided the implementor doesn't know any > better than to use the naive representations that are often found > in freshman textbooks, then I would not have responded to you. I never suggested there weren't alternate algorithms. But when you make a change to an algorithm significant enough to alter the asymptotic time, it is typical to refer to that as a variant of the algorithm. Regardless, I did acknowledge the variants, and I quoted your own algorithms and your own running time analysis accurately, and don't believe I deserve this hostility. > > Ah, so this uses a hash table instead of a binary tree. For > > asymptotic notation this will remove the lg(m) factor, though > > it does incur some not insignificant constant overhead. > > I will assume you are smart enough to realize that a potentially > three-stage lookup based on Unicode scalar values, which usually > terminates in the first stage of lookup, is essentially similar > to the potentially six-step lookup required for UTF-8, which > usually involves only one step. Well, UTF-8 is in theory 1-4 bytes, and in practice you almost never see 4-byte codepoints, and 1 byte codepoints are extremely common. Is the three-stage assumption for binary trees? That would suggest the search strings are never longer than 8 characters. In practice I frequently search for much longer strings - "call-with-current-continuation" at 30 characters (5 stage lookup) is fairly common, and mult-line searches are not unheard of. You can't just dismiss a lg(m) factor. If you're interested, I'll gladly benchmark a UTF-8 string search against any UTF-32 string search you want to provide. > > It seems you are suggesting non-fixed-length arrays as being > > poor choices of representation. > > It seems you are not reading very carefully. Out of curiosity, what string representations does SRFI-75 penalize which you consider to be poor? > > Thus Bothner's statement still stands without a counter-example. > > What you mean, of course, is that Bothner's statement still stands > without a counterexample you are willing to admit. [...] > In case you have forgotten Bothner's statement, here it is: > "Random accesses to a position in a string that has not been > previously accessed is not in itself useful." You are correct, and I was mistaken, I was mentally translating this to "Random accesses to a position in a string that has not been previously accessed is not *necessary*," and was arguing strongly on this point because of the previous draft of the SRFI which restricted strings to character arrays. "Usefulness" is fairly meaningless to argue about - people can find a use for just about anything. I will stick to the claim that random access is not necessary, for performance or other reasons. -- Alex