Email list hosting service & mailing list manager


Re: Surrogates and character representation William D Clinger 29 Jul 2005 05:29 UTC

Alex Shinn quoting me:

 > > You certainly don't need character offsets to do a string
 > > search, but the naive algorithm without random access to
 > > characters is O(mn). The Boyer-Moore algorithm improves
 > > this to O(n/m) in many cases. I believe one can construct
 > > artificial examples to show that some O(n/m) cases would
 > > degrade to an intermediate complexity, or even back to O(mn),
 > > in UTF-8 or UTF-16 without character offsets.
 >
 > This is not correct. Any search algorithm that works on bytes
 > will work on on UTF-8 strings. That is, given a C function that
 > searches for a char* within a char* (e.g. strstr) then that will
 > return the correct result if the arguments are UTF-8 encoded,
 > no matter what algorithm is used.

I acknowledged that the algorithm will still work.  My point
is that its asymptotic complexity may be degraded.

> It is in fact UTF-32 that has additional overhead for Boyer-Moore
> searches as mentioned in my previous mail.

Your claim was incorrect.  Your claim appears to depend upon
ignorance of techniques for constructing, representing, and
manipulating sparse automata efficiently, and upon your mistaken
belief that c1 > c2 implies O(m+c1) > O(m+c2), where c1 and c2
are constants.  Finally, you ignored the difference between O(n)
and O(n/m), which was the point of the paragraph you quoted above.

Per Bothner gave a much better response to this issue.

Will