Email list hosting service & mailing list manager

Re: Surrogates and character representation William D Clinger (28 Jul 2005 05:35 UTC)
Re: Surrogates and character representation Alex Shinn (28 Jul 2005 07:25 UTC)

Re: Surrogates and character representation Alex Shinn 28 Jul 2005 07:25 UTC

On 7/28/05, William D Clinger <xxxxxx@verizon.net> wrote:
>
> You certainly don't need character offsets to do a string
> search, but the naive algorithm without random access to
> characters is O(mn).  The Boyer-Moore algorithm improves
> this to O(n/m) in many cases.  I believe one can construct
> artificial examples to show that some O(n/m) cases would
> degrade to an intermediate complexity, or even back to O(mn),
> in UTF-8 or UTF-16 without character offsets.

This is not correct.  Any search algorithm that works on bytes
will work on on UTF-8 strings.  That is, given a C function that
searches for a char* within a char* (e.g. strstr) then that will
return the correct result if the arguments are UTF-8 encoded,
no matter what algorithm is used.

It is in fact UTF-32 that has additional overhead for Boyer-Moore
searches as mentioned in my previous mail.

--
Alex