Re: Surrogates and character representation
Alex Shinn 28 Jul 2005 07:25 UTC
On 7/28/05, William D Clinger <xxxxxx@verizon.net> wrote:
>
> You certainly don't need character offsets to do a string
> search, but the naive algorithm without random access to
> characters is O(mn). The Boyer-Moore algorithm improves
> this to O(n/m) in many cases. I believe one can construct
> artificial examples to show that some O(n/m) cases would
> degrade to an intermediate complexity, or even back to O(mn),
> in UTF-8 or UTF-16 without character offsets.
This is not correct. Any search algorithm that works on bytes
will work on on UTF-8 strings. That is, given a C function that
searches for a char* within a char* (e.g. strstr) then that will
return the correct result if the arguments are UTF-8 encoded,
no matter what algorithm is used.
It is in fact UTF-32 that has additional overhead for Boyer-Moore
searches as mentioned in my previous mail.
--
Alex