Re: Surrogates and character representation Alan Watson 27 Jul 2005 22:14 UTC

William D Clinger wrote:
> Referring to the Boyer-Moore fast string searching algorithm,
> Alan Watson wrote:
>  > Yes, but I think you can implement this for UTF-8 or UTF-16
>  > strings using offsets to the underlying bytes or shorts.  I
>  > don't think that you need character offsets.
>
> You certainly don't need character offsets to do a string
> search, but the naive algorithm without random access to
> characters is O(mn).  The Boyer-Moore algorithm improves
> this to O(n/m) in many cases.  I believe one can construct
> artificial examples to show that some O(n/m) cases would
> degrade to an intermediate complexity, or even back to O(mn),
> in UTF-8 or UTF-16 without character offsets.  I don't know
> how often examples of those cases would arise in practice.

n = string length
m = pattern length

I can see four cases when UTF-8 is the underlying representation:

(a) You have access to the underlying byte vector and you want a byte
index. O(n/m). Life is sweet.

(b) You have access to the underlying byte vector but you want a
character index. O(n/m) to find the byte index then O(n) to convert it
to a character index. Life is fairly sweet.

(c) You do not have access to the underlying byte vector, there is no
caching of character index to byte index conversions, and you want a
character index. O(n²/m), I think, because basically each character
index is an O(n) operation. You say (nm). Either way, life sucks.

(d) You do not have access to the underlying byte vector, the
implementation caches the last two character index to byte index
conversions, and you want a character index. O(n), I think. Life is
fairly sweet.

Case (d) works out not too badly because, I think, your next character
index is always just a few characters (up to m) from one of the last two
character indexes. Yes? I think you could even get away with the
implementation caching only the last index, provided it knows how to
search backwards as well as forwards from this point (pretty easy with
UTF-8).

> I just think
> it's a good idea to understand the efficiency issues before
> we dismiss them.  [...] SRFI-75 does penalize certain poor choices of
> representation, and I think that's good too.

Yes. I was simply making the point that UTF-8 is not such a losing
representation as one might think initially.

> I appreciate the fact that some implementations will want to
> use the same representation as some other language or system,
> even if that is not a particularly good representation.  From
> that point of view, I think the main problem with SRFI-75 is
> that it requires mutable strings, which (in the presence of
> concurrency or an obsession with object identity) make it hard
> to change the representation transparently---code written in
> some other language or in a concurrent thread might notice the
> change, even if the Scheme code in this thread doesn't.

Good point, but I think that if you impose an extra layer of
indirection, you might be able to solve these problems (at least for the
other language reading the Scheme string). For example, instead of
having the Scheme implementation say to C "here is a pointer to a
UTF-8/UTF-16/UCS-32 string that represents this Scheme string", you have
it say "here is a pointer to a pointer to ...". Ditto for the length.

Regards,

Alan
--
Dr Alan Watson
Centro de Radioastronomía y Astrofísica
Universidad Astronómico Nacional de México