Re: Surrogates and character representation Alex Shinn 28 Jul 2005 01:45 UTC
On 7/28/05, Alan Watson <xxxxxx@astrosmo.unam.mx> wrote: > William D Clinger wrote: > > Per Bothner wrote: > > > Random accesses to a position in a string that has not > > > been previously accessed is not in itself useful. > > > > Untrue. The Boyer-Moore algorithm for fast string > > searching uses random accesses to positions that have > > not been previously accessed [1]. > > Yes, but I think you can implement this for UTF-8 or UTF-16 strings > using offsets to the underlying bytes or shorts. I don't think that you > need character offsets. True. In fact any existing Boyer-Moore search that works at the 8-bit byte level will work unmodified on UTF-8 strings. Furthermore, the true cost of BM includes a preprocessing step of O(m+sigma) in both time and space, where m is the pattern length and sigma is the size of the alphabet. For the byte-level UTF-8 search, sigma is 256, but for UTF-32 sigma is 2^21 = 2097152. For patterns that are going to be repeatedly searched, it is reasonable to keep the table for UTF-8 in memory, but the table for UTF-32, if using a SRFI-4 u32vector for offsets, would incur an extra 8MB of memory per pattern. This isn't reasonable - in practice you'll want to perform the preprocessing step every time for UTF-32. UTF-8 wins hands down for string searching. -- Alex