Re: Surrogates and character representation Alex Shinn 28 Jul 2005 01:53 UTC
On 7/28/05, Tom Emerson <xxxxxx@basistech.com> wrote:
> William D Clinger writes:
> Per Bothner wrote:
> > Random accesses to a position in a string that has not
> > been previously accessed is not in itself useful.
>
> In computational linguistics it is common to utilize standoff markup,
> where features in a text are tagged in a separate file via character
> ranges into the original. For example, we may have a file indicating
> that certain prepositional phrases appear at offsets [25,40) and
> [125,160) in the original file. I'm regularly dealing with
> multimegabyte text files with such standoff markup and not having
> random access is a detriment in these applications.

You're missing Per's point. Those features have to have been assigned
by some previous text processing, which had to know the location in
the text in order to choose a tag. Those locations could just as
easily be represented by opaque pointers as by codepoint offsets. To
store these pointers in a separate file they just need to be
serializable. The obvious pointer representation for UTF-8 strings
would be the byte offset, an integer, which serializes as is.

--
Alex
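
To make the byte-offset point above concrete, here is a minimal sketch in Python (not Scheme, and not code from this thread). All function names are hypothetical, and a plain substring search stands in for whatever text processing actually assigns the tags. The tagging pass records byte offsets into the UTF-8 data as it goes, the offsets are written out as ordinary integers, and a later lookup slices the bytes directly without counting codepoints.

```python
# Sketch only: helper names are hypothetical, and a substring search
# stands in for the real tagger that would choose the spans.

def tag_spans(data: bytes, needle: bytes):
    """Scan the UTF-8 bytes once, recording [start, end) byte offsets."""
    spans = []
    pos = data.find(needle)
    while pos != -1:
        spans.append((pos, pos + len(needle)))
        pos = data.find(needle, pos + len(needle))
    return spans

def serialize(spans):
    """Byte offsets are plain integers, so they serialize as is."""
    return "\n".join(f"{start} {end}" for start, end in spans)

def deserialize(text):
    """Read the standoff file back into (start, end) integer pairs."""
    return [tuple(int(n) for n in line.split()) for line in text.splitlines()]

def lookup(data: bytes, span):
    """Slice directly at the byte offsets; no scan from the start of
    the data to count codepoints is needed."""
    start, end = span
    return data[start:end].decode("utf-8")

text = "naïve café: in the café, by the café"
data = text.encode("utf-8")

standoff = serialize(tag_spans(data, "café".encode("utf-8")))
spans = deserialize(standoff)
recovered = [lookup(data, span) for span in spans]
```

Because the tagger produced the offsets while walking the text, they always fall on character boundaries, so the slices decode cleanly. The first span starts at byte 7 rather than codepoint 6, which only matters to a consumer that insists on interpreting the pointer as a codepoint index.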