Re: Surrogates and character representation
Alex Shinn 28 Jul 2005 01:53 UTC
On 7/28/05, Tom Emerson <xxxxxx@basistech.com> wrote:
> William D Clinger writes:
> Per Bothner wrote:
> > Random access to a position in a string that has not
> > been previously accessed is not in itself useful.
>
> In computational linguistics it is common to utilize standoff markup,
> where features in a text are tagged in a separate file via character
> ranges into the original. For example, we may have a file indicating
> that certain prepositional phrases appear at offsets [25,40) and
> [125,160) in the original file. I'm regularly dealing with
> multimegabyte text files with such standoff markup and not having
> random access is a detriment in these applications.
You're missing Per's point. Those features must have been
assigned by some earlier text-processing pass, which had to know
each location in the text in order to choose a tag. Those locations
could just as easily be represented by opaque pointers as by
codepoint offsets. To store the pointers in a separate file, they
just need to be serializable. The obvious pointer representation
for UTF-8 strings is the byte offset, an integer, which
serializes as is.
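
To make that concrete, here is a rough sketch in Python (chosen only for
brevity; nothing here is tied to a particular Scheme implementation). The
helper names tag_phrase and extract, the sample sentence, and the JSON
standoff format are all made up for illustration. The tagging pass records
byte offsets into the UTF-8 data as it goes, the offsets are written out as
plain integers, and a later pass slices the bytes directly without ever
counting codepoints.

```python
import json


def tag_phrase(data, phrase):
    """Hypothetical tagging pass: return the [start, end) byte span of
    the first occurrence of phrase in the UTF-8 encoded data."""
    needle = phrase.encode("utf-8")
    start = data.find(needle)
    if start < 0:
        raise ValueError("phrase not found")
    return (start, start + len(needle))


def extract(data, span):
    """Later pass: slice by byte offset, then decode only the slice."""
    start, end = span
    return data[start:end].decode("utf-8")


text = "Le café est dans la rue."
data = text.encode("utf-8")

# The tagger already knows where it is in the byte stream.
span = tag_phrase(data, "dans la rue")

# The "pointer" is just a pair of integers, so it serializes as is.
serialized = json.dumps({"pp": list(span)})

# A separate program can load the standoff file and index directly.
restored = tuple(json.loads(serialized)["pp"])
recovered = extract(data, restored)
```

Note that the byte span differs from the codepoint span here because the
accented character takes two bytes, but the consumer never needs to know
that. It only has to hand back the same integers the tagger produced.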
--
Alex