Re: Surrogates and character representation

Re: Surrogates and character representation Alex Shinn 28 Jul 2005 01:53 UTC

On 7/28/05, Tom Emerson <xxxxxx@basistech.com> wrote:
> William D Clinger writes:
> Per Bothner wrote:
> > Random accesses to a position in a string that has not
> > been previously accessed is not in itself useful.
>
> In computational linguistics it is common to utilize standoff markup,
> where features in a text are tagged in a separate file via character
> ranges into the original. For example, we may have a file indicating
> that certain prepositional phrases appear at offsets [25,40) and
> [125,160) in the original file. I'm regularly dealing with
> multimegabyte text files with such standoff markup and not having
> random access is a detriment in these applications.

You're missing Per's point.  Those features have to have been
assigned by some previous text processing, which had to know
the location in the text in order to choose a tag.  Those locations
could just as easily be represented by opaque pointers as by
codepoint offsets.  To store these pointers in a separate file they
just need to be serializable.  The obvious pointer representation
for UTF-8 strings would be the byte offset, an integer, which
serializes as is.
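
As a rough illustration of the above (a minimal sketch, not anything proposed in this thread: the sample text and the `tag_words` helper are made up for the example), a tagger can record half-open byte ranges while it scans, store them as plain integers, and a later reader can slice the UTF-8 data directly at those offsets:

```python
import json

def tag_words(data: bytes, target: bytes):
    """Hypothetical tagger: scan once and record half-open byte
    ranges [start, end) for each occurrence of target."""
    spans = []
    pos = data.find(target)
    while pos != -1:
        spans.append((pos, pos + len(target)))
        pos = data.find(target, pos + len(target))
    return spans

# Sample text with multi-byte characters, so byte offsets and
# codepoint offsets differ.
text = "naïve café, naïve thé"
data = text.encode("utf-8")

# The tagging pass already knows where it is in the byte string.
spans = tag_words(data, "naïve".encode("utf-8"))

# The "separate file": the offsets are integers and serialize as is.
serialized = json.dumps(spans)

# A later consumer slices the bytes directly at the stored offsets,
# with no scan from the start of the text.
restored = [tuple(span) for span in json.loads(serialized)]
recovered = [data[start:end].decode("utf-8") for start, end in restored]
```

The second match starts at byte 14 but at codepoint 12, so the stored offsets are only meaningful against the UTF-8 encoding of the text, not against a codepoint-indexed view of it.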

--
Alex