
Re: Surrogates and character representation Tom Emerson 28 Jul 2005 16:48 UTC

Per Bothner writes:
> If you have large UTF-8 text files, clearly the most efficient solution
> is to use byte indexes.  That allows you to:
> (1) use random-access on the actual text files, without first reading
> them into memory and expanding them to UTF-32.
> (2) map the file as-is into memory and index into the resulting buffer
> without any conversion of the data or the indexes.
> It follows that the most efficient internal representation is also
> UTF-8, since it matches the files, and allows you to use the same
> byte indexes without conversion.

Yes, this is great in theory, but in practice we have to deal with
data that isn't stored this way and cannot be converted. Again, as I
said earlier, code point indexes are not tied to a particular
encoding. When you get data from multiple sources, you have to deal
with these encoding differences.
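
To illustrate the point about code point indexes, here is a minimal
Python sketch (not anything from the tools discussed in this thread).
The helper name byte_offset and the sample string are made up for the
example. It shows one code point index mapping to a different byte
offset in each encoding, which is why a byte index only makes sense
once everyone has agreed on a single encoding.

```python
def byte_offset(text: str, cp_index: int, encoding: str) -> int:
    """Return the byte offset of the code point at cp_index once text
    is encoded with the given encoding (hypothetical helper)."""
    # Encode only the prefix before the index and measure its length.
    return len(text[:cp_index].encode(encoding))


# Sample text mixing ASCII, Latin-1, CJK, and a non-BMP character
# (U+1D11E MUSICAL SYMBOL G CLEF). Escapes avoid source-encoding issues.
sample = "na\u00efve \u65e5\u672c\u8a9e \U0001D11E text"

# The code point index is a property of the text, not of any encoding.
cp_index = sample.index("\U0001D11E")

# BOM-less encodings are used so the offsets are not shifted by a BOM.
encodings = ("utf-8", "utf-16-le", "utf-32-le")
offsets = {enc: byte_offset(sample, cp_index, enc) for enc in encodings}

for enc, off in offsets.items():
    # Decoding from the computed offset recovers the same character.
    recovered = sample.encode(enc)[off:].decode(enc)[0]
    print(f"{enc}: code point index {cp_index} -> byte offset {off} ({recovered!r})")
```

The conversion costs time proportional to the length of the prefix,
which is the price of keeping indexes encoding-neutral.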

> This argument assumes you're willing to standardize on UTF-8 for
> your text files, which is a reasonable thing to do, but may be
> difficult to agree on.  If you don't agree that the canonical
> representation is UTF-8, then using character indexes may be better.

It is completely reasonable, but linguists (with some exceptions)
generally don't know or care about encodings. They don't think about
them: encodings sit below the level they are interested in. They
create data sets however they find convenient, and I have to work
with that.

--
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"