Re: Surrogates and character representation

Re: Surrogates and character representation Per Bothner 28 Jul 2005 15:33 UTC

If you have large UTF-8 text files, the most efficient solution
is to use byte indexes.  That allows you to:
(1) use random access on the actual text files, without first reading
them into memory and expanding them to UTF-32.
(2) map the file as-is into memory and index into the resulting buffer
without any conversion of the data or the indexes.
It follows that the most efficient internal representation is also
UTF-8, since it matches the files and lets you use the same
byte indexes without conversion.
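
To illustrate point (2), here is a minimal Python sketch, not taken from
any implementation discussed in this thread. The helper names `read_span`
and `demo` and the sample text are hypothetical. It maps a UTF-8 file into
memory and slices it with byte offsets directly, never building a UTF-32
(or other decoded) copy of the whole file.

```python
import mmap
import os
import tempfile


def read_span(path, start, end):
    """Decode the UTF-8 bytes in [start, end) from a memory-mapped file.

    start and end are byte offsets and are assumed to fall on
    character boundaries; nothing outside the span is decoded.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            return buf[start:end].decode("utf-8")


def demo():
    # Hypothetical sample text mixing 1-, 2- and 3-byte characters.
    text = "naïve café 日本語 text"
    needle = "日本語".encode("utf-8")
    data = text.encode("utf-8")

    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Byte offsets computed once can be reused against the file
        # or the mapped buffer without any conversion.
        start = data.index(needle)
        end = start + len(needle)
        return start, end, read_span(path, start, end)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Here the span starts at byte offset 13 but at codepoint offset 11, so a
codepoint index could not be used against the mapped buffer without
scanning the text from the start.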

This argument assumes you're willing to standardize on UTF-8 for
your text files, which is a reasonable thing to do, but may be
difficult to get agreement on.  If you don't agree that the canonical
representation is UTF-8, then using codepoint indexes may be better.

Another argument for codepoint offsets over byte offsets applies
when the offsets are going to be read by humans, perhaps in email or
journal articles, since people unfamiliar with UTF-8 may be confused by
byte offsets.  However, this is a fairly weak argument, since you have
the same issue with composite characters (a base character followed by
combining marks): in that case codepoint offsets will also not match
the characters that people see.
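
A small Python sketch of that last point, using only the standard
`unicodedata` module. The helper name `counts` and the sample string are
hypothetical. A reader sees one character, yet neither the codepoint
count nor the byte count is 1 unless the text happens to be normalized
to a precomposed form.

```python
import unicodedata


def counts(s):
    """Return (codepoint count, UTF-8 byte count) for the string s."""
    return len(s), len(s.encode("utf-8"))


# 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as a single "é".
decomposed = "e\u0301"

# NFC normalization composes the pair into the single codepoint U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

if __name__ == "__main__":
    print(counts(decomposed))  # 2 codepoints, 3 bytes
    print(counts(composed))    # 1 codepoint, 2 bytes
```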
--
	--Per Bothner
xxxxxx@bothner.com   http://per.bothner.com/