Re: Surrogates and character representation
Tom Emerson 28 Jul 2005 16:48 UTC
Per Bothner writes:
> If you have large UTF-8 text files, clearly the most efficient solution
> is to use byte indexes. That allows you to:
> (1) use random-access on the actual text files, without first reading
> them into memory and expanding them to UTF-32.
> (2) map the file as-is into memory and index into the resulting buffer
> without any conversion of the data or the indexes.
> It follows that the most efficient internal representation is also
> UTF-8, since it matches the files, and allows you to use the same
> byte indexes without conversion.
Yes, this is great in theory, but in practice we have to deal with
data that isn't UTF-8 and cannot be converted to it. As I said
earlier, codepoint indexes are not tied to a particular encoding. When
data comes from multiple sources, you have to deal with these
differences.
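
To make the point concrete, here is a rough sketch (not anyone's
actual implementation) of how one codepoint index maps to a different
byte offset in each encoding. The sample string and the helper
byte_offset are made up for illustration, and it assumes a Python 3
str, which is indexed by code point.

```python
# Hypothetical sample: 'a', U+00E9, U+1D11E (outside the BMP), 'z'.
text = "a\u00e9\U0001D11Ez"

def byte_offset(s, cp_index, encoding):
    """Byte offset of the code point at cp_index when s is stored in encoding."""
    # Encode everything before the code point and count the bytes.
    return len(s[:cp_index].encode(encoding))

cp_index = 3  # the code point 'z', whatever the storage encoding is

# The "-le" codec names are used so that no byte order mark is emitted.
offsets = {enc: byte_offset(text, cp_index, enc)
           for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for enc, off in offsets.items():
    # Slicing the encoded bytes at the offset lands on the same character.
    assert text.encode(enc)[off:].decode(enc) == "z"
    print(enc, off)
# utf-8 7
# utf-16-le 8
# utf-32-le 12
```

The codepoint index (3) is the same for all three sources, while a
byte index only makes sense once you know which encoding it was
measured against.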
> This argument assumes you're willing to standardize on UTF-8 for
> your text files, which is a reasonable thing to do, but may be
> difficult to agree on. If you don't agree that the canonical
> representation is UTF-8, then using character indexes may be better.
It is completely reasonable, but generally linguists (with some
exceptions) don't know and don't care about encodings. They don't
think about them: they exist below the level they are interested
in. They create data sets as they find convenient, and I have to work
with that.
--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"