Re: Surrogates and character representation
Alex Shinn 28 Jul 2005 03:16 UTC
On 7/28/05, Tom Emerson <xxxxxx@basistech.com> wrote:
>
> I'm not missing his point, actually. The stand-off markup may be
> generated by someone else, say the data provider (in the case of data
> acquired from the LDC or ELDA) and hence I do not have any Scheme
> serialized data, rather character offsets into a UTF-8 scheme.
Does either of those actually supply UTF-32 files along with data
files holding codepoint offsets? UTF-8 is by far the most common
storage format for Unicode, and required by most network protocols.
Regardless, this has nothing to do with strings. This involves
seeking to a byte position in a file, and extracting (and optionally
converting to the internal encoding) a chunk of text.
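
As a rough sketch of what I mean (Python only for brevity; read_span and
demo are made-up names, and I am assuming the stand-off offsets are byte
offsets that fall on UTF-8 character boundaries):

```python
import os
import tempfile


def read_span(path, byte_start, byte_end, encoding="utf-8"):
    """Return the text stored in bytes [byte_start, byte_end) of a file.

    Assumes both offsets fall on character boundaries in the given
    encoding; otherwise decode() raises UnicodeDecodeError.
    """
    with open(path, "rb") as f:          # binary mode: offsets are bytes
        f.seek(byte_start)               # jump straight to the span
        raw = f.read(byte_end - byte_start)
    return raw.decode(encoding)          # convert to the internal string type


def demo():
    """Write a small UTF-8 file and pull one annotated span back out."""
    text = "naïve 日本語 text"
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(text.encode("utf-8"))
        # Byte offsets of the span, as stand-off markup might record them.
        start = len("naïve ".encode("utf-8"))
        end = start + len("日本語".encode("utf-8"))
        return read_span(path, start, end)
    finally:
        os.remove(path)
```

No string indexing is involved at any point: the file is addressed by
byte position, and only the extracted chunk is ever decoded.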
--
Alex