Re: Surrogates and character representation Alex Shinn 28 Jul 2005 03:16 UTC
On 7/28/05, Tom Emerson <xxxxxx@basistech.com> wrote:
>
> I'm not missing his point, actually. The stand-off markup may be
> generated by someone else, say the data provider (in the case of data
> acquired from the LDC or ELDA) and hence I do not have any Scheme
> serialized data, rather character offsets into a UTF-8 scheme.

Do either of those actually supply UTF-32 files along with data files
holding codepoint offsets? UTF-8 is by far the most common storage
format for Unicode, and required by most network protocols.

Regardless, this has nothing to do with strings. This involves seeking
to a byte position in a file, and extracting (and optionally converting
to the internal encoding) a chunk of text.

--
Alex
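
As an illustration of the seek-and-extract step described above, here is a minimal sketch in Python rather than Scheme. It assumes the stand-off markup gives byte offsets into a UTF-8 file and that each span starts and ends on a character boundary. The names read_span and demo are hypothetical and are not taken from any SRFI or library.

```python
import os
import tempfile


def read_span(path, byte_offset, byte_length, encoding="utf-8"):
    """Seek to a byte position, read a chunk, and decode it.

    Assumption: byte_offset and byte_length delimit whole characters.
    If they cut a multi-byte sequence, decode() raises UnicodeDecodeError.
    """
    with open(path, "rb") as f:       # binary mode: offsets are in bytes
        f.seek(byte_offset)           # no string indexing is involved
        raw = f.read(byte_length)
    return raw.decode(encoding)       # convert to the internal representation


def demo():
    """Write a small UTF-8 file, then extract one span by byte offset."""
    text = "naïve café, 日本語 text"
    data = text.encode("utf-8")
    target = "日本語".encode("utf-8")
    start = data.index(target)        # byte offset, as stand-off markup might record
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        return read_span(path, start, len(target))
    finally:
        os.remove(path)
```

Only the extracted chunk is ever decoded, so the cost of the lookup does not depend on how the language represents strings internally.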