Re: Surrogates and character representation
Tom Emerson 28 Jul 2005 03:07 UTC
Alex Shinn writes:
> You're missing Per's point. Those features have to have been
> assigned by some previous text processing, which had to know
> the location in the text in order to choose a tag. Those locations
> could just as easily be represented by opaque pointers as by
> codepoint offsets. To store these pointers in a separate file they
> just need to be serializable. The obvious pointer representation
> for UTF-8 strings would be the byte offset, an integer, which
> serializes as is.
I'm not missing his point, actually. The stand-off markup may be
generated by someone else, say the data provider (in the case of data
acquired from the LDC or ELDA), so I do not have any Scheme-serialized
data, only character offsets into the UTF-8 encoded text.
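
To make the mismatch concrete, here is a rough sketch (in Python rather
than Scheme, purely for illustration) of what a consumer has to do when
the annotations arrive as character offsets but the text is held as
UTF-8 bytes. The function name char_to_byte_offset, the sample text, and
the sample annotation are all made up for this example; they are not
taken from any LDC or ELDA format.

```python
def char_to_byte_offset(data: bytes, char_offset: int) -> int:
    """Return the byte index where code point number `char_offset` starts.

    `data` is assumed to be well-formed UTF-8. An offset equal to the
    number of code points maps to len(data), so it can be used as the
    exclusive end of a span.
    """
    if char_offset < 0:
        raise ValueError("offset must be non-negative")
    seen = 0
    for i, b in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes; any other
        # byte begins a new code point.
        if (b & 0xC0) != 0x80:
            if seen == char_offset:
                return i
            seen += 1
    if seen == char_offset:
        return len(data)
    raise IndexError("character offset is past the end of the text")


# Hypothetical text and stand-off annotation: (start, end, tag), with
# start/end given in code points, end exclusive.
text = "na\u00efve caf\u00e9"
data = text.encode("utf-8")
annotation = (6, 10, "NOUN")

start = char_to_byte_offset(data, annotation[0])
end = char_to_byte_offset(data, annotation[1])
span = data[start:end].decode("utf-8")
```

Each lookup here is a linear scan from the start of the buffer, which is
the cost at issue when the offsets I am handed are character offsets
rather than byte offsets.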
-tree
--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"