Re: Surrogates and character representation Tom Emerson 28 Jul 2005 03:07 UTC
Alex Shinn writes:
> You're missing Per's point. Those features have to have been
> assigned by some previous text processing, which had to know
> the location in the text in order to choose a tag. Those locations
> could just as easily be represented by opaque pointers as by
> codepoint offsets. To store these pointers in a separate file they
> just need to be serializable. The obvious pointer representation
> for UTF-8 strings would be the byte offset, an integer, which
> serializes as is.

I'm not missing his point, actually. The stand-off markup may be
generated by someone else, say the data provider (in the case of data
acquired from the LDC or ELDA) and hence I do not have any Scheme
serialized data, rather character offsets into a UTF-8 scheme.

    -tree

-- 
Tom Emerson
Basis Technology Corp.
Software Architect
http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"
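
To make the distinction between the two kinds of offset concrete, here is a rough sketch in Python (not Scheme, and not taken from any SRFI or from either poster's code). The function names are hypothetical, chosen only for illustration. It assumes the externally supplied stand-off markup counts code points, while the text itself is stored as UTF-8 bytes, so a conversion step is needed before a byte offset can be used as the "pointer".

```python
def codepoint_to_byte_offset(text: str, cp_offset: int) -> int:
    """Return the UTF-8 byte offset of the code point at index cp_offset.

    Hypothetical helper: assumes the stand-off markup counts code points
    from the start of the text, starting at zero.
    """
    if not 0 <= cp_offset <= len(text):
        raise IndexError("code point offset out of range")
    # Encoding the prefix is O(n), which is the cost that a byte-offset
    # representation avoids.
    return len(text[:cp_offset].encode("utf-8"))


def byte_to_codepoint_offset(data: bytes, byte_offset: int) -> int:
    """Return the code point index corresponding to a UTF-8 byte offset.

    Raises UnicodeDecodeError if byte_offset falls inside a multi-byte
    sequence, i.e. if it does not sit on a character boundary.
    """
    if not 0 <= byte_offset <= len(data):
        raise IndexError("byte offset out of range")
    return len(data[:byte_offset].decode("utf-8"))


# "naive" with U+00EF, followed by three CJK characters (3 bytes each).
sample = "na\u00efve \u65e5\u672c\u8a9e text"
encoded = sample.encode("utf-8")

# Code point 6 is the first CJK character; the two-byte U+00EF before it
# shifts its byte offset by one.
cjk_start_bytes = codepoint_to_byte_offset(sample, 6)
# Code point 9 is the space after the three CJK characters.
after_cjk_bytes = codepoint_to_byte_offset(sample, 9)
```

The two offsets agree only for pure ASCII text. With data from a third party, only the provider's convention is known, so the conversion above (or an index built from it) would have to run before byte offsets could stand in for the supplied character offsets.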