Re: the "Unicode Background" section Thomas Lord (22 Jul 2005 18:54 UTC)
me: [don't exclude surrogates from CHAR, make basic unicode algorithms easy. Like what you ask? Hmm... how about emitting a UTF-16 stream -- isn't that a job for UTF-16?]

mflatt: [the plan is to add {WRITE,READ}-BYTE]

Another example would be a procedure which reifies any number of the basic Unicode databases. I think that if you exclude surrogates, you quickly get an Emacs Lisp-like situation of having to have many character APIs that also accept integers instead of characters. Just seems to get quickly icky and to undermine the disjointness properties of the type system. (On the other hand, it's good to see that byte-oriented I/O will be added.)

mflatt: I'm not concerned about how to implement characters internally. I'm concerned with how to communicate with the rest of the world. [Therefore, allowing unpaired surrogates breaks interoperability -- they can't be read or written using only well-formed standard encodings.]

Permitting unpaired surrogates does not damage interoperability -- programs need only avoid trying to transmit them on channels where strictly well-formed UTF-* is called for. There are many different approaches a program might take to that end -- which is the point: it's the business of applications, not Scheme implementors.

I think that mostly the issue won't show up. A program would have to go out of its way to create an unpaired surrogate. If a program is reading characters only from well-formed streams, and not creating unpaired surrogates on its own, then there is never an issue.

It also isn't hard to imagine applications which internally create unpaired surrogates with abandon (perhaps while doing character set computations, say) but which, by their nature, are in no danger of ever trying to transmit these on a strictly well-formed channel.

Also: that none of the currently popular encoding forms permits unpaired surrogates is interesting but not normative to all possible encoding forms.
Either future standards or application-specific encoding forms may, for good reasons, want to permit the encoding of unpaired surrogates.

mflatt: This is another facet of how I was unclear. I'm less worried about rejecting ill-formed input than having to do something sensible on output. If standard Scheme allows programmers to output a string "\uD800", then my implementation will need to handle that case somehow.

In my view, DISPLAY (in R6RS, not forever) should be undefined in that case (and in all cases where a string contains a non-8-bit-character) -- especially if you are adding {READ,WRITE}-BYTE and thereby enabling experimentation in Unicode I/O APIs.

Of course, that leaves you with the question: what should *your* implementation do? Alas, I don't think it will have to do anything it shouldn't be prepared to do otherwise. Among the kinds of real-world ports your implementation should be prepared to support are ASCII-only and ISO-8859-*-only ports. It would be about as bad to write other characters (whether in UTF-* or not) on those ports as it would be to blindly apply the UTF-* encoding algorithm to input like your example string. Either your implementation must have a DISPLAY which is strict -- in which case you can signal an error, given that string -- or lax -- in which case it is up to implementations to avoid giving it such input.

mflatt: [from anecdotes about experience] In any case, I removed the surrogates, but left the range extended. This first bit me when I started testing the GUI toolkit. For example, MzScheme handed the Mac toolbox a "UTF-8" encoded string with the "code point" #x10000000 in it, and the toolbox promptly complained, because it wasn't well-formed UTF-8. Output remained a problem for the same reason, of course, though that took me a little while longer to discover.
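
For readers who want the strict-versus-other-policies distinction and the #x10000000 anecdote in concrete form, here is a rough sketch. It is written in Python rather than Scheme, it models strings as lists of integer code points, and every name in it (is_scalar_value, display_strict, display_replacing) is hypothetical: none of them comes from the thread, from MzScheme, or from any R6RS draft. Substituting U+FFFD is only one of the many approaches an application might pick, not something either poster proposed.

```python
# Sketch only: helper names are illustrative assumptions, not an existing API.

MAX_CODE_POINT = 0x10FFFF            # upper bound of the Unicode code space
SURROGATES = range(0xD800, 0xE000)   # U+D800..U+DFFF


def is_scalar_value(cp: int) -> bool:
    """True if cp can appear in well-formed UTF-8 (a Unicode scalar value)."""
    return 0 <= cp <= MAX_CODE_POINT and cp not in SURROGATES


def display_strict(code_points) -> bytes:
    """Strict policy: signal an error rather than emit ill-formed output."""
    cps = list(code_points)
    for cp in cps:
        if not is_scalar_value(cp):
            raise ValueError(f"cannot write #x{cp:X} as well-formed UTF-8")
    return "".join(chr(cp) for cp in cps).encode("utf-8")


def display_replacing(code_points, replacement: int = 0xFFFD) -> bytes:
    """One possible application-level policy: substitute U+FFFD."""
    cleaned = [cp if is_scalar_value(cp) else replacement for cp in code_points]
    return "".join(chr(cp) for cp in cleaned).encode("utf-8")


if __name__ == "__main__":
    print(display_strict([0x48, 0x69]))
    print(display_replacing([0x41, 0xD800, 0x10000000]))
```

Both the lone surrogate #xD800 and the out-of-range value #x10000000 fail the same check, which is why the two cases cause the same trouble on a channel that demands well-formed UTF-8. Python's built-in encoder behaves like the strict variant: "\ud800".encode("utf-8") raises UnicodeEncodeError unless a different error handler is requested.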

I don't see the problem there other than that other aspects of your toolkit may have had features that invited this misuse of an API that is not total over the set of Scheme strings.

As evidence that other factors in your toolkit may have been the real problem, I'll cite GNU Emacs. It, too, has "extra bits" in the character set and plenty of APIs that have no great interpretation for what to do with them. The extra bits are fantastic for keymapping (translating keystrokes into commands) but horrid for lots of other purposes (like inserting a string into a text buffer). My Emacs Lisp skills are rusty but, afair, the solutions are a mix of things: errors in some cases, brutally stripping extra bits in others, etc. Seems to work fine.

mflatt: The problem here isn't really about UTF-8, but the mismatch in definitions of character.

A more radical proposal for R6RS than mine would be to add user-defined disjoint types, bytes and uniform arrays of bytes, and subtract out characters and strings entirely.

As you say, "character" is a type with overloaded uses. It's what keyboards and similar input devices produce. It's what is used to represent linguistic text in algorithms. It's what is used in transmissions, including display to users. All three have very different requirements yet, by convention, we mush together a kind of "union type" out of their varying needs, mostly because a subset of data from and to those uses tends to flow from one of the uses to the others -- it's handy to have a disjoint type for this big, by-convention, enumerated type.

mflatt: My choices seemed to be to define a subset of strings that were allowed for GUI labels and such, or to fix the definition of character. The former seemed error prone (it wasn't clear how many places that would be necessary, both now and in the future), so I went with the latter.

I've been aiming for (in my proposals here) R6RS tweaks that would at least come very close to leaving the implementation you describe as conforming.
Your choices there seem like sane ones, especially for a system chartered for education. At the same time: do you know Emacs Lisp? I think it provides some insights about what can be "relaxed" and how without making the resulting environment a nightmare, even for students.

-t