Re: the "Unicode Background" section Thomas Lord (22 Jul 2005 18:54 UTC)

me: [don't exclude surrogates from CHAR, make basic
     unicode algorithms easy.  Like what you ask?
     Hmm... how about emitting a UTF-16 stream --
     isn't that a job for UTF-16?]

mflatt: [the plan is to add {WRITE,READ}-BYTE]

Another example would be a procedure
which reifies any number of the basic Unicode
databases.   I think that if you exclude surrogates,
you quickly get an Emacs Lisp-like situation of
having to have many character APIs that also
accept integers instead of characters.   Just seems
to get quickly icky and to undermine the disjointness
properties of the type system.
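
A minimal sketch of the point, in Python rather than Scheme (chosen
only because Python strings happen to permit surrogate code points as
characters).  The helper name general_category is hypothetical; it
wraps the standard unicodedata module.  Because a surrogate is an
ordinary character value here, the one character-taking procedure
covers it, and no parallel integer-taking API is needed:

```python
import unicodedata


def general_category(ch: str) -> str:
    """Return the Unicode general category of a single character.

    Hypothetical helper for illustration; it simply wraps
    unicodedata.category from the standard library.
    """
    return unicodedata.category(ch)


# A surrogate code point is an ordinary character value here,
# so the same procedure answers for it as for any other character.
lone_surrogate = chr(0xD800)

print(general_category("A"))             # Lu (uppercase letter)
print(general_category(lone_surrogate))  # Cs (surrogate)
```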

(On the other hand, it's good to see that byte-oriented
I/O will be added.)

mflatt: I'm not concerned about how to implement characters
        internally. I'm concerned with how to communicate
        with the rest of the world. [Therefore, allowing
        unpaired surrogates breaks interoperability --
        they can't be read or written using only well-formed
        standard encodings.]

Permitting unpaired surrogates does not damage interoperability
-- programs need only avoid trying to transmit them on channels
where strictly well-formed UTF-* is called for.  There are
many different approaches a program might take to that end --
which is the point: it's the business of applications, not
Scheme implementors.

I think that mostly the issue won't show up.  A program would
have to go out of its way to create an unpaired surrogate.
If a program is reading characters only from well-formed
streams, and not creating unpaired surrogates on its own,
then there is never an issue.   It also isn't hard to imagine
applications which internally create unpaired surrogates with
abandon (perhaps while doing character set computations, say)
but which, by their nature, are in no danger of ever trying to
transmit these on a strictly well-formed channel.
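
One of the many approaches an application might take can be sketched
as follows, again in Python as a stand-in.  The names has_surrogate
and send_strict_utf8 are hypothetical, and raising ValueError is just
one possible policy; an application could equally strip or replace
the offending characters before transmission:

```python
def has_surrogate(text: str) -> bool:
    """True if the string contains any surrogate code point (U+D800..U+DFFF)."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def send_strict_utf8(text: str) -> bytes:
    """Encode text for a channel that requires well-formed UTF-8.

    The application, not the string type, enforces the restriction:
    strings with surrogate code points are rejected before encoding.
    """
    if has_surrogate(text):
        raise ValueError("surrogate code point cannot be sent as well-formed UTF-8")
    return text.encode("utf-8")


# Internal computations may build such strings freely ...
scratch = "abc" + chr(0xD800)

# ... and only the transmission step has to care.
print(has_surrogate(scratch))     # True
print(send_strict_utf8("abc"))    # b'abc'
```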

Also: that none of the currently popular encoding forms permits
unpaired surrogates is interesting but not normative for all
possible encoding forms.   Either future standards or application
specific encoding forms may, for good reasons, want to permit
the encoding of unpaired surrogates.
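
As an illustration that such an encoding form is easy to define,
here is a sketch using Python's "surrogatepass" error handler, which
applies the ordinary UTF-8 bit layout to surrogate code points.  The
result is NOT well-formed UTF-8 and is shown only as an example of an
application-specific form; the function names are hypothetical:

```python
def encode_permissive(text: str) -> bytes:
    """Encode with the UTF-8 bit layout, letting surrogates through."""
    return text.encode("utf-8", errors="surrogatepass")


def decode_permissive(data: bytes) -> str:
    """Inverse of encode_permissive."""
    return data.decode("utf-8", errors="surrogatepass")


original = "a" + chr(0xD800) + "b"
wire = encode_permissive(original)

print(wire)                                  # b'a\xed\xa0\x80b'
print(decode_permissive(wire) == original)   # True

# A strict UTF-8 decoder rejects the same bytes.
try:
    wire.decode("utf-8")
except UnicodeDecodeError:
    print("strict decoder rejected the permissive form")
```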

mflatt: This is another facet of how I was unclear. I'm less worried
        about rejecting ill-formed input than having to do something
        sensible on output. If standard Scheme allows programmers to
        output a string "\uD800", then my implementation will need to
        handle that case somehow.

In my view, the behavior of DISPLAY (in R6RS, not forever) should be
undefined in that case (and in all cases where a string contains a
character that does not fit in 8 bits) -- especially if you are adding
{READ,WRITE}-BYTE and thereby enabling experimentation in Unicode I/O
APIs.

Of course, that leaves you with the question: what should *your*
implementation do?  Alas, I don't think it will have to do anything
it shouldn't be prepared to do otherwise.   Among the kinds of
real-world ports your implementation should be prepared to support
are ASCII-only and ISO-8859-*-only ports.   It would be about
as bad to write other characters (whether in UTF-* or not) on
those ports as it would be to blindly apply the UTF-* encoding
algorithm to input like your example string.   Your
implementation must have a DISPLAY which is either strict -- in
which case you can signal an error, given that string -- or lax --
in which case it is up to programs to avoid giving
it such input.
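
The two choices can be sketched like this, with Python standing in
for Scheme.  The procedure display and its strict flag are
hypothetical, and a byte buffer stands in for an ASCII-only port.
Note the assumption in the lax branch: substituting "?" is only one
of many things a lax DISPLAY might do, picked here so that the
example runs:

```python
import io


def display(text: str, port: io.BytesIO, encoding: str = "ascii",
            strict: bool = True) -> None:
    """Write text to a byte port that accepts only the given encoding.

    strict=True  -> signal an error for any unencodable character.
    strict=False -> substitute '?' (one possible lax behavior).
    """
    errors = "strict" if strict else "replace"
    port.write(text.encode(encoding, errors=errors))


bad = "x" + chr(0xD800)   # unpaired surrogate
also_bad = "caf\u00e9"    # equally unwritable on an ASCII-only port

# Strict: both strings are rejected in the same way.
for s in (bad, also_bad):
    try:
        display(s, io.BytesIO())
    except UnicodeEncodeError:
        print("strict DISPLAY signalled an error")

# Lax: the port receives a substitute byte instead.
port = io.BytesIO()
display(bad, port, strict=False)
print(port.getvalue())    # b'x?'
```

The surrogate gets no special treatment in either mode; it is just
one more character the port cannot carry.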

mflatt: [from anecdotes about experience]

  In any case, I removed the surrogates, but left the range extended.
  This first bit me when I started testing the GUI toolkit. For example,
  MzScheme handed the Mac toolbox a "UTF-8" encoded string with the
  "code point" #x10000000 in it, and the toolbox promptly complained,
  because it wasn't well-formed UTF-8. Output remained a problem for the
  same reason, of course, though that took me a little while longer to
  discover.

I don't see a problem there, other than that other aspects of your
toolkit may have had features that invited this misuse of an API that
is not total over the set of Scheme strings.

As evidence that other factors in your toolkit may have been the real
problem I'll cite GNU Emacs.  It, too, has "extra bits" in the character
set and plenty of APIs that have no great interpretation for what to
do with them.  The extra bits are fantastic for keymapping (translating
keystrokes into commands) but horrid for lots of other purposes (like
inserting a string into a text buffer).   My Emacs Lisp skills are
rusty but, afair, the solutions are a mix of things: errors in some
cases, brutally stripping extra bits in others, etc.   Seems to work
fine.

  The problem here isn't really about UTF-8, but the mismatch in
  definitions of character.

A more radical proposal for R6RS than mine would be to
add user-defined disjoint types, bytes and uniform arrays
of bytes, and subtract out characters and strings entirely.

As you say, "character" is a type with overloaded uses.
It's what keyboards and similar input devices produce.
It's what is used to represent linguistic text in algorithms.
It's what is used in transmissions, including display to users.
All three have very different requirements yet, by convention,
we mush together a kind of "union type" out of their varying needs,
mostly because a subset of data from and to those uses tends to flow
from one of the uses to the others -- it's handy to have
a disjoint type for this big, by-convention, enumerated
type.

  My choices seemed to be to define a subset
  of strings that were allowed for GUI labels and such, or to fix the
  definition of character. The former seemed error prone (it wasn't
  clear how many places that would be necessary, both now and in the
  future), so I went with the latter.

I've been aiming for (in my proposals here) R6RS tweaks that would
at least come very close to leaving the implementation you describe
as conforming.   Your choices there seem like sane ones, especially
for a system chartered for education.

At the same time: do you know Emacs Lisp?  I think it provides
some insights about what can be "relaxed" and how without making
the resulting environment a nightmare, even for students.

-t