Re: Issues with Unicode

Re: Issues with Unicode Jonathan S. Shapiro 09 May 2006 15:08 UTC
At Marc's request, I'm responding to the SRFI-91 list on the subject of
binary ports.

Overall, I think that this SRFI is quite good. That said, I think that
the port hierarchy proposal is misconceived and needs to be abandoned.
The design mistake lies in trying to unify binary ports and character
ports (even partially). Since the typing issues have been addressed in
prior discussion, I'm going to focus on semantic, implementation, and
experience issues here.

A note: I'm assuming in all of this that scheme will move to an
international character set. The problems I am about to discuss do not
manifest in a system implementing only a 7-bit or 8-bit character set.

Historical Background:

Historically, scheme has implicitly required that all ports be character
ports. The requirement that port I/O is character based is implicit in
the definitions of external representation and the relationship between
external representation, READ, and WRITE. The assumption was reinforced
with the introduction of string ports.

So: whatever else is done, we are stuck with character ports as a
fundamental abstraction provided by the language runtime.

One consequence of this is that external encodings cannot be neutral.
While it is possible to implement arbitrary external encodings on top of
READ-BYTE and WRITE-BYTE, there must be at least one multibyte character
encoding that is known to the runtime at primitive level in order to
support the implementations of READ and WRITE: namely, the encoding
used for scheme program input.

* Mixing Bytes and Characters: Semantic issues

It is tempting to imagine that we might simply mix READ-BYTE and
READ-CHAR calls on a port. This breaks down quickly in the face of
multibyte character sets. Consider the situation where the input stream
is pointed at a well-formed multibyte character sequence:

   U+4467 U+27

Now given the sequence:

  (READ-BYTE port)
  (READ-CHAR port)

what should READ-CHAR return? Note that the input cursor is not pointed
at the start of a character. Either READ-CHAR must resynchronize the
stream (which is not possible in some character sets) or it must signal
an error (which suggests that mixing READ-BYTE and READ-CHAR was an
ill-conceived thing to permit). Resynchronizing is probably the right
answer, for many reasons having to do with recovery from malformed
input. The resynchronization behavior of READ-CHAR needs to be defined
regardless of whether it is conflated with READ-BYTE.
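
To make the example concrete, here is a minimal sketch in Python (used
only for illustration). It assumes the stream is UTF-8 encoded, which
the example above does not state, and it uses an in-memory byte stream
as a stand-in for the port.

```python
import io

# Assumption: the port carries UTF-8. U+4467 takes three bytes, U+27 one.
data = "\u4467\u0027".encode("utf-8")
port = io.BytesIO(data)

first = port.read(1)   # plays the role of (READ-BYTE port)
rest = port.read()     # what a following (READ-CHAR port) would have to decode

try:
    rest.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # The remainder begins with a continuation byte, so it cannot be
    # decoded as it stands.
    decoded_ok = False
```

After the single byte is consumed, the cursor sits on a continuation
byte, so a strict decoder has nothing valid to return.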

If we conclude that resynchronizing is the right answer, then the SRFI
should be augmented with something like:

  (SYNC-TO-CHAR port)

which advances past bytes that cannot be the start of a character.
Having such a procedure is probably a good idea in any case.
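
A rough sketch of what such a procedure could look like, in Python and
for UTF-8 only. The name sync_to_char and its return value are
hypothetical, and the sketch assumes a seekable in-memory stream. It
skips only continuation bytes (10xxxxxx) and does not handle other
invalid bytes.

```python
import io


def sync_to_char(port: io.BytesIO) -> int:
    """Advance past UTF-8 continuation bytes; return how many were skipped."""
    skipped = 0
    while True:
        b = port.read(1)
        if not b:
            return skipped                 # end of input
        if (b[0] & 0xC0) != 0x80:
            port.seek(-1, io.SEEK_CUR)     # not a continuation byte: put it back
            return skipped
        skipped += 1


port = io.BytesIO("\u4467\u0027".encode("utf-8"))
port.read(1)                     # consume the lead byte of U+4467
skipped = sync_to_char(port)     # skips the two orphaned continuation bytes
remaining = port.read()
```

Note that this works only because UTF-8 marks continuation bytes
distinctly. An encoding without that property offers no such test.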

But how about:

  (READ-BYTE port)
  (PEEK-CHAR port)
  (PEEK-BYTE port)

Assume that we can resynchronize. The whole point of PEEK-CHAR is that
it should not modify the input stream, but it is obliged to throw away
the bad bytes while resynchronizing. So what does PEEK-BYTE return?
Should PEEK-CHAR do a non-destructive resynchronization? This is
consistent, but *extremely* expensive for reasons I will discuss below
under "implementation issues."
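
The non-destructive option can be sketched with a toy in-memory port in
Python. The class and method names are hypothetical, and UTF-8 is
assumed. Because peek_char must leave the cursor alone, it has to look
ahead past every stray byte plus the whole next character.

```python
class MixedPort:
    """Toy in-memory port over UTF-8 bytes. All names are hypothetical."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0  # cursor shared by the byte and character operations

    def _at(self, i: int) -> int:
        if i >= len(self._data):
            raise EOFError("end of port")
        return self._data[i]

    def read_byte(self) -> int:
        b = self._at(self._pos)
        self._pos += 1
        return b

    def peek_byte(self) -> int:
        return self._at(self._pos)

    def peek_char(self) -> str:
        # Scan ahead without moving the cursor: skip continuation bytes
        # (10xxxxxx), then decode one character starting at the lead byte.
        i = self._pos
        while (self._at(i) & 0xC0) == 0x80:
            i += 1
        lead = self._at(i)
        length = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
        return self._data[i:i + length].decode("utf-8")


port = MixedPort("\u4467\u0027".encode("utf-8"))
first = port.read_byte()   # the lead byte of U+4467
ch = port.peek_char()      # looks past two stray bytes to find U+27
nxt = port.peek_byte()     # the first stray byte is still there
```

Here the lookahead is trivial because the data is already in memory.
On a real file or socket the same behavior needs a buffer that can hold
everything peek_char scanned over.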

Finally, consider operations like:

  (let ((port (open-input-string "some-string")))
    (read-byte port))

Does this imply the need to "explode" the string into bytes at the
implementation level? As I have been re-implementing tinyscheme, I have
been forced to conclude that it probably does.
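
A short Python illustration of why the "explosion" is not neutral: the
bytes a READ-BYTE would deliver depend on which encoding the
implementation picks. The sample string and the three encodings are
arbitrary choices for the sketch.

```python
# Hypothetical port contents containing one non-ASCII character (U+00EF).
s = "na\u00efve"

as_utf8 = s.encode("utf-8")
as_latin1 = s.encode("latin-1")
as_utf16 = s.encode("utf-16-le")

# The third byte a byte-level read would deliver, under each encoding.
third = (as_utf8[2], as_latin1[2], as_utf16[2])
lengths = (len(as_utf8), len(as_latin1), len(as_utf16))
```

The same five-character string yields three different byte sequences of
three different lengths.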

Aside: SRFI-6, and the scheme standard in general, are *extremely*
sloppy about specifying interacting conditions. Things like:

  (let* ((s "abcdef")
         (p (open-input-string s)))
     (read-char p)
     (string-set! s 1 #\d)
     (read-char p) ; returns #\b or #\d??
    )

should be well defined, but are not. Is the port opened on a *copy* of
the string, or does it share state? If it shares state, note that
exploding the string into its constituent bytes in order to implement
READ-BYTE is a nightmare!
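
The two possible semantics can be modeled in a few lines of Python.
This is a toy: a list of characters stands in for a mutable scheme
string, and the StringPort class and its copy flag are hypothetical.

```python
class StringPort:
    """Toy string port that either copies or shares its backing store."""

    def __init__(self, chars: list, copy: bool):
        self._chars = list(chars) if copy else chars
        self._pos = 0

    def read_char(self) -> str:
        c = self._chars[self._pos]
        self._pos += 1
        return c


def second_char(copy: bool) -> str:
    s = list("abcdef")        # mutable stand-in for the string
    p = StringPort(s, copy)
    p.read_char()
    s[1] = "d"                # corresponds to (string-set! s 1 #\d)
    return p.read_char()


with_copy = second_char(copy=True)
with_sharing = second_char(copy=False)
```

Under the copying model the second read sees the original character.
Under the sharing model it sees the mutation.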

And of course, matters get *really* fun when we consider WRITE-BYTE:

  (let ((s (open-output-string)))
    (write-byte 255 s)
    ...
    (get-output-string s))

Unless the sequence of written bytes happens to lead to valid character
encodings, it is unclear what get-output-string can safely return here.
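
A small Python check of the same situation, assuming the string port
would interpret its bytes as UTF-8 (an assumption; the example above
does not name an encoding):

```python
written = bytes([255])   # the single byte written to the port

try:
    text = written.decode("utf-8")
except UnicodeDecodeError:
    text = None          # 0xFF can never appear in well-formed UTF-8

# A lossy fallback: substitute U+FFFD for the undecodable byte.
replaced = written.decode("utf-8", errors="replace")
```

Strict decoding fails outright, and the lossy fallback returns a string
that no longer round-trips to the byte that was written.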

* Mixing Bytes and Characters: Implementation Issues

Assume, for the moment, a UTF-8 encoding for characters. Future scheme
implementations might support other encodings, but this one is certainly
one that should be able to work well.

If we have a resynchronization mechanism, and the implementation chooses
to support the ISO-10646 legacy character planes, then the
implementation must provide up to 11 bytes of "push back" (6 good ones
and 5 bad ones). In the absence of legacy planes, I believe (but check
me) that 7 bytes of push-back is sufficient.
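
One reading of this arithmetic, written out in Python. It assumes the
worst case is a run of orphaned continuation bytes (one fewer than the
longest sequence) followed by one complete longest-length character.
The function name is hypothetical.

```python
def pushback_bytes(max_seq_len: int) -> int:
    """Worst-case lookahead under the stated assumption."""
    stray = max_seq_len - 1        # "bad" bytes left over from a split character
    return max_seq_len + stray     # plus one full "good" character


legacy = pushback_bytes(6)   # original UTF-8 allowed sequences of up to 6 bytes
modern = pushback_bytes(4)   # UTF-8 restricted to U+10FFFF needs at most 4
```

Under that assumption the two figures come out as 6 + 5 and 4 + 3.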

The classic implementation of PEEK-CHAR in C has been

	ungetc(getc(f), f)

which is pleasant, because it avoids the need to re-implement STDIO from
the ground up. This is also pleasant because it is possible to share the
"file" abstraction with native code. But ungetc guarantees only a single
character of pushback, so if 7 or 11 bytes of pushback are required,
implementations are now forced to rebuild STDIO.

The problem doesn't exist in the same way when bytes and characters are
not mixed, because wide-character implementations of STDIO already
exist, and have already extended the idea of push-back into wide
character sets. Note, however, that these implementations *rely* on a
modal separation between binary file descriptors and text file
descriptors.

* Implementation Experience

One or two of you may have noticed that I've taken over tinyscheme. I've
been working on various modernizations and fixes, among them the
introduction of Unicode character sets. Tinyscheme is primarily a
scripting language, and so will deviate from scheme in some respects,
but I'm trying to keep it as close as I can within the requirements of
sensible and robust scripting practices.

In the course of this, I decided to see if I could come up with a
reasonable implementation in which READ-BYTE and READ-CHAR both behave
sensibly when applied to the same port. It is a complete mess.

A clean implementation does not always imply a clean semantics, but a
horrific implementation executed by a competent programmer (though
perhaps I flatter myself) is usually a sign of a poor design choice.

* A Caution

Something I said earlier deserves stronger emphasis: not all character
sets have the ability to resynchronize!

Unicode, I promise you, will not be the last character set in the
universe. It would be a great misfortune to tie the definition of the
language to an attribute (re-synchronizability) that future character
set designers are likely to get wrong.

Summary:

We need to add read-byte, write-byte, and friends, but we should firmly
segregate character ports and byte ports. Byte ports should NOT support
object I/O (READ, WRITE, DISPLAY) or READ-CHAR. The atomic unit of
transfer in a byte port should be the byte. The atomic unit of transfer
in "classic" ports should be the character.
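
As an analogy only, and not a proposal for the SRFI's API, Python's
standard io module happens to draw the line this way: in-memory text
streams deal in characters, in-memory binary streams deal in bytes, and
crossing over is an error.

```python
import io

char_port = io.StringIO("abc")   # unit of transfer: the character
byte_port = io.BytesIO(b"abc")   # unit of transfer: the byte

c = char_port.read(1)
b = byte_port.read(1)

try:
    io.StringIO().write(b"x")    # bytes offered to a character stream
    mixed_allowed = True
except TypeError:
    mixed_allowed = False
```

With the two kinds of port kept apart, none of the resynchronization
questions raised earlier can arise.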

shap