Re: Issues with Unicode Jonathan S. Shapiro 09 May 2006 16:52 UTC

[By the way, my profound apologies for failing to update the subject

On Tue, 2006-05-09 at 11:32 -0400, John Cowan wrote:
> A few nits to an otherwise well-reasoned argument:
> Jonathan S. Shapiro scripsit:
> > A note: I'm assuming in all of this that scheme will move to an
> > international character set. The problems I am about to discuss do not
> > manifest in a system implementing only a 7-bit or 8-bit character set.
> But they do manifest quite well in 16-bit and 24-bit national character
> sets, so even avoiding Unicode doesn't avoid the problem.

I agree completely.

> > We need to add read-byte, write-byte, and friends, but we should firmly
> > segregate character ports and byte ports. Byte ports should NOT support
> > object I/O (in the form of READ/WRITE/DISPLAY, nor READ-CHAR). The
> > atomic unit of transfer in a byte port should be the byte. The atomic
> > unit of transfer in "classic" ports should be the character.
> I agree absolutely, and would add:
> We need standard procedures that take a byte port and a representation of
> an encoding and return a character port.

I agree that this would be nice to have, but I think that the presence
of PEEK-BYTE and PEEK-CHAR makes this problematic because of the need
for multibyte lookahead. Further, I don't think that this can be
implemented correctly as a non-primitive mechanism. Here are the issues
that I see:

1. Can you suggest a feasible implementation that does not demand 7
bytes of pushback on the byte port?

2. If the character port is an overlay on the byte port, then problems
will arise in concurrent implementations. It will become necessary for
the character port implementation to obtain a lock on the byte port so
that no calls to READ-BYTE or PEEK-BYTE from a second thread are allowed
to interleave.

The second point has exceptionally unpleasant consequences if the reader
in the lock-holding thread manages to exhaust heap space without
completing the operation.
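
To make the lookahead issue concrete, here is a minimal sketch, written
in Python rather than Scheme purely for illustration. Every name in it
(BytePort, CharPort, utf8_length) is hypothetical and not part of any
proposal in this thread, and it assumes the modern four-byte UTF-8
limit. It shows that a PEEK-CHAR built on top of a byte port has to
consume a whole multibyte sequence, so the underlying byte port is no
longer positioned where a PEEK-BYTE caller would expect.

```python
class BytePort:
    """Hypothetical byte port with a single byte of lookahead."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0

    def peek_byte(self):
        # Return the next byte without consuming it, or None at end of input.
        return self._data[self._pos] if self._pos < len(self._data) else None

    def read_byte(self):
        b = self.peek_byte()
        if b is not None:
            self._pos += 1
        return b


def utf8_length(lead: int) -> int:
    """Sequence length implied by a UTF-8 lead byte (assumes 4-byte limit)."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {lead:#x}")


class CharPort:
    """Hypothetical character port overlaid on a BytePort."""

    def __init__(self, byte_port: BytePort):
        self._bytes = byte_port
        self._pending = None  # decoded character that has been peeked only

    def peek_char(self):
        if self._pending is None:
            lead = self._bytes.read_byte()
            if lead is None:
                return None
            raw = bytearray([lead])
            # Peeking one character consumes every byte of its encoding.
            for _ in range(utf8_length(lead) - 1):
                b = self._bytes.read_byte()
                if b is None:
                    raise ValueError("truncated UTF-8 sequence")
                raw.append(b)
            self._pending = raw.decode("utf-8")
        return self._pending

    def read_char(self):
        ch = self.peek_char()
        self._pending = None
        return ch
```

In this sketch the character port keeps the peeked character in its own
buffer, so after a PEEK-CHAR the byte port has already moved past the
whole sequence. Keeping the two views consistent would require either
multibyte pushback on the byte port or treating the byte port as
inaccessible once it has been wrapped.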

In addition to this, there is another issue: we should not inadvertently
rule out embedded Scheme implementations. Realizing your desire implies
that the Scheme runtime must carry some *very* large compiled-in tables.
Other cases might be omitted from a given implementation, but the
proposal to support UTF-8 encoded Unicode drags in many *megabytes*.

This is not an issue with character ports per se, but it *is* an issue
raised by READ and case-insensitive symbol name matching. The downcasing
(or, if preferred, upcasing) rule tables are several megabytes.

My preference would be to resolve this by declaring that R6RS is going
to make a break and use case-sensitive symbol matching, but this will
undoubtedly provoke holy wars on both sides. In keeping with this, I
would actually like to remove the -ci- comparison operations from the
core and relocate these to a library.

So: if your proposal is to be implemented, I think that it should be in
a library, not in the core, and I think it demands some consideration of
a reconciliation of multithreading and PEEK-CHAR/PEEK-BYTE.

My opinion on that: don't reconcile them, acknowledge that the use case
in which the byte port will remain accessible is rare, and leave people
who are engaged in multithreading to implement their own
thread-respecting wrappers around raw byte ports.
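
As a rough illustration of what such a user-written wrapper might look
like, here is a small sketch, again in Python rather than Scheme. The
names RawBytePort, LockedBytePort and exclusive are hypothetical, and
the use of a re-entrant lock is an assumption of the sketch, not
something proposed in this thread.

```python
import threading


class RawBytePort:
    """Hypothetical unsynchronized byte port over an in-memory buffer."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0

    def peek_byte(self):
        return self._data[self._pos] if self._pos < len(self._data) else None

    def read_byte(self):
        b = self.peek_byte()
        if b is not None:
            self._pos += 1
        return b


class LockedBytePort:
    """Hypothetical wrapper that serializes access to a raw byte port."""

    def __init__(self, raw):
        self._raw = raw
        # Re-entrant, so a thread holding exclusive() can still call
        # read_byte() and peek_byte() without deadlocking on itself.
        self._lock = threading.RLock()

    def read_byte(self):
        with self._lock:
            return self._raw.read_byte()

    def peek_byte(self):
        with self._lock:
            return self._raw.peek_byte()

    def exclusive(self):
        """Return the lock so a caller can hold it across several calls."""
        return self._lock
```

A caller that needs a peek followed by a read to be atomic would wrap
both calls in `with port.exclusive():`. In Python the `with` statement
releases the lock even if the body raises, which is the property a
Scheme wrapper would have to arrange for itself.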

Finally, do *not* allow the standard input and output ports to be byte