Re: the "Unicode Background" section John.Cowan 22 Jul 2005 21:56 UTC

Thomas Lord scripsit:

> I think it might be realistic to label ports not with
> the encoding scheme they want, but with the set of
> code-values they can transmit -- in other words
> with their framing constraints.   In other words --
> a "UTF-8 port" (no such thing, really) and an "ASCII port"
> (no such thing, again) are *really* just "8-bit ports".
> A "UTF-16 port" is *really* just a "16-bit port".

The difficulty here is that an ISO-8859-1 port {produces,accepts} a
different set of characters from an ISO-8859-2 port.  Unless a port is
labeled with an encoding, you can't know what characters it will and
won't {accept,produce}, and you are stuck with some system default.
Even a 16-bit port behaves differently depending on whether it is
a UTF-16 port, a UTF-16LE port, or a UTF-16BE port.
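
To make the point concrete, here is a small Python sketch (Python's codec names are used purely for illustration; nothing here is Scheme API). It shows a single octet decoding to different characters under ISO-8859-1 and ISO-8859-2, and a single character producing different octets under the three UTF-16 variants.

```python
# One octet, two encodings: which character you get depends on the label.
octet = bytes([0xA1])
as_latin1 = octet.decode("iso-8859-1")  # U+00A1 INVERTED EXCLAMATION MARK
as_latin2 = octet.decode("iso-8859-2")  # U+0104 LATIN CAPITAL LETTER A WITH OGONEK

# One character, three 16-bit encodings: the octets on the wire differ.
ch = "\u0104"
utf16_le = ch.encode("utf-16-le")  # low-order octet first, no byte order mark
utf16_be = ch.encode("utf-16-be")  # high-order octet first, no byte order mark
utf16 = ch.encode("utf-16")        # byte order mark, then platform byte order
```

A bare "8-bit" or "16-bit" label would not distinguish any of these cases.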

I'm not saying that any Scheme system has to accept every possible
encoding (though I do think at least ASCII, UTF-8, and UTF-16 should
be mandatory; they are all trivial), but it needs to be possible
to specify the encoding of a port when it is created.  (I don't think
it's necessary to be able to change it on the fly, though.)
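
As a rough illustration of "encoding fixed at creation", here is a minimal Python sketch. The names `open_output_port` and `written_octets` are hypothetical helpers invented for this example, not part of any Scheme system or proposal; the port is modeled with Python's `io.TextIOWrapper` over an in-memory byte sink.

```python
import io

def open_output_port(encoding):
    """Hypothetical helper: a text port whose encoding is set once, at creation."""
    sink = io.BytesIO()
    # errors="strict" rejects unencodable characters; newline="" disables
    # newline translation so the octets reflect the encoding alone.
    port = io.TextIOWrapper(sink, encoding=encoding, errors="strict", newline="")
    return port, sink

def written_octets(encoding, text):
    """Write text to a fresh port and return the octets that reached the sink."""
    port, sink = open_output_port(encoding)
    port.write(text)
    port.flush()
    return sink.getvalue()
```

With this sketch, the same character written to ports created with different encodings yields different octets, and the caller never names the encoding again after creating the port.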

> At the same time, several of us agree that WRITE-CHAR
> should accept a CHAR argument which is, in essence, a
> codepoint.

In which case it is the output port's encoding that says what octets
to write.

> I think an implementation should be permitted to have a
> version of WRITE-CHAR which is not total for all PORT,
> CHAR pairs:  try to write a wide character on an 8-bit
> port and that's an error, etc.

Absolutely.  Or more specifically: an attempt to write a character that's
not in the repertoire associated with the port's encoding is an error.
Allowing this to be lax is just asking for trouble.
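
A minimal sketch of such a strict, non-total write, again in Python and under the assumption that a port is identified simply by its encoding name. The function `write_char` here is a hypothetical stand-in for WRITE-CHAR and returns the octets rather than writing them.

```python
def write_char(port_encoding, ch):
    """Return the octets for ch, or raise if ch is outside the repertoire."""
    if len(ch) != 1:
        raise ValueError("write_char expects exactly one character")
    # errors="strict" makes an unencodable character raise UnicodeEncodeError
    # instead of being silently replaced or dropped.
    return ch.encode(port_encoding, errors="strict")
```

Under this sketch, writing U+0104 succeeds on an ISO-8859-2 port and raises on an ISO-8859-1 port, which is the non-lax behavior described above.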

Given that, it's easy to create a higher-level abstraction that will
{write,read} otherwise impossible characters (those outside the port's
repertoire) with some encoding scheme.
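
One possible shape for the writing half of such a layer, sketched in Python. The escape syntax `\u{...}` and both function names are assumptions made up for this example; the only requirement is that the strict primitive underneath raises on characters outside the repertoire.

```python
def write_char_strict(encoding, ch):
    """Low-level primitive: raises UnicodeEncodeError outside the repertoire."""
    return ch.encode(encoding, errors="strict")

def write_char_escaped(encoding, ch):
    """Higher-level layer: fall back to an escape for unencodable characters."""
    try:
        return write_char_strict(encoding, ch)
    except UnicodeEncodeError:
        # Hypothetical escape syntax: backslash, 'u', hex code point in braces.
        # Assumes the escape's own characters are in the port's repertoire.
        escape = "\\u{%X}" % ord(ch)
        return write_char_strict(encoding, escape)
```

A reader that reverses the scheme would also need literal backslashes to be escaped on output so the result is unambiguous; that part is omitted here.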

--
Some people open all the Windows;       John Cowan
wise wives welcome the spring           xxxxxx@reutershealth.com
by moving the Unix.                     http://www.reutershealth.com
  --ad for Unix Book Units (U.K.)       http://www.ccil.org/~cowan
        (see http://cm.bell-labs.com/cm/cs/who/dmr/unix3image.gif)