Re: Issues with Unicode

Re: Issues with Unicode Shiro Kawai 10 May 2006 02:28 UTC
Gauche has supported mixed binary/character ports, as
well as multiple character encoding schemes, for several
years now, and I have gained some experience from it.

I agree that mixing them is a semantic mess.  The fundamental
issue here is that the port is an interface between the Scheme
world and the external world, and the external world IS a mess.
If you've written an application to process emails, collect
documents from the web, or search documents on your hard disk,
you must know it: in general, you are given a file, and you
don't know which encoding it is in until you actually read
the content and examine it (you have to try several heuristics,
and sometimes even need to guess).

The choice is either to make ports handle the mess, or to break
up the concept of ports into layers and allow mess-handling
code to be inserted between them.  I understand srfi-68 tried
the latter, which is cleaner IMHO, but has an efficiency issue.
Gauche chose the former (i.e. the native ports support char/binary
mixed I/O, as well as various codec conversions) because of
efficiency.

As aims differ among implementations, I don't want the basic
srfi to force an implementor to take a specific implementation
strategy.  Mandating char/binary mixed ports would cause difficulty
in some implementations.  OTOH, mandating strict char/binary port
separation would also cause efficiency issues in some
implementations (see below for why one would want such mixed ports).

So, I would like a basic srfi that doesn't mandate, but allows,
char/binary mixed ports.  The srfi can have a procedure that
may create a binary port from a char port, which can be something
like this:

  (char-port->binary-port <char-port> [<encoding>])
      => <binary-port>

If the implementation supports char/binary mixed ports and
<encoding> matches its internal encoding, it can return
<char-port> itself, avoiding the overhead.
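
To make the fast path concrete, here is a rough sketch. It is written
in Python rather than Scheme only so that it can be run as-is, with
io.TextIOWrapper standing in for a char port. The function name
mirrors the procedure proposed above; treating an omitted encoding as
"the port's own encoding" and re-encoding eagerly on the slow path are
my assumptions, not part of any srfi. It also assumes nothing has been
read from the char port yet.

```python
import codecs
import io


def char_port_to_binary_port(char_port, encoding=None):
    """Hypothetical analogue of (char-port->binary-port port [encoding])."""
    native = codecs.lookup(char_port.encoding).name
    # Assumption: an omitted encoding means "whatever the port already uses".
    wanted = native if encoding is None else codecs.lookup(encoding).name
    if wanted == native:
        # Fast path: the bytes underneath are already in the requested
        # encoding, so hand back the underlying byte stream unchanged.
        return char_port.buffer
    # Slow path (simplified): decode everything, then re-encode.
    # A real implementation would convert incrementally.
    return io.BytesIO(char_port.read().encode(wanted))
```

Only the slow path pays for a conversion; the fast path returns an
object that already exists.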

One use case for such a mix is to read a document with the
following characteristics: you know the beginning of the
document consists of ASCII characters, which might contain
an explicit specification of the character set of the following
content.  If the beginning part doesn't contain a character
set spec, you use some default encoding which is an extension
of ASCII.

A large number of documents out there fall into this category:
all XML and HTML documents, for example (as for HTML, I mean a
line like
<meta http-equiv="content-type" content="text/html; charset=euc-jp">).
Gauche also allows Scheme program source to contain a charset
spec in a comment near the beginning of the program.

If you can mix char/binary ports, you can create a port with a
default encoding and read the beginning part.  If it doesn't
have an explicit spec, which is presumably the typical case, you
can just use the port without overhead; if it has an explicit
char-set spec, you can insert a codec which reads the rest
of the input as binary and decodes it.
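
The workflow above can be sketched roughly as follows. Again this is
Python rather than Scheme, only so it is runnable, and it is not how
Gauche implements it. The function name, the "charset=" pattern, the
UTF-8 default and the five-line limit are all assumptions made for
illustration.

```python
import io
import re

# Illustrative pattern: matches charset=NAME as in the <meta> line above.
CHARSET_RE = re.compile(rb"charset=[\"']?([A-Za-z0-9._-]+)", re.IGNORECASE)


def read_with_sniffed_charset(binary_port, default="utf-8", max_lines=5):
    """Return (prologue_text, rest_port, encoding).

    The prologue is consumed as bytes; the rest of the input is then
    read through a codec chosen from what the prologue declared.
    """
    prologue = []
    encoding = default
    for _ in range(max_lines):
        line = binary_port.readline()
        if not line:
            break  # end of input before any charset spec was seen
        prologue.append(line)
        match = CHARSET_RE.search(line)
        if match:
            encoding = match.group(1).decode("ascii")
            break
    # The chosen encoding is assumed to be an extension of ASCII, so it
    # also decodes the ASCII prologue correctly.
    head = b"".join(prologue).decode(encoding)
    # "Insert the codec": everything after the prologue is decoded here.
    rest = io.TextIOWrapper(binary_port, encoding=encoding)
    return head, rest, encoding
```

For input whose first line is the <meta> line quoted earlier, the
returned port decodes the remaining bytes as EUC-JP; with no spec in
the first few lines, the default is used throughout. An unknown
charset name raises LookupError, which a real program would want to
handle.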

--shiro