Re: the "Unicode Background" section Shiro Kawai 22 Jul 2005 23:54 UTC

From: "John.Cowan" <xxxxxx@reutershealth.com>
Subject: Re: the "Unicode Background" section
Date: Fri, 22 Jul 2005 17:56:00 -0400

> I'm not saying that any Scheme system has to accept every possible
> encoding (though I do think at least ASCII, UTF-8, and UTF-16 should
> be mandatory; they are all trivial), but it needs to be possible
> to specify the encoding of a port when it is created.  (I don't think
> it's necessary to be able to change it on the fly, though.)

Changing encodings in a port may come in handy in a couple of very
practical situations:

- Parsing RFC2822 and/or MIME messages (the header is ASCII,
  and the content's charset is specified in the header)

- Parsing documents that carry an encoding specification near
  their beginning (e.g. <?xml version="1.0" encoding="utf-8"?>,
  or the "coding: utf-8" magic comment that specifies the
  source-code encoding).

Both can be handled by layering ports, i.e. first you use an
ASCII port on top of a binary port to find the necessary info,
then create a new port with the desired encoding on top of the
original binary port to read the content.  You need to be careful
about buffering, though: the first port must not consume bytes
beyond the part it is supposed to read.  And some may dislike the
overhead of layering.  But that's outside the scope of this
discussion.
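
To make the layering idea concrete, here is a minimal sketch in
Python rather than Scheme, with io.BytesIO standing in for the
binary port and io.TextIOWrapper for the encoding port.  The
function name and the header parsing are hypothetical and
simplified (no folded headers, no MIME multipart).  The header is
read line by line straight from the binary port, so nothing beyond
the blank separator line is consumed before the second layer is
created.

```python
import io
import re

def open_body_with_declared_charset(binary_port, default="ascii"):
    """Scan ASCII header lines, then layer a text port over the same bytes.

    Hypothetical helper: returns (charset, text_port).
    """
    charset = default
    while True:
        # readline() on the binary port consumes exactly one line of bytes,
        # so no body bytes are swallowed by read-ahead buffering.
        raw_line = binary_port.readline()
        if raw_line in (b"", b"\r\n", b"\n"):
            break  # end of input, or the blank line that ends the header
        line = raw_line.decode("ascii")
        match = re.search(r'charset="?([A-Za-z0-9._-]+)"?', line, re.IGNORECASE)
        if match:
            charset = match.group(1)
    # Second layer: a text port with the declared encoding over the same bytes.
    return charset, io.TextIOWrapper(binary_port, encoding=charset)

message = (b"Subject: test\r\n"
           b"Content-Type: text/plain; charset=utf-8\r\n"
           b"\r\n" + "caf\u00e9\n".encode("utf-8"))
charset, body = open_body_with_declared_charset(io.BytesIO(message))
body_text = body.read()
```

Had the header been read through a buffered ASCII text port
instead, that port could have read ahead into the body, which is
the buffering problem mentioned above.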

> Absolutely.  Or more specifically: attempt to write a character that's
> not in the repertoire associated with the encoding is an error.
> Allowing this to be lax is just asking for trouble.

I mentioned some other options in my reply to Tom Lord, but
there's one practical example:

Suppose I have a dynamic website that stores Unicode documents.
My CGI script uses a CES-conversion port for its output so that
it can send out each document in the CES specified by the web
browser.  When an ISO-8859-1 browser asks for content that has
Chinese characters, it won't be very useful if the CGI script
sends an error page.  Usually replacing unmappable characters
with '?' or something similar would be better.
(Again, this can be done by smart error handlers that do something
user-friendly on an 'encoding not supported' error.  It is much
handier if the port can handle it, though.)
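
As an illustration of a port that handles this itself, here is a
minimal sketch in Python rather than Scheme.  io.TextIOWrapper
takes an errors argument that selects strict or replacing
behaviour for unmappable characters.  The render function and the
sample text are hypothetical.

```python
import io

def render(document, encoding, errors):
    """Write document through an encoding port and return the bytes produced."""
    sink = io.BytesIO()
    # newline="" disables newline translation so the bytes are predictable.
    port = io.TextIOWrapper(sink, encoding=encoding, errors=errors, newline="")
    port.write(document)
    port.flush()
    data = sink.getvalue()
    port.detach()  # leave the underlying binary port open
    return data

document = "Hello \u4e2d\u6587"  # "Hello" followed by two Chinese characters

# Lenient port: each unmappable character is written as '?'.
lenient = render(document, "iso-8859-1", "replace")

# Strict port: the same write raises an error instead.
try:
    render(document, "iso-8859-1", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

With the replacing port the browser receives b"Hello ??"; with the
strict port the script has to catch the error and decide what to
send instead.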

--shiro