Re: Mixing characters and bytes Shiro Kawai 15 Sep 2005 03:51 UTC

From: Michael Sperber <xxxxxx@informatik.uni-tuebingen.de>
Subject: Re: Mixing characters and bytes
Date: Tue, 13 Sep 2005 14:19:19 +0200

> Turns out I was wrong.  Switching encodings in the middle of a
> buffered data stream (in the general sense) is, AFAICS, very costly:
> You generally want to transcode text in chunks for efficiency.  This
> means that you'll typically transcode ahead of what the program has
> actually requested.  Now, switching encodings means going back to the
> place where you actually stopped requesting data, which means
> retracing your steps from the beginning of the last transcoding step.
> This would complicate the interface for defining translators
> considerably, and still leaves some border cases uncovered.

I basically agree.
In general, unless you switch from a basic encoding (binary
or ASCII) to a more elaborate one (utf-8, euc-jp, etc.),
you have to tell the transcoder where to stop beforehand; otherwise
the transcoder may read into a region that contains
an illegal byte sequence and raise an error (maybe not in srfi-68,
but a low-level routine such as iconv does).
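
To make the failure mode concrete, here is a minimal sketch in Python
(not Scheme, and not the srfi-68 API). The sample data and the helper
names decode_whole_chunk and decode_bounded are hypothetical, made up
for illustration. It assumes an EUC-JP text region followed directly by
raw binary data in the same buffer.

```python
# Hypothetical buffered chunk: EUC-JP text followed directly by raw binary.
text_part = "日本語".encode("euc_jp")
binary_part = bytes([0xFF, 0xFE, 0x00, 0x80])  # not valid EUC-JP
stream = text_part + binary_part


def decode_whole_chunk(data):
    """Transcode the whole chunk, as a read-ahead transcoder would."""
    return data.decode("euc_jp")


def decode_bounded(data, stop):
    """Transcode only up to a stop offset that is known beforehand."""
    return data[:stop].decode("euc_jp")


try:
    decode_whole_chunk(stream)
    read_ahead_failed = False
except UnicodeDecodeError:
    # The transcoder ran into the binary region and rejected it.
    read_ahead_failed = True

# Telling the transcoder where to stop avoids touching the binary region.
bounded_text = decode_bounded(stream, len(text_part))
```

The bounded call only works because the caller already knows where the
text region ends, which is exactly the information a transcoder that
reads ahead does not have.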

One practical example is reading a MIME-encoded message.
You may switch encodings, for example:
ASCII -> euc-jp -> ASCII -> binary -> ASCII -> utf-8.
What I'm doing currently (in Gauche) is to layer ports; that is,
I create a port that reads from the original port, but stops reading
and returns EOF when it encounters a MIME boundary.  From that
port I read the MIME header as ASCII, then layer the
encoding-conversion port if I need one.  This is conceptually
simple, but the drawback is that I need to buffer the data
multiple times.
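
The layering described above can be sketched roughly as follows. This is
Python rather than Gauche, and it is only an illustration under stated
assumptions: the class BoundaryLimitedReader, the delimiter "--frontier",
and the sample message are all hypothetical, and real MIME parsing
(closing delimiters, nested parts, transfer encodings) is ignored.

```python
import io


class BoundaryLimitedReader(io.RawIOBase):
    """Reads lines from a binary source, reporting EOF at a delimiter line."""

    def __init__(self, source, delimiter):
        self._source = source        # underlying binary stream
        self._delimiter = delimiter  # full delimiter line, e.g. b"--frontier"
        self._pending = b""          # rest of the current line, not yet returned
        self._done = False

    def readable(self):
        return True

    def readinto(self, buf):
        if not self._pending and not self._done:
            line = self._source.readline()
            if not line or line.rstrip(b"\r\n") == self._delimiter:
                self._done = True    # delimiter or real EOF: report EOF from now on
            else:
                self._pending = line
        n = min(len(buf), len(self._pending))
        if n == 0:
            return 0                 # 0 bytes signals EOF to the upper layers
        buf[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n


# Hypothetical message: ASCII header, EUC-JP body, delimiter, binary data.
raw = io.BytesIO(
    b"Content-Type: text/plain; charset=euc-jp\n"
    b"\n"
    + "日本語\n".encode("euc_jp")
    + b"--frontier\n"
    + b"\xff\xfe binary part\n"
)

part = io.BufferedReader(BoundaryLimitedReader(raw, b"--frontier"))
header = part.readline().decode("ascii")   # read the header as ASCII
part.readline()                            # skip the blank separator line
text_port = io.TextIOWrapper(part, encoding="euc_jp")  # layer the conversion
body = text_port.read()                    # stops at the delimiter
rest = raw.read()                          # binary data is left untouched
```

Note that the same bytes pass through the BytesIO, the BufferedReader,
and the TextIOWrapper in turn, which mirrors the multiple-buffering cost
mentioned above.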

I suppose, in srfi-68, I can do a similar thing by layering streams.
I don't think any single stream should switch encodings internally.

I haven't learned the stream layer API well enough to tell whether
it'd be any simpler and more efficient than layering ports, though.

--shiro