Mixing characters and bytes Per Bothner (24 Aug 2005 07:24 UTC)
There are a number of problems in mixing text and binary I/O on the same port objects, which I won't repeat here. On the other hand, doing so is sometimes useful. Two use-cases:

* A binary file will often contain strings.

* A text file may be self-describing in terms of the encoding it uses.
  In the case of XML files, see:
  http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing
  In that case you start out reading bytes until you can determine the
  encoding, at which point you switch to reading characters.

Note that neither use-case is supported by SRFI-68 unless you stick to UTF-8 or go down into the stream level.

So how can we specify mixed character/byte ports?

First, if we allow both characters and bytes on a port, then a strict byte-port vs. character-port separation may be undesirable. However, note that the use-cases above do not preclude such a separation. The case of a binary file containing strings can be supported by routines that convert from a blob to a string and vice versa. The second case can be handled by opening a text port on top of the "rest of" (tail of) a binary port.

Second, some ports are text-only, in the sense that they cannot meaningfully support byte operations. This includes the string ports specified by SRFI 6.

For output ports we can handle mixed character/byte output fairly cleanly. Underlying the port is a pure byte port (sink). The port can be in either character or byte mode, and starts out in byte mode. In byte mode, byte data is written directly to the byte sink. A character operation in byte mode creates a character buffer and switches to character mode. Subsequent character operations append to the character buffer. A byte operation *or* closing the port while in character mode causes the characters in the character buffer to be encoded and then written to the byte sink.

In Java one can implement this model fairly efficiently using the standard character-to-byte encoding machinery.
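A minimal Java sketch of this output model (the class and method names here are my own, not from any SRFI; it assumes a non-stateful encoding, since the character buffer is encoded in one shot each time we leave character mode):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;

// Sketch of a mixed character/byte output port: bytes go straight to
// the underlying sink; characters accumulate in a buffer that is
// encoded and flushed to the sink on the next byte operation or on close.
class MixedOutput {
    private final OutputStream sink;   // the underlying pure byte sink
    private final Charset charset;
    private StringBuilder charBuffer;  // non-null only in character mode

    MixedOutput(OutputStream sink, Charset charset) {
        this.sink = sink;
        this.charset = charset;
    }

    void writeByte(int b) throws IOException {
        flushChars();                  // leave character mode, if active
        sink.write(b);
    }

    void writeChar(char c) {
        if (charBuffer == null)        // byte mode -> character mode
            charBuffer = new StringBuilder();
        charBuffer.append(c);
    }

    // Encode any buffered characters and return to byte mode.
    private void flushChars() throws IOException {
        if (charBuffer != null) {
            sink.write(charBuffer.toString().getBytes(charset));
            charBuffer = null;
        }
    }

    void close() throws IOException {
        flushChars();
        sink.close();
    }
}
```

For example, writing byte 0x01, then the characters "hi", then byte 0x00 produces the four bytes 01 68 69 00 under UTF-8: the character buffer is encoded exactly when the second byte operation arrives.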
It can handle arbitrary encodings using the standard Java machinery (and doesn't need the relatively new JDK 1.4 encoding APIs). It loses state when switching from text to binary and back again, but that's a minor issue: people don't do that very much, nor are stateful encodings very common anyway.

Reading is trickier. I can switch from reading bytes to reading characters fairly easily and robustly, assuming I have a function to create a character port from a byte port (as Java does). But switching from reading characters to reading bytes is difficult, because the byte->character decoder might have read ahead or have bytes buffered.

However, I don't think this is a big deal. It is only a factor for binary files that contain strings. In a sane format, the string will either be preceded by a count or be followed by a delimiter, such as a nul byte. In that case you can extract the string as a byte array, and then convert it. (Conceptually, you write it to a temporary file and then read that as a character file.) Buffering isn't a problem if we're using a non-stateful encoding *and* we do our own decoding.

A suggestion: switching from character mode to byte mode is invalid unless the encoding was *explicitly* specified as UTF-8. (It's not enough for the encoding to default to UTF-8, since in that case we might be using the system translator, which might be doing buffering.)

Switching from one encoding to another is similar to switching from text to binary to text mode. I suggest a function:

  (set-port-encoding! port encoding-name)

--
	--Per Bothner
xxxxxx@bothner.com   http://per.bothner.com/
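The count-or-delimiter approach can be sketched in Java as follows (again my own illustration, not from the SRFI): since we read the raw bytes ourselves and only decode the extracted array, there is no decoder read-ahead, and the byte stream stays positioned right after the string.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

class MixedInput {
    // Read bytes up to (and consuming) a nul delimiter, then decode
    // them as a string.  The underlying byte stream is left positioned
    // immediately after the delimiter, so byte reading can resume.
    static String readNulTerminated(InputStream in, Charset cs)
            throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) > 0)    // stop at nul (0) or EOF (-1)
            buf.write(b);
        return new String(buf.toByteArray(), cs);
    }
}
```

Given the bytes 68 69 00 7F, this yields the string "hi", and the next byte read from the stream is 0x7F, i.e. binary reading continues exactly where the string ended.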