General comments on SRFIs 79-82
Marcin 'Qrczak' Kowalczyk (27 Nov 2005 16:39 UTC)
I don't like the separation into readers/writers, streams, and ports. Too many similar concepts are treated as completely disjoint types. As I understand it:

- readers/writers deal with physical I/O of blocks of bytes
- streams provide encoding and newline conversion, buffering, and scanning the same input multiple times
- ports provide raw I/O of bytes, can convert UTF-8 to characters, and can convert between characters and lines, or characters and Scheme external representations

This feels like a single package. There should be some overall description of the whole design somewhere, so one doesn't have to dig into four separate SRFIs.

I don't quite understand the rationale for using UTF-8 as the intermediate format. For mixing textual and binary I/O (if the encoding is not known to be UTF-8) one has to put and remove a converter dynamically on every switch, and it's incompatible with block conversion of input (it must be converted one character at a time, unless we can find the boundary between text and binary data by looking at the raw stream before the conversion).

The EOL style doesn't include the possibility of accepting any of the three common conventions, an option used by Java and probably by .NET. Since on classic Macintosh Perl (and perhaps C too; I haven't checked) exchanges the meanings of \n and \r (by actually changing their interpretation in the source instead of recoding), I guess it would be more useful for the CR style to exchange them when recoding, instead of treating either as a newline on input and writing a newline for either on output.

StdIn and StdOut can be seekable, and this is sometimes useful (e.g. Unix "wc" makes use of it). The reference implementation doesn't allow that.

I don't understand input-string. How much does it read?

When reading from ports, it's not specified what happens when data are not valid UTF-8. Similarly for decoding from e.g. UTF-16 (unpaired surrogates) or UTF-32 (too large values), or for encoding to Latin-1 (characters above U+00FF).

* * *

Since I have similar goals in designing I/O for my language Kogut, I might be biased by not liking any solution other than my own. Anyway, here is the overall design of mine, for comparison.

All three of your layers are served by a single concept of streams with varying capabilities. Streams are classified into input streams and output streams (some might be bidirectional), and independently into byte streams and character streams. There are many kinds of streams, but there is only one conceptual layer of interfaces.

By "arrays" below I mean arrays from my language, which have mutable size and can play the role of queues or buffers. Removing or adding a small part at the beginning or end has good amortized cost. Among others there are byte arrays and char arrays.

All input streams support reading a block by appending it into a given array, up to a given maximum size; they may read less if reading more would block. All output streams support writing by cutting the beginning part from an array, up to a given maximum size; it might actually transfer less.
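To make the block interface concrete, here is a minimal sketch. It is written in Python purely for illustration, and all the names in it (InputStream, OutputStream, read_block, write_block, RawFileInput, RawFileOutput, copy) are made up for this sketch; they are not the actual Kogut API, nor anything defined by SRFIs 79-82. It shows input as appending into a growable buffer, output as cutting from the buffer's front, and a copy loop that uses a single array as a queue between the two.

class InputStream:
    def read_block(self, buf, max_size):
        """Append up to max_size bytes to buf; return how many bytes were
        appended (0 means end of stream).  May append fewer than max_size
        if reading more would block."""
        raise NotImplementedError


class OutputStream:
    def write_block(self, buf, max_size):
        """Cut up to max_size bytes from the front of buf, write them,
        and return how many were actually transferred (possibly fewer)."""
        raise NotImplementedError


class RawFileInput(InputStream):
    """A 'raw file' input stream: block reads only (assumes a blocking file)."""
    def __init__(self, raw):
        self.raw = raw      # e.g. open(name, "rb", buffering=0)

    def read_block(self, buf, max_size):
        data = self.raw.read(max_size)      # may return fewer bytes than asked
        if not data:
            return 0
        buf += data                         # append at the end of the array
        return len(data)


class RawFileOutput(OutputStream):
    """A 'raw file' output stream (assumes a blocking file)."""
    def __init__(self, raw):
        self.raw = raw      # e.g. open(name, "wb", buffering=0)

    def write_block(self, buf, max_size):
        n = self.raw.write(bytes(buf[:max_size]))   # may write fewer bytes
        del buf[:n]                                 # cut the written prefix
        return n


def copy(src, dst, chunk=8192):
    """Copy src to dst through one array that plays the role of a queue."""
    buf = bytearray()
    while True:
        got = src.read_block(buf, chunk)
        if got == 0 and not buf:
            break                           # input exhausted, buffer drained
        dst.write_block(buf, len(buf))      # whatever is left stays queued

Copying a file would then be copy(RawFileInput(open("in", "rb", buffering=0)), RawFileOutput(open("out", "wb", buffering=0))); a buffered or filtered stream would wrap another stream and keep such an array as its internal buffer, as in the concrete stream types below.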
Here are various concrete stream types:

- Raw file. It basically supports the above block I/O, and optionally seeking (if the underlying OS file does).
- Buffered stream on top of another stream. It supports reading or writing by a single character or line, and automatic flushing after lines (or after any I/O). A buffered input stream supports a subset of the interface of sequences (like indexing or getting subsequences) and thus provides arbitrarily long lookahead; it supports unreading of arbitrary size. It supports seeking if the underlying stream does.
- Byte array or char array. An array is a stream itself, with capabilities similar to a buffered stream. Reading from an array consumes its contents: an array doesn't have a separate current pointer.
- Filtered stream. It supports only block I/O, passing data through a filtering function and maintaining an internal buffer. Here the type of elements can change between bytes and characters. The filtering function moves data from one array to another, and has the luxury of being allowed to leave some final part of the input unconsumed (e.g. an incomplete character).
- An input stream which flushes a given output stream on every input. It is put at the bottom of stdin to flush the top of stdout, or, when making a pair of streams from a socket, to ensure that buffered data are written out before we wait for input on a related stream.
- Various less important streams: reading from another stream or sequence without consuming it, /dev/null, concatenation of inputs (which can be "opened" lazily), and a filtering stream which copies everything passing through it into another stream (like Unix "tee").

You can read a lazy list of characters or lines from a stream, which provides something similar to your streams with respect to statelessness. It is a natural data source for a complex parser.

I haven't described making filters based on an encoding or encoding name, handling recoding errors, non-blocking I/O, or I/O without using the current file pointer (Unix: pread / pwrite).

-- 
   __("<   Marcin Kowalczyk
   \__/    xxxxxx@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/