Re: peek-char problem

peek-char problem Shiro Kawai (18 Jun 2020 03:41 UTC)

Re: peek-char problem Marc Nieper-Wißkirchen (18 Jun 2020 05:52 UTC)

Re: peek-char problem Göran Weinholt (18 Jun 2020 09:03 UTC)

Re: peek-char problem Shiro Kawai (18 Jun 2020 10:13 UTC)

Re: peek-char problem Göran Weinholt (18 Jun 2020 12:25 UTC)

Re: peek-char problem Shiro Kawai (18 Jun 2020 19:09 UTC)

Re: peek-char problem Per Bothner (18 Jun 2020 16:29 UTC)

Re: peek-char problem Shiro Kawai (18 Jun 2020 18:53 UTC)

Re: peek-char problem Göran Weinholt 18 Jun 2020 08:57 UTC

Shiro Kawai <xxxxxx@gmail.com> writes:

> I thought peek-char may not be so important in practice, since parsers
> could carry around a prefetched character. When I was updating the
> reference implementation, however, I noticed it might not be so
> simple.
>
> To implement read-line or read, you do need to lookahead one character
> (In case of read-line, you need to peek after CR is read, for the next
> character may or may not be LF. In case of read, if you're reading an
> identifier, you need to leave the subsequent delimiting character
> other than whitespaces.)
>
> If a custom port can't be passed to read-line or read, its use is
> severely limited.
>
> Am I missing some obvious workaround?

(I haven't gone through this SRFI yet, so what I'm saying is just based
on my experience with implementing R6RS.)

The read and read-line procedures work on a level where they do not see
untranslated newlines. They read from textual input(/output) ports and
any translation from CR LF to #\newline has already happened.

Here's a breakdown of how newlines are handled for each port type:

* Custom binary input port - binary data has no #\newline.

* Custom textual input port - the source directly produces #\newline
  with no transcoding necessary.

* Custom binary output port - binary data has no #\newline.

* Custom textual output port - #\newline is sent directly to the sink.

* Custom binary input/output port - binary data has no #\newline.

* Transcoded binary input port - the transcoder parses newlines
  according to the eol style and converts them to #\newline (none means
  no translation, any other style means all styles are recognized and
  translated to #\newline).

* Transcoded binary output port - the transcoder translates #\newline
  according to the eol style.

* Transcoded binary input/output port - combination of the above.

The peek-char procedure either uses the data already in the port's
buffer or it calls the source to get data into the port's buffer. If the
port is unbuffered then I believe it is still necessary to fill in the
port's buffer with at least one character. An underlying unbuffered
binary port would have one byte at a time consumed until the transcoded
port has a full character.
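
The buffer-filling behaviour described above can be sketched roughly as
follows. This is only an illustration in Python, not code from any R6RS
implementation: the class TranscodedInputPort, its methods, and the
read_byte callback are hypothetical names, UTF-8 is assumed as the codec,
and None stands in for the eof object.

```python
import codecs
import io


class TranscodedInputPort:
    """Hypothetical textual port layered over an unbuffered byte source."""

    def __init__(self, read_byte):
        # read_byte() returns a bytes object of length 1, or b"" at end of file.
        self._read_byte = read_byte
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._buffer = ""  # holds at most one decoded character

    def _fill(self):
        # Pull one byte at a time until the decoder yields a full character.
        while not self._buffer:
            b = self._read_byte()
            if b == b"":
                # End of file: raises UnicodeDecodeError on a truncated sequence.
                self._buffer = self._decoder.decode(b"", final=True)
                return
            self._buffer = self._decoder.decode(b)

    def peek_char(self):
        # Fill the one-character buffer if needed, but do not consume it.
        self._fill()
        return self._buffer[:1] or None

    def read_char(self):
        # Same as peek_char, then drop the character from the buffer.
        self._fill()
        ch, self._buffer = self._buffer[:1], self._buffer[1:]
        return ch or None


# Usage: the two-byte UTF-8 sequence for "ö" is consumed from the source
# by the first peek, and nothing more is read until the port needs it.
source = io.BytesIO("ö!".encode("utf-8"))
port = TranscodedInputPort(lambda: source.read(1))
first = port.peek_char()
```

Note that the port never asks the source for more bytes than the current
character needs, which is what keeps an unbuffered source usable.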

It might seem like transcoders need to look ahead to recognize that CR
LF should be a single #\newline and that this would break unbuffered
ports. But actually the transcoder can get away with just knowing the
previous character. Suppose the input is #vu8(13 10). The first time
peek-char invokes the transcoder it will see CR (13) and return
#\newline. The next time the transcoder is invoked it sees LF (10), but
it remembers that the previous character was CR, so it produces no
output. This works for all the supported eol styles.
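
A rough sketch of that "remember the previous character" trick, again in
Python and purely illustrative: EolTranslator and feed are hypothetical
names, and the sketch assumes it is handed already-decoded characters one
at a time (for the ASCII example above, the bytes 13 and 10 decode to CR
and LF unchanged). It recognizes the R6RS line endings LF, CR, CR LF,
NEL, CR NEL and LS.

```python
CR, LF, NEL, LS = "\r", "\n", "\x85", "\u2028"


class EolTranslator:
    """Hypothetical eol translation that needs no lookahead."""

    def __init__(self):
        self._prev_was_cr = False

    def feed(self, ch):
        """Return the translated character, or "" if ch is swallowed."""
        prev_was_cr = self._prev_was_cr
        self._prev_was_cr = (ch == CR)
        if ch == CR:
            # Emit the newline immediately; do not wait to see what follows.
            return "\n"
        if ch in (LF, NEL):
            # Second half of CR LF or CR NEL: the newline was already emitted.
            return "" if prev_was_cr else "\n"
        if ch == LS:
            return "\n"
        return ch


# Usage: CR LF collapses to a single newline without any lookahead.
translator = EolTranslator()
outputs = [translator.feed(c) for c in "\r\n"]
```

The only state carried between calls is one boolean, so the approach does
not depend on the underlying port being buffered.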

The read-line procedure does not need peek-char because it only needs to
recognize and stop at #\newline. The read procedure does indeed want
peek-char for stopping at delimiters and quite possibly for lexing in
general.
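
To illustrate the difference, here is a small Python sketch (not taken
from any Scheme implementation). StringPort, read_line, read_identifier
and the DELIMITERS set are all hypothetical stand-ins; the port is
assumed to deliver newlines already translated to a single "\n", and None
stands in for the eof object.

```python
class StringPort:
    """Hypothetical textual input port over a string, for illustration."""

    def __init__(self, text):
        self._text = text
        self._pos = 0

    def peek_char(self):
        # Return the next character without consuming it, or None at eof.
        return self._text[self._pos] if self._pos < len(self._text) else None

    def read_char(self):
        ch = self.peek_char()
        if ch is not None:
            self._pos += 1
        return ch


def read_line(port):
    # Only read_char is needed: the newline itself is consumed and dropped.
    chars = []
    while True:
        ch = port.read_char()
        if ch is None:
            return "".join(chars) if chars else None
        if ch == "\n":
            return "".join(chars)
        chars.append(ch)


# A simplified delimiter set, assumed for this sketch only.
DELIMITERS = set(" \t\n()\";")


def read_identifier(port):
    # peek_char is needed here: the delimiter must stay in the port
    # so that the next read still sees it.
    chars = []
    while True:
        ch = port.peek_char()
        if ch is None or ch in DELIMITERS:
            return "".join(chars)
        chars.append(port.read_char())


# Usage: the closing parenthesis is left in the port after the identifier.
port = StringPort("foo)bar\nbaz")
name = read_identifier(port)
```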

Hope that helps. Is there anything else that's unclear about the R6RS
I/O ports system?

(FWIW, anyone implementing an R6RS-alike I/O port system might want to
look at the libc stdio system. It is very similar and also uses sources
and sinks. One difference, which I see this SRFI touches on, is that
stdio sinks flush if they are given a zero argument. I guess using a
special flush procedure is better. Without an explicit flush, a custom
output port can't really do its own buffering, so this is an improvement
over R6RS.)

Regards,

--
Göran Weinholt   | https://weinholt.se/
Debian Developer | 73 de SA6CJK