On Fri, Dec 4, 2020 at 5:22 PM Marc Nieper-Wißkirchen <xxxxxx@nieper-wisskirchen.de> wrote:

It depends. When the Scheme implementation allows filenames to be specified as bytevectors, all the processing can (and should) happen at the level of bytes. A conversion to a string would be necessary only for displaying the filenames to the user. Unfortunately, R7RS's utf8->string suffers from the same problem as `get-environment-variable`: anything may happen if the bytevector cannot be decoded. R6RS specifies that a replacement character should be inserted instead.
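To make the information loss of the replacement-character policy concrete, here is a small illustration in Python rather than Scheme (Python's `errors="replace"` inserts U+FFFD for undecodable bytes). The filenames are made up for the example.

```python
# Two different (hypothetical) filenames, each containing one byte
# that is not valid UTF-8.
first = b"report-\xff.txt"
second = b"report-\xfe.txt"

# Decoding with replacement maps each undecodable byte to U+FFFD.
shown_first = first.decode("utf-8", errors="replace")
shown_second = second.decode("utf-8", errors="replace")

# The two names are distinct as bytes but identical once displayed,
# so the original bytes cannot be recovered from the string.
names_differ = first != second
display_same = shown_first == shown_second
```

This is fine for display, but it is the reason the processing itself should stay at the byte level.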

There is nothing to prevent someone from writing a SRFI that provides such a facility. One idea would be to pass a handler that accepts an index into the bytevector where the undecodable byte sequence begins and returns a string of characters to be inserted and the index to continue processing from. This could be done using the continuable exception framework, which makes it easy to break out of the call to the safe-bytevector->utf-8 procedure. (Note that the cost of raising an exception is only slightly higher than that of making an indirect call to a procedure, because the handler stack has to be manipulated.)
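A rough sketch of the handler idea, written in Python for illustration only. Python has no continuable exceptions, so a plain callback stands in for the exception framework; the names `decode_utf8_with_handler` and `hex_escape_handler` are hypothetical and not part of any SRFI or library.

```python
from typing import Callable, Tuple

# A handler receives the whole byte string and the index where the
# undecodable sequence begins; it returns the text to insert and the
# index at which decoding should resume.
Handler = Callable[[bytes, int], Tuple[str, int]]


def decode_utf8_with_handler(data: bytes, handler: Handler) -> str:
    pieces = []
    pos = 0
    while pos < len(data):
        try:
            # Try to decode everything that is left in one go.
            pieces.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            bad = pos + err.start  # absolute index of the bad byte
            # Everything before the bad byte is valid UTF-8.
            pieces.append(data[pos:bad].decode("utf-8"))
            text, resume = handler(data, bad)
            if resume <= bad:
                raise ValueError("handler must advance past the bad byte")
            pieces.append(text)
            pos = resume
    return "".join(pieces)


def hex_escape_handler(data: bytes, index: int) -> Tuple[str, int]:
    # Example policy: show the offending byte as <XX> and skip one byte.
    return "<%02X>" % data[index], index + 1


result = decode_utf8_with_handler(b"caf\xe9.txt", hex_escape_handler)
# result == "caf<E9>.txt"
```

The sketch re-decodes the remaining input after every bad byte, which is quadratic in the worst case; a real implementation would presumably decode incrementally. A handler that wants to abort can simply raise its own exception.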

I think the basic problem is that there are some situations where R7RS says "it is an error" even though the situation is not controllable by the programmer.

R6RS could eliminate these because it was willing to break backward compatibility with existing systems; R7RS could not, because it was not. It would be interesting, however, to go through the 114 instances of "is an error" in R7RS-small and attempt to isolate the ones in question, and then see how they map onto the condition types of R6RS-libs chapter 8. Then a SRFI could be written specifying a suitable set of error predicates. (There is a minor naming confusion possible between R7RS `read-error?`, which is for `read` to report lexical syntax errors, and R6RS `i/o-read-error?`, which is for any kind of input failure.)

Maybe John knows: Is there a way to encode an arbitrary bytestring in a Unicode string? The problem with the replacement character is that information is lost. I am wondering whether one could set up a meaningful bijection between the countable set of bytevectors and the countable set of Unicode strings.

There are many such bijections. Obviously base64, hex digits, and similar mappings do the job at the expense of even partial intelligibility. One approach for use when UTF-8 is expected but may not be forthcoming would be to use the reserved non-character U+FFFE (which is meant for internal purposes only and should never be used in interchange) followed by two hex digits to represent an isolated undecodable byte. In principle an R7RS-small implementation might not be able to represent this character, but in practice all R7RS-small implementations are `full-unicode`, at least with default compiler options.
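Here is a sketch of the U+FFFE-plus-two-hex-digits scheme, in Python for illustration. The function names and the error-handler name `"fffe-hex"` are made up for this example. One assumption goes beyond what is described above: a literal U+FFFE already present in the input is itself escaped byte by byte, since otherwise it could be mistaken for the marker on the way back. With that choice the round trip from bytes is lossless, although not every string is the image of some byte sequence.

```python
import codecs
import re

MARK = "\ufffe"               # noncharacter used as the escape marker
MARK_UTF8 = b"\xef\xbf\xbe"   # UTF-8 encoding of U+FFFE
_ESCAPE = re.compile("\ufffe([0-9A-Fa-f]{2})")


def _escape_bad_bytes(err):
    # Error handler for decoding: replace each undecodable byte with
    # MARK followed by its two hex digits, then resume after it.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(MARK + "%02X" % b for b in bad), err.end


codecs.register_error("fffe-hex", _escape_bad_bytes)


def escape_from_bytes(data: bytes) -> str:
    # Assumption of this sketch: a literal U+FFFE in the input is
    # escaped too, so the marker never appears unescaped in the result.
    literal = "".join(MARK + "%02X" % b for b in MARK_UTF8)
    parts = data.split(MARK_UTF8)
    return literal.join(p.decode("utf-8", errors="fffe-hex") for p in parts)


def unescape_to_bytes(text: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(text):
        m = _ESCAPE.match(text, i)
        if m:
            out.append(int(m.group(1), 16))  # escaped raw byte
            i = m.end()
        else:
            out += text[i].encode("utf-8")   # ordinary character
            i += 1
    return bytes(out)


raw = b"caf\xe9.txt"
escaped = escape_from_bytes(raw)       # "caf" + U+FFFE + "E9.txt"
restored = unescape_to_bytes(escaped)  # equal to raw
```

A string produced this way stays readable wherever the input was valid UTF-8, which is the advantage over base64 or plain hex.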

John Cowan http://vrici.lojban.org/~cowan xxxxxx@ccil.org
Possession is said to be nine points of the law,
but that's not saying how many points the law might have.
--Thomas A. Cowan (law professor and my father)