Re: Bytevectors instead of strings in SRFI 170

Bytevectors instead of strings in SRFI 170 Lassi Kortela (02 Aug 2020 19:30 UTC)

Re: Bytevectors instead of strings in SRFI 170 Marc Nieper-Wißkirchen (03 Aug 2020 20:18 UTC)

Re: Bytevectors instead of strings in SRFI 170 John Cowan (03 Aug 2020 21:23 UTC)

Re: Bytevectors instead of strings in SRFI 170 Marc Nieper-Wißkirchen (03 Aug 2020 21:28 UTC)

Re: Bytevectors instead of strings in SRFI 170 Lassi Kortela (03 Aug 2020 21:32 UTC)

Re: Bytevectors instead of strings in SRFI 170 Marc Nieper-Wißkirchen (03 Aug 2020 21:34 UTC)

Re: Bytevectors instead of strings in SRFI 170 Marc Nieper-Wißkirchen 03 Aug 2020 21:28 UTC

Thank you for your explanation. What does it mean for, say,
R7RS-small's 'open-input-file'? How are filenames handled that cannot
be interpreted as a valid Unicode character sequence?

If this is a general problem, would it make sense to make the
R7RS-small procedures polymorphic so that they accept bytevectors as
arguments as well?
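
For comparison, here is a minimal Python sketch of one way a language
with Unicode strings can carry such filenames. It is only an
illustration, not something R7RS or SRFI 170 specifies: the filename is
made up, and the "surrogateescape" error handler is Python's own
mechanism (PEP 383), which maps each undecodable byte to a lone
surrogate so the original bytes can be recovered.

```python
# Hypothetical POSIX-style filename containing a byte (0xFF)
# that can never appear in valid UTF-8.
raw_name = b"report-\xff.txt"

# Strict UTF-8 decoding rejects the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With "surrogateescape", the bad byte 0xFF becomes the lone
# surrogate U+DCFF, so the name still fits in a str ...
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# ... and encoding with the same handler restores the exact bytes.
round_trip = as_text.encode("utf-8", errors="surrogateescape")
```

The resulting string is not a valid Unicode character sequence, which
is the price of keeping string-only procedures; accepting bytevectors
directly avoids that compromise.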

On Mon, 3 Aug 2020 at 23:23, John Cowan <xxxxxx@ccil.org> wrote:
>
> In Posix systems, a filename is a sequence of arbitrary bytes with the exceptions of 0x2F (/) and 0x00.  In Windows, a filename is a sequence of arbitrary 16-bit shorts (stored little-endian, like everything on Windows) with the exceptions of 0x002F (/), 0x005C (\), 0x003C (<), 0x003E (>), 0x0022 ("), 0x003A (:), 0x007C (|), 0x003F (?), 0x002A (*), and 0x0000.
>
> Posix filenames must be translated between the implicit character encoding used by the filesystem and whatever the internal Scheme encoding is.  This also varies with the framework in use, if any: Glib assumes that all external paths are in UTF-8, whereas KDE assumes that they are in the encoding specified by the process locale.  Windows filenames must be translated between UTF-16LE and whatever the internal Scheme encoding is.
>
> In both cases, it's possible to create pathnames that cannot be interpreted as a sequence of Unicode characters.  This also means that Lassi's pre-SRFI must have some way of telling the caller whether names are 8-bit or 16-bit.
>
>
> On Mon, Aug 3, 2020 at 4:18 PM Marc Nieper-Wißkirchen <xxxxxx@nieper-wisskirchen.de> wrote:
>>
>> Would using strings assume that the underlying character encoding of
>> the OS is UTF-8? Can we assume this in 2020? Or do we have to convert
>> from whatever local encoding to Unicode?
>>
>> Strings instead of bytevectors make some sense because the basic
>> R7RS-small procedures dealing with file names all take (Unicode)
>> strings as arguments.
>>
>> On Sun, 2 Aug 2020 at 21:30, Lassi Kortela <xxxxxx@lassi.io> wrote:
>> >
>> > The following procedures take and return strings:
>> >
>> > (read-symlink fname)
>> > (directory-files dir [dotfiles?])
>> > (make-directory-files-generator dir [dotfiles?])
>> > (open-directory dir [dot-files?])
>> > (real-path path)
>> > (create-temp-file [prefix])
>> > (call-with-temporary-filename maker [prefix])
>> > (user-info uid/name)
>> > (group-info gid/name)
>> >
>> > I think we should define them so that when bytevectors are given in
>> > place of strings as arguments, strings will be returned as bytevectors
>> > as well. This is what the Python OS APIs do nowadays: for example,
>> > compare os.listdir("/") and os.listdir(b"/"). IIRC they started with
>> > strings only but ran into trouble and had to add the bytes support.
>> >
>> > In SRFI 170, (current-directory) is idiosyncratic in that it returns a
>> > string but takes no string argument whose type could determine whether
>> > to return a string or a bytevector. Also, Schemes like Gambit and Kawa
>> > have a per-thread current directory, which I presume is a string.
>> >
>> > I'll provide an `os-current-directory-as-bytevector` procedure in the
>> > upcoming SRFI
>> > <https://misc.lassi.io/2020/srfi-submit/process-state-as-bytevectors.html>.
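
The Python behaviour mentioned above can be sketched as follows. This
is a small illustration under stated assumptions, not a proposal for
the SRFI text: the temporary directory and the file name "example.txt"
are made up for the example, and only standard-library calls are used.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so the directory has an entry to list.
    open(os.path.join(tmp, "example.txt"), "w").close()

    # A str argument yields str names ...
    text_names = os.listdir(tmp)

    # ... and a bytes argument for the same directory yields bytes names.
    byte_names = os.listdir(os.fsencode(tmp))
```

Here `text_names` holds strings and `byte_names` holds bytes objects
for the same entry: the result type follows the argument type. A
procedure such as (current-directory), which takes no path argument,
has nothing to dispatch on in this scheme.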