Re: upcoming revision, need feedback

Re: upcoming revision, need feedback Derick Eddington 11 Jan 2010 05:03 UTC

Thanks for your input on these issues.

On Mon, 2010-01-11 at 04:57 +0200, Vitaly Magerya wrote:
> Derick Eddington wrote:
> > I think the pathname component separators do need to be defined.
> > [...] if they're undefined, the encoded set would not be clearly,
> > precisely, completely specified.
>
> The current draft sets the encoded set to be <a list of chars and the
> path separator>. The set of path separators depends on a platform [1],
> but the set of encoded characters should not (for portability reasons).
> So you must include all the possible separators from all the supported
> platforms in the encoded set -- after that specifying each of them
> separately serves no purpose.

I know the encoded set is supposed to be the same on all targeted
platforms and so must include all their path separators.  I know the
current draft needs to be improved in this regard, and that's why I'm
planning on making this change I listed:

        8) Rephrase to say what the set of encoded characters is and
        then say why particular characters are encoded.  The current
        phrasing is not as clear.

(which probably wasn't described clearly enough)

It will state the set of literal characters (i.e. not abstract ones like
the current draft's "path separator") and then say things like "these
are encoded for this reason" and "these same ones for this additional
reason".
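
As a rough sketch of what "a set of literal characters" could look like
in practice (Python used for illustration only; the members of the set
and the %XX escape form are assumptions made for this example, not the
draft's actual list or encoding):

```python
# Hypothetical encoded set: literal characters only, with no abstract
# "path separator" entry. Both the members and the %XX escape form are
# assumptions for illustration, not taken from the draft.
ENCODED_SET = frozenset('/\\:;%')

def encode_component(name):
    """Escape each character in ENCODED_SET as %XX per UTF-8 byte.

    Every other character, including non-ASCII ones, is left as-is.
    """
    out = []
    for ch in name:
        if ch in ENCODED_SET:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)
```

With this set, encode_component("a/b;c") gives "a%2Fb%3Bc", while a
name written entirely in another script passes through unchanged.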

I think you're right that defining the path separators should not be
redundantly done outside the (planned) explanation for why characters
are encoded.

(I still think the environment variable element separators should be
defined in the sections about the environment variables, even though
they'll also be specified in the encoded characters explanation.)
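
To make the link between the element separator and the encoded set
concrete, here is a minimal sketch (Python, illustration only). It
assumes #\; is the SCHEME_LIB_PATH element separator, as item 7 quoted
below implies; the function name and the %3B escape form are
hypothetical:

```python
# Assumption: ";" separates the elements of SCHEME_LIB_PATH.
def split_lib_path(value):
    """Split a SCHEME_LIB_PATH-style string into its directory elements."""
    return [element for element in value.split(";") if element]

# A directory literally named "/libs/a;b" is misread as two elements.
unencoded = split_lib_path("/libs/a;b;/libs/c")

# With the ";" escaped (hypothetical %3B form), the elements stay intact.
encoded = split_lib_path("/libs/a%3Bb;/libs/c")
```

Here unencoded has three elements ("/libs/a", "b", "/libs/c") while
encoded has the intended two.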

> >>> 7) Add #\; to the set of encoded characters, because a directory could be both
> >>> in the SCHEME_LIB_PATH sequence and correspond to a library name component.
> >>> Such a directory with a name including #\; is unusual but must be supported,
> >>> otherwise an unencoded #\; would be misinterpreted in SCHEME_LIB_PATH.
> >>
> >> I heard that when you strive for fail-safety it's best to enumerate
> >> the allowed things, not the forbidden ones.
> >
> > I don't think that justifies what you suggest below.
>
> It is generally hard to list all the failure conditions, but easy to
> list success conditions.
>
> Let me illustrate: ~ is missing in the encoded set, since Windows
> treats that character specially (e.g. "PROGRA~1" is a shortcut to the
> first file starting with "Progra").

Ugh.  The Microsoft page [1] about what characters to avoid does not say
that #\~ is treated specially.  Should #\~ be added to the encoded set?

> Another example is ¥ (U+00A5). When represented in Japanese cp-932 it
> maps to #x5C (just as \ does in ascii), which is treated as a path
> separator. Because of this some programs (e.g. Cygwin) will choke on
> filenames with U+00A5 when cp-932 is your local codepage, even though
> U+00A5 itself is perfectly legal. This also applies to ₩ (U+20A9) in
> Korean (cp-949), and possibly more.

Ugh.  I think that type of problem should be outside this SRFI's
concern, because it's variable and dependent on individuals' codepage
configurations, and there is not a proper solution (encoding the
majority of characters is not acceptable).

> >> How about "Encode everything
> >> except for [a-zA-Z0-9_.-]"? It's safe, short, simple and works for 99%
> >> of libraries without any encoding at all.
> >
> > Other cultures' characters must be usable unencoded, especially since
> > the targeted file systems support using them, and we do not want to
> > discriminate against other cultures' use of Scheme growing beyond 1%
> > of libraries.
>
> FWIW, using non-ascii symbols in source files is widely considered bad
> manners in my culture. So while I do recognize value in not needing to
> encode these symbols, I won't complain much about the discrimination.

Well, I think that's an unfortunate consequence of archaic poor
English-only designs, and your culture should take advantage of modern
character freedom :)

I think the cultures with very different alphabets, and there are
millions of programmers in them, are those who most appreciate being
able to use their characters unencoded.  If their characters are
encoded, their library files' names will be unintelligible, and that's
not acceptable.
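
A small sketch of what the quoted allow-list proposal would do to such
a name (Python, illustration only; the %XX-per-UTF-8-byte escape form
and the function name are assumptions made for the example):

```python
import re

# The allow-list from the quoted suggestion: [a-zA-Z0-9_.-].
ALLOWED = re.compile(r"[a-zA-Z0-9_.-]")

def encode_allow_list(name):
    """Escape every character outside the allow-list as %XX per UTF-8 byte."""
    out = []
    for ch in name:
        if ALLOWED.fullmatch(ch):
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)
```

Under these assumptions, encode_allow_list("список") yields
"%D1%81%D0%BF%D0%B8%D1%81%D0%BE%D0%BA": a six-letter name becomes 36
characters that nobody can read at a glance, whereas "my-lib_1.0" is
left untouched.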

> Also note that file system support for localized characters in Windows
> is (was?) problematic since it uses the local codepage in many places. Due
> to this a filename with a Ukrainian 'і' (U+0456) is not accessible via
> an SMB mount from a Windows with Russian settings [2].
>
> [2] Once upon a time this bit a fair share of accountants in Ukraine.

The Microsoft page [1] says what characters are disallowed, and that's
what this SRFI is following.  I'll add any other unmentioned
prohibited/special/reserved characters to the set if necessary, but I
will not make other cultures' characters be encoded.  People who want to
use whatever characters can configure their Windows crap to make those
characters work in file names, right?  And when such files are packaged
and distributed to another platform, the correct file names will be
used, right?

[1] http://msdn.microsoft.com/en-us/library/aa365247(VS.85).aspx

--
: Derick
----------------------------------------------------------------