Re: on waste-of-time arguments.... Thomas Lord 20 Jul 2005 18:46 UTC

(Aside: sorry to keep breaking threading.  I've recently
changed MUAs and for some reason the new setup is mucking
up threaded replies.   Hopefully it'll be fixed soon.)

me:
>>    Therefore, if the character and string functions are "crude"
>>    with respect to natural language, then an implementation
>>    *can not* (cleanly, simply) allow identifier names which are
>>    globally-natural-language-friendly except in a crude way.

John:
> Can you give an example?  I don't understand how this principle
> applies. S-75 provides case-{in,}sensitive {character,string}
> {identity,collation} functions, and provides syntax for the full scope
> of Unicode scalar values as characters and USV sequences as strings.
> Furthermore, every character string can be mapped to a symbol and vice
> versa (excluding uninterned symbols, which are not part of the
> standard).  What is more, identifiers are explicitly made case-
> sensitive, so the definition of the string-ci family
> no longer affects them.

Bob Unihacker wants to implement a Scheme in which two identifiers
in the source text are identical if they are codepoint equal
under some chosen form of canonicalization.   He argues
that his particular canonicalization rule is the best.

He cannot identify two such spellings of an identifier, though.
If his implementation identified them then, under S-75, two
different strings would have to convert to the same symbol.
S-75 doesn't permit that.
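
To make the conflict concrete, here is a rough sketch.  It is in
Python rather than Scheme only so that it is easy to run.  NFC is
just a stand-in for whatever canonicalization Bob picks, and
string_to_symbol is a hypothetical name for his interning step,
not anything taken from S-75:

```python
import unicodedata

# Hypothetical intern table standing in for Bob's reader: names are
# canonicalized (NFC is an assumed choice) before being interned.
_symbols = {}

def string_to_symbol(name):
    """Return the one interned symbol for the canonical form of name."""
    key = unicodedata.normalize("NFC", name)
    return _symbols.setdefault(key, ("symbol", key))

precomposed = "caf\u00e9"   # e-acute as a single code point
decomposed = "cafe\u0301"   # plain e followed by a combining acute

# The two strings are not codepoint-equal...
print(precomposed == decomposed)
# ...yet under Bob's rule they name the same symbol.
print(string_to_symbol(precomposed) is string_to_symbol(decomposed))
```

The first line printed is False and the second is True: two
distinct strings, one symbol, which is exactly the many-to-one
mapping a strict 1:1 string:symbol rule rules out.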

Hence my proposed solution which can be summarized:

    Allow, heck, even require chars and
    strings to be Unicode-friendly.

    Break the 1:1 string:symbol mapping and
    do not standardize symbol names or
    identifier names other than those
    containing only 8-bit codepoints.

That leaves Bob Unihacker with a very un-lisplike problem
to solve:  his nice, fully Unicode source texts are no longer
syntactically valid S-expressions.

However, unlike in the current S-75, under my proposal
Bob's problem is solvable -- in multiple ways.    Thus,
Bob and all of his competitors can each propose new
higher-level char-like and string-like types, and/or
new conversion rules between strings and symbols, etc.

In other words, my proposal enables the kind of experimentation
the editors would like to see and, indeed, requires that
experimentation of Bob and friends.

> I don't see how we are forcing identifiers to be crude.  We are
> permitting distinct identifiers that look exactly alike, yes.
> However, if we allow identifiers other than in Latin script at all,
> then such spoofs are always possible; to take only the simplest
> example, Latin A, Greek Alpha, and Cyrillic A look exactly alike.

I dispute that any standard requires those characters to be presented
indistinguishably.

Moreover, different choices of identifier name canonicalization
permit or deny different subsets of the possible spoofs.   It
is up to future experimenters to find the sweet spot for Scheme
(though, certainly, consortium recommendations on the issue
ought not be ignored).

So, killing all possible spoofing isn't necessarily Bob's goal.
Picking and choosing among which kinds of spoofing is a good
goal and, imo, an open question.
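
A small illustration of "different rules, different spoofs",
again as a Python sketch.  The helper same_identifier is made up
for this message, and the two Unicode normalization forms are only
examples of rules an implementor like Bob might consider:

```python
import unicodedata

def same_identifier(a, b, form):
    """True if a and b are codepoint-equal after normalizing to form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Look-alike capitals from three scripts (John's example).
latin_a, greek_alpha, cyrillic_a = "\u0041", "\u0391", "\u0410"

# "file" spelled with the U+FB01 fi ligature versus plain letters.
ligature, plain = "\ufb01le", "file"

for form in ("NFC", "NFKC"):
    print(form,
          same_identifier(ligature, plain, form),
          same_identifier(latin_a, greek_alpha, form),
          same_identifier(latin_a, cyrillic_a, form))
```

Under NFC none of the pairs is identified.  Under NFKC the ligature
spelling collapses into the plain one, but the cross-script
look-alikes stay distinct.  Neither rule kills every spoof; each
one draws the line in a different place.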

>> b) An analogous argument applies to the streams emitted and consumed
>>    by READ and WRITE.   (This isn't *really* a separate point from
>>    (a) but people commonly treat it that way.)

> I don't understand this argument either, alas.

As I say above: code and data aren't really different in
Scheme although arguably that hasn't always been the case.
READ and WRITE are generic ways to externalize and import
data;  they ought to agree with the reader that loads
source text into an interpreter or compiler.

> Please spell out these implications (preferably with examples), as I
> remain entirely in the dark.

Does the above help?  Perhaps I'm in the dark (in the sense of missing
some point).

-t