On Wed, 11 Feb 2004, Bradd W. Szonye wrote:
>> At Tue, 10 Feb 2004 13:06:28 -0800 (PST), Tom Lord wrote:
>>> There is an easy example of why such a category is desirable in
>>> computing. Let's suppose that I'm going to specify the lexical
>>> syntax of identifiers in a programming language. As part of that
>>> specification, I'll need to identify this category. (For an example,
>>> see "Unicode Technical Report #31: Identifier and Pattern Syntax",
>>> http://www.unicode.org/reports/tr31/tr31-2.html)
>
>Alex Shinn wrote:
>> We may want to take that report with a grain of salt for Scheme. A
>> simpler approach would be to define Scheme identifiers as everything
>> _excluding_ the reserved punctuation characters, optionally allowing
>> Unicode variations on those characters and extending the definition of
>> whitespace. Most Schemes already work in this manner, despite the
>> fact that R5RS uses an inclusive list ....
>
>Agreed. It has the same basic flaw as Annex 7 of UTR 15: It isn't a
>syntax for programming-language identifiers, it's a syntax for C-family
>identifiers! Both reports blithely ignore the fact that not all
Agreed.

There are some appropriate restrictions, I think. Identifiers should
not begin with:

  * a combining character
  * a non-character codepoint
  * a whitespace character
  * a control character
  * a character that can begin a syntactically valid number
    (digit, sign, point)
  * a delimiter (parens, at least)

Identifiers should not contain:

  * whitespace
  * delimiters
  * non-character codepoints
  * control characters
  * invalid sequences
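
The two lists above can be sketched as a pair of predicates. This is
only an illustration in Python, not a proposal for any particular
Scheme reader. The names DELIMITERS, NUMBER_STARTERS, is_noncharacter,
forbidden_anywhere, forbidden_at_start and is_valid_identifier are
hypothetical, and so are the choices behind them: the exact delimiter
set, treating "digit" as Unicode category Nd, and treating a lone
surrogate as the "invalid sequence" case. It also ignores the R5RS
peculiar identifiers (+, - and ...), which a real reader would have to
special-case.

```python
import unicodedata

# Assumed delimiter set; the post only insists on parentheses.
DELIMITERS = set('()[]{}";\'`,')
# Sign and point; digits are detected by Unicode category below.
NUMBER_STARTERS = set("+-.")

def is_noncharacter(ch):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def forbidden_anywhere(ch):
    """Characters that may not appear at any position in an identifier."""
    cat = unicodedata.category(ch)
    return (
        ch.isspace()             # whitespace
        or ch in DELIMITERS      # delimiters
        or is_noncharacter(ch)   # non-character code points
        or cat == "Cc"           # control characters
        or cat == "Cs"           # lone surrogate, i.e. an invalid sequence
    )

def forbidden_at_start(ch):
    """Characters that may not begin an identifier."""
    cat = unicodedata.category(ch)
    return (
        forbidden_anywhere(ch)
        or cat.startswith("M")   # combining marks (Mn, Mc, Me)
        or cat == "Nd"           # decimal digits
        or ch in NUMBER_STARTERS
    )

def is_valid_identifier(s):
    """Apply the start rule to the first character, the general rule to the rest."""
    if not s:
        return False
    if forbidden_at_start(s[0]):
        return False
    return not any(forbidden_anywhere(c) for c in s[1:])
```

Under these assumptions "list->vector" and "a" followed by U+0301 are
accepted, while "1foo", "(foo", "foo bar" and U+0301 followed by "a"
are rejected.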

The minimum requirement for case insensitivity as defined by
R5RS gives another rule:

  * No character in an identifier ought to be automatically
    converted to the implementation's preferred case (and no two
    identifiers differing only in the case of that character
    ought to be considered the same identifier) unless it is
    part of a one-to-one reciprocal pair of upper- and lower-case
    characters as identified by char-upcase, char-downcase, and
    char-ci=?. This, finally, is the property that is required
    of the char-alphabetic? characters in the portable character
    set: R5RS does not say so specifically, but it is not possible
    to comply with R5RS without meeting this requirement.
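
To make the "one-to-one reciprocal pair" test concrete, here is a
rough Python sketch. It uses Python's str.upper and str.lower as
stand-ins for char-upcase and char-downcase, which is an assumption:
a Scheme implementation's tables need not agree with Python's. The
names has_reciprocal_case_pair and fold_identifier are hypothetical.

```python
def has_reciprocal_case_pair(ch):
    """True if ch belongs to a one-to-one upper/lower pair that round-trips."""
    up, low = ch.upper(), ch.lower()
    # Reject mappings that expand to several characters (for example
    # U+00DF, whose uppercase is "SS") and caseless characters.
    if len(up) != 1 or len(low) != 1 or up == low:
        return False
    # ch must itself be one of the two members, and each member must
    # map back to the other.
    return ch in (up, low) and up.lower() == low and low.upper() == up

def fold_identifier(name):
    """Fold only the characters that pass the reciprocal-pair test."""
    return "".join(
        c.lower() if has_reciprocal_case_pair(c) else c for c in name
    )
```

With Python's tables, "a" and "A" pass, while U+00DF (sharp s),
U+0131 (dotless i) and U+212A (Kelvin sign) fail, so a folder built
this way leaves those characters alone instead of merging identifiers
that differ only by them.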

Note that R5RS permits 'rules lawyering' with respect to this
requirement: an implementation of R5RS is fairly easy if no characters
other than a ... z and A ... Z are case-folded in case-insensitive
identifiers and char-alphabetic? returns #t for only those characters.
In that case char-alphabetic? would wrongly return #f for every other
alphabetic character, but the letter of R5RS (so to speak) would be
satisfied, however uselessly.
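
For comparison, the minimal letter-of-the-law approach looks roughly
like this in Python. The names char_alphabetic_ascii and fold_ascii
are hypothetical; the point is only that nothing outside a ... z and
A ... Z is ever reported as alphabetic or folded.

```python
import string

def char_alphabetic_ascii(ch):
    """Report only the 52 ASCII letters as alphabetic."""
    return len(ch) == 1 and ch in string.ascii_letters

def fold_ascii(name):
    """Case-fold ASCII letters and pass every other character through."""
    return "".join(
        c.lower() if char_alphabetic_ascii(c) else c for c in name
    )
```

Here char_alphabetic_ascii("é") is false even though the character is
plainly a letter, which is exactly the uselessness complained of
above.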
Bear