Identifiers bear 12 Feb 2004 07:31 UTC


On Wed, 11 Feb 2004, Bradd W. Szonye wrote:

>> At Tue, 10 Feb 2004 13:06:28 -0800 (PST), Tom Lord wrote:
>>> There is an easy example of why such a category is desirable in
>>> computing.  Let's suppose that I'm going to specify the lexical
>>> syntax of identifiers in a programming language.  As part of that
>>> specification, I'll need to identify this category.  (For an example,
>>> see "Unicode Technical Report #31: Identifier and Pattern Syntax",
>>> http://www.unicode.org/reports/tr31/tr31-2.html)
>
>Alex Shinn wrote:
>> We may want to take that report with a grain of salt for Scheme.  A
>> simpler approach would be to define Scheme identifiers as everything
>> _excluding_ the reserved punctuation characters, optionally allowing
>> Unicode variations on those characters and extending the definition of
>> whitespace.  Most Schemes already work in this manner, despite the
>> fact that R5RS uses an inclusive list ....
>
>Agreed. It has the same basic flaw as Annex 7 of UTR 15: It isn't a
>syntax for programming-language identifiers, it's a syntax for C-family
>identifiers! Both reports blithely ignore the fact that not all

Agreed.

There are some appropriate restrictions, I think; identifiers should
not begin with:

  * a combining character
  * a non-character codepoint
  * a whitespace character
  * a control character
  * a character which can begin a syntactically valid number
    (digits, sign, point)
  * a delimiter (parens, at least)

Identifiers should not contain:
  * whitespace
  * delimiters
  * non-character codepoints
  * control characters
  * invalid sequences
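
As a rough sketch of the two lists above, here is one way the checks
could be written. It is in Python only because its unicodedata module
exposes the relevant character properties; nothing here is proposed
Scheme syntax. The names (DELIMITERS, allowed_anywhere, and so on) are
made up for illustration, the delimiter set beyond parentheses is an
assumption, and lone surrogates stand in for "invalid sequences" since
a decoded string has no other kind.

```python
import unicodedata

# Assumption: only parentheses are required by the lists above; the other
# characters are R5RS-style delimiters included for illustration.
DELIMITERS = set("()[]{}\";'`,")


def is_noncharacter(ch):
    """True for Unicode noncharacter code points."""
    cp = ord(ch)
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def allowed_anywhere(ch):
    """Characters an identifier may contain at any position."""
    cat = unicodedata.category(ch)
    return not (
        ch.isspace()            # whitespace
        or ch in DELIMITERS     # delimiters
        or is_noncharacter(ch)  # non-character codepoints
        or cat == "Cc"          # control characters
        or cat == "Cs"          # lone surrogate (stand-in for invalid sequences)
    )


def allowed_at_start(ch):
    """Characters an identifier may begin with."""
    cat = unicodedata.category(ch)
    return (
        allowed_anywhere(ch)
        and not cat.startswith("M")  # combining marks: Mn, Mc, Me
        and cat != "Nd"              # decimal digits can begin a number
        and ch not in "+-."          # so can a sign or a point
    )


def is_valid_identifier(name):
    """Apply the 'begin with' and 'contain' restrictions to a string."""
    return (
        len(name) > 0
        and allowed_at_start(name[0])
        and all(allowed_anywhere(c) for c in name[1:])
    )
```

Taken literally, these rules reject "+", "-" and "..." as identifiers,
so an implementation following R5RS would still need to special-case
those peculiar identifiers.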

The minimum requirement for case insensitivity as defined by
R5RS gives another rule:

  * no character in an identifier ought to be automatically
    converted to the implementation's preferred case (and no
    two identifiers differing only in the case of that character
    ought to be considered the same identifier) unless it is
    part of a one-to-one reciprocal pair of upper- and lowercase
    characters as identified by char-upcase, char-downcase, and
    char-ci=?.  This is, in the end, the property that is required
    of the char-alphabetic? characters in the portable character
    set: R5RS does not say so explicitly, but it is not possible
    to comply with R5RS without meeting this requirement.
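
To make the "one-to-one reciprocal pair" condition concrete, here is a
small sketch, again in Python purely for illustration. It assumes the
preferred case is lowercase and uses Python's str.upper and str.lower
as stand-ins for char-upcase and char-downcase; the function names are
hypothetical.

```python
def is_reciprocal_case_pair(ch):
    """True if ch belongs to a one-to-one upper/lower pair."""
    upper, lower = ch.upper(), ch.lower()
    # Reject mappings that expand to several characters, and caseless characters.
    if len(upper) != 1 or len(lower) != 1 or upper == lower:
        return False
    # ch must itself be one of the two members (rules out titlecase forms).
    if ch != upper and ch != lower:
        return False
    # The two members must map back onto each other.
    return upper.lower() == lower and lower.upper() == upper


def fold_identifier(name):
    """Fold only characters in reciprocal pairs; leave all others untouched."""
    return "".join(
        c.lower() if is_reciprocal_case_pair(c) else c
        for c in name
    )
```

Under this sketch "FOO" and "foo" fold to the same identifier, while
characters such as the German sharp s or the dotless i, whose case
mappings do not round-trip, are left exactly as written.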

Note that R5RS permits 'rules lawyering' with respect to this
requirement: an implementation of R5RS is fairly easy if no characters
other than a ... z and A ... Z are case-folded in case-insensitive
identifiers and char-alphabetic? returns #t for only those characters.
char-alphabetic? would then wrongly return #f for all other alphabetic
characters, but the letter of R5RS (so to speak) would be satisfied,
however uselessly.
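
For comparison, the degenerate implementation just described comes to
only a few lines. This is a Python sketch with made-up names, not code
from any actual Scheme system.

```python
import string


def char_alphabetic_minimal(ch):
    """Report only a..z and A..Z as alphabetic."""
    return len(ch) == 1 and ch in string.ascii_letters


def fold_minimal(name):
    """Case-fold only the ASCII letters; every other character is left alone."""
    return "".join(
        c.lower() if c in string.ascii_letters else c
        for c in name
    )
```

With this version an uppercase and a lowercase Greek lambda name two
different identifiers, and char_alphabetic_minimal answers false for
both.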

				Bear