> At Tue, 10 Feb 2004 13:06:28 -0800 (PST), Tom Lord wrote:
>> There is an easy example of why such a category is desirable in
>> computing. Let's suppose that I'm going to specify the lexical
>> syntax of identifiers in a programming language. As part of that
>> specification, I'll need to identify this category. (For an example,
>> see "Unicode Technical Report #31: Identifier and Pattern Syntax",
>> http://www.unicode.org/reports/tr31/tr31-2.html)
Alex Shinn wrote:
> We may want to take that report with a grain of salt for Scheme. A
> simpler approach would be to define Scheme identifiers as everything
> _excluding_ the reserved punctuation characters, optionally allowing
> Unicode variations on those characters and extending the definition of
> whitespace. Most Schemes already work in this manner, despite the
> fact that R5RS uses an inclusive list ....
Agreed. It has the same basic flaw as Annex 7 of UTR 15: it isn't a
syntax for programming-language identifiers; it's a syntax for C-family
identifiers! Both reports blithely ignore the fact that not all
languages restrict identifiers to letters, numbers, and underscores.
Even COBOL permits dashes!
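Just to make the exclusion-based approach concrete, here's the rough
shape of the predicate I have in mind (plain R5RS; the reserved set
below is only an illustration, not a proposal):

    ;; Identifier characters defined by exclusion: anything that isn't
    ;; whitespace or reserved punctuation counts. The reserved set here
    ;; is illustrative only.
    (define reserved-chars
      (string->list "()[]{}\";'`,#|\\"))

    (define (identifier-char? c)
      (not (or (char-whitespace? c)
               (memv c reserved-chars))))

    ;; (identifier-char? #\-) => #t   ; dashes are fine, COBOL-style
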
The other thing I didn't care for (in UTR 15) was the recommendation to
use NFC for case-sensitive languages and NFKC for case-insensitive
languages. NFC is designed to allow round-trip conversion with legacy
character sets, so it preserves distinct code points for visually
indistinguishable symbols. For example, the three letters "ffi" and the
single-character "ffi" ligature (U+FB03) remain distinct under NFC.
That's a very bad property for programming language identifiers.
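R5RS gives us no way to ask for a normalization form, of course, but
assuming a pair of procedures along the lines of string-normalize-nfc
and string-normalize-nfkc, plus a reader that accepts hex character
escapes like #\xFB03, the difference looks like this:

    ;; Sketch only: the normalization procedures and the #\x character
    ;; syntax are assumed, not standard.
    (define ligature (string #\xFB03)) ; LATIN SMALL LIGATURE FFI, one char
    (define letters  "ffi")            ; the same thing as three chars

    (string=? (string-normalize-nfc  ligature) letters) ; => #f, NFC keeps them distinct
    (string=? (string-normalize-nfkc ligature) letters) ; => #t, NFKC folds the ligature
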
Unfortunately, NFKC isn't perfect either. One thing I especially dislike
is that it flattens the differences between the mathematical alphabets.
Here you have a case where graphemes *are* visually distinguishable, and
for good reason, but the normalization form treats them as identical. If
you're working on a sublanguage for symbolic mathematics, you might be
tempted to write a double-struck small j (U+1D55B) for a unit vector and
an italic small j (U+1D457) for the imaginary unit. But NFKC folds them
both down to a plain "j". You'll need to modify NFKC for mathematics, or
track the semantic distinction separately (which amounts to the same
thing).
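Under the same assumptions as the earlier sketch:

    ;; Same caveats: normalization procedures and #\x escapes assumed.
    (define double-struck-j (string #\x1D55B)) ; MATHEMATICAL DOUBLE-STRUCK SMALL J
    (define italic-j        (string #\x1D457)) ; MATHEMATICAL ITALIC SMALL J

    ;; NFKC erases exactly the distinction the mathematician wanted:
    (string=? (string-normalize-nfkc double-struck-j)
              (string-normalize-nfkc italic-j))  ; => #t, both become plain "j"

    ;; NFC leaves them alone:
    (string=? (string-normalize-nfc double-struck-j)
              (string-normalize-nfc italic-j))   ; => #f
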
It's especially bad if you're considering a language for typesetting
mathematics! (Not that anybody would ever want to implement a TeX-like
language in Scheme, right?) The Unicode character set is well-suited to
that task, but the normalization forms aren't, IMO.
Some of this is only tangentially relevant to Scheme, I realize.
However, I don't think the identifier requirements were particularly
well thought out. The standard normalization forms seem poorly suited
for precision tasks like source code. "If it looks the same, it may or
may not be the same thing" may work well enough for word processors, but
it's not good for compilers. And there's still the annoying fact that
these UTRs basically imply, "You can have identifiers for any language
you want, as long as it's C!"
--
Bradd W. Szonye
http://www.szonye.com/bradd