Re: Encodings. Bradd W. Szonye 13 Feb 2004 22:07 UTC

Ken Dickey wrote:
>>> Let's say that there is a Scheme SRFI (or even, *GASP*, a standard)
>>> which picks a single canonical Unicode form (say the most compact
>>> one) and requires, where Unicode is used, that Scheme programs be
>>> prepared in that format ....

Bradd wrote:
>> Such a program would not conform to the Unicode standard:

> Who cares?

Anybody who wants to claim that his compiler supports Unicode. It's a
licensing issue. Unicode is a trademark, and you can't claim that you
"support" Unicode unless you actually conform to the standard.

> Scheme does not conform to ASCII or EBCDIC.

A Scheme implementation that claims support for ASCII or EBCDIC had
better conform to the standards for those encodings. Even without the
trademark & licensing issues, it's downright perverse to claim support
for an encoding if you don't actually conform to the standard for it.

> Why should Scheme conform to the Unicode Standard(s)?

We *are* talking about Schemes that support Unicode, right? Well, part
of supporting Unicode is conforming to the standard. Seriously, why
would any vendor in his right mind provide this kind of
half-implementation?

> Defining what is an acceptable Scheme program should be sufficient.

Not if you want to call it "Unicode," it isn't.

> It is desirable that a Scheme with support for extended identifiers
> should not be large or expensive to implement.

Normalization is not difficult or expensive in a batch program like a
compiler. In particular, if you're carrying around the data for "Is this
a letter or a number?" it's trivial to also provide the canonical
compositions and decompositions. I don't know where you got the idea
that it's expensive.
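
To illustrate the point about shared data, here is a minimal sketch in
Python (not Scheme, and not taken from any Scheme implementation). It
assumes Python's standard unicodedata module as a stand-in for whatever
character tables a reader already carries; the helper name "describe" is
hypothetical. The same lookup that answers "letter or number?" also
hands back the decomposition mapping:

```python
import unicodedata

def describe(ch):
    """Return (general category, decomposition mapping) for one character.

    Both values come from the same character database, which is the
    point: a reader that can classify characters already has the data
    it needs to normalize them.
    """
    return unicodedata.category(ch), unicodedata.decomposition(ch)

# U+00E9 LATIN SMALL LETTER E WITH ACUTE: a lowercase letter that
# decomposes canonically to U+0065 followed by U+0301.
print(describe("\u00e9"))

# A decimal digit has a category but no decomposition mapping.
print(describe("7"))
```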

Deciding which normal form makes the most sense for your application is
the hardest part, hence the discussion of what to fold when normalizing
or collating/comparing identifiers. And you need to make that decision
regardless of which bit of code does the normalization.
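
As a rough sketch of why the choice of form matters, again in Python and
again assuming the standard unicodedata module (the helper name "fold"
is made up for this example): canonical normalization (NFC) leaves a
ligature alone, while compatibility normalization (NFKC) folds it into
plain letters, so the two forms disagree about which identifiers are
the same.

```python
import unicodedata

def fold(identifier, form):
    """Normalize an identifier to the given Unicode normalization form."""
    return unicodedata.normalize(form, identifier)

# An identifier that starts with U+FB01 LATIN SMALL LIGATURE FI.
ligature = "\ufb01le"

# NFC only applies canonical mappings, so the ligature survives.
nfc = fold(ligature, "NFC")

# NFKC also applies compatibility mappings, so the ligature becomes "fi".
nfkc = fold(ligature, "NFKC")

print(nfc == ligature)   # the NFC result is unchanged
print(nfkc == "file")    # the NFKC result matches the plain spelling
```

Whether "file" and the ligature spelling should name the same variable
is exactly the kind of folding question under discussion; the code only
shows that the answer depends on the form you pick.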

Also, I'm wondering how it would actually be faster or simpler to handle
it in a separate program.

>>     C9. A process shall not assume that the interpretations of two
>>         canonical-equivalent character sequences are distinct.
>>
>> This section goes on to concede that
>>
>>     Ideally, an implementation would always interpret two
>>     canonical-equivalent character sequences identically. There are
>>     practical circumstances under which implementations may reasonably
>>     distinguish them.

> Scheme does not IMPLEMENT Unicode.

*Any* program that handles Unicode data implements Unicode! That
includes Scheme compilers that support Unicode sources.
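
For a concrete picture of what honoring C9 could look like in a reader,
here is a small Python sketch. The SymbolTable class and its intern
method are hypothetical names for illustration, and NFC is an assumed
choice of normal form, not something this SRFI has settled. Identifiers
are normalized before they are interned, so two canonical-equivalent
spellings end up as the same symbol:

```python
import unicodedata

class SymbolTable:
    """Interns identifiers so canonical-equivalent spellings share one symbol."""

    def __init__(self):
        self._symbols = {}

    def intern(self, spelling):
        # Normalize first (NFC is an assumption made for this sketch),
        # then return the single stored object for that normalized key.
        key = unicodedata.normalize("NFC", spelling)
        return self._symbols.setdefault(key, key)

table = SymbolTable()
precomposed = table.intern("caf\u00e9")   # e-acute as one code point
decomposed = table.intern("cafe\u0301")   # e followed by combining acute

# Both spellings resolve to the identical interned symbol.
print(precomposed is decomposed)
```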

>> In other words, recognizing canonically-equivalent characters *is*
>> the responsibility of the reader, if it claims to implement the
>> Unicode character set.

> I still fail to see why one would wish to make such a claim.

That's the whole point of this SRFI! Any program which can process
Unicode characters is an implementation of Unicode and must conform to
the standard. If it doesn't, you can't claim support for Unicode (and
the Consortium has trademarks to back that up).

> I have not yet seen a convincing case made for making Scheme "a
> conforming Unicode implementation".  Convince me!

I suspect that you simply don't understand what a "Unicode
implementation" is.
--
Bradd W. Szonye
http://www.szonye.com/bradd