Re: Encodings. Bradd W. Szonye 13 Feb 2004 18:03 UTC

On Fri, Feb 13, 2004 at 07:51:49AM +0100, Ken Dickey wrote:
>>> What about creating a tool which reads bizarre Unicode and writes it
>>> out in a canonical format?  Then requiring portable Scheme programs to
>>> pass through it?

> Bradd W. Szonye wrote:
>> That wouldn't help unless they agree to write the *same* canonical
>> format. Besides, this is just separating part of the reader's job into
>> an external program, and in an error-prone way.

> I think there is again confusion between processing Unicode data and
> reading Scheme programs.
>
> Let's say that there is a Scheme SRFI (or even, *GASP*, a standard)
> which picks a single canonical Unicode form (say the most compact
> one) and requires, where Unicode is used, that Scheme programs be
> prepared in that format ....

Such a program would not conform to the Unicode standard:

    C9. A process shall not assume that the interpretations of two
        canonical-equivalent character sequences are distinct.

This section goes on to concede that

    Ideally, an implementation would always interpret two
    canonical-equivalent character sequences identically. There are
    practical circumstances under which implementations may reasonably
    distinguish them.

For example, a program may implement an earlier version of the standard,
and therefore not recognize that newer sequences are supposed to be
canonically equivalent. However, a program that implemented Unicode in
the way you suggest would be perversely ignorant, much like Bear's
example of a Scheme reader that only case-folded the letters from A..Z.
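To make "canonical-equivalent" concrete, here is a small sketch in Python (not Scheme, purely for illustration). It assumes Python's standard unicodedata module; the helper name canonically_equivalent is my own invention, not anything from the Unicode standard or a SRFI. It shows two spellings of the same letter that differ as code point sequences but that a conforming process should not treat as distinct.

```python
import unicodedata

# Two spellings of the same abstract character "é".
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after canonical decomposition.

    Two sequences are canonically equivalent when their NFD forms match.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# A raw code point comparison sees two different strings ...
raw_equal = composed == decomposed

# ... but the normalizing comparison treats them as the same text.
normalized_equal = canonically_equivalent(composed, decomposed)
```

A reader that compares identifiers the way raw_equal does is making exactly the assumption that C9 forbids.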

In other words, recognizing canonically equivalent character sequences
*is* the responsibility of the reader, if it claims to implement the
Unicode character set. If you view the combined converter and reader as
a single "program," then you might technically conform to the standard,
but it would be a perverse conformance, and therefore undesirable.
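As a rough sketch of what "the reader's responsibility" could look like, here is a toy symbol table, again in Python for illustration only. The SymbolTable class and its intern method are hypothetical names, and the choice of NFC as the internal form is my assumption; any one normalization form applied consistently would serve the same purpose.

```python
import unicodedata


class SymbolTable:
    """Toy symbol table for a hypothetical reader (illustration only)."""

    def __init__(self):
        self._symbols = {}

    def intern(self, name: str) -> str:
        # Normalize inside the reader, so that every canonical-equivalent
        # spelling of an identifier maps to the same symbol.
        # NFC is an assumed choice of internal form.
        key = unicodedata.normalize("NFC", name)
        return self._symbols.setdefault(key, key)


table = SymbolTable()
a = table.intern("caf\u00e9")    # precomposed spelling
b = table.intern("cafe\u0301")   # decomposed spelling of the same identifier
same_symbol = a is b
```

With the normalization inside intern, no external converter is needed, and source files may arrive in whichever form the author's editor happens to write.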
--
Bradd W. Szonye
http://www.szonye.com/bradd