On Fri, Feb 13, 2004 at 07:51:49AM +0100, Ken Dickey wrote:
>>> What about creating a tool which reads bizarre Unicode and writes it
>>> out in a canonical format? Then requiring portable Scheme programs to
>>> pass through it?
> Bradd W. Szonye wrote:
>> That wouldn't help unless they agree to write the *same* canonical
>> format. Besides, this is just separating part of the reader's job into
>> an external program, and in an error-prone way.
> I think there is again confusion between processing Unicode data and
> reading Scheme programs.
>
> Let's say that there is a Scheme SRFI (or even, *GASP*, a standard)
> which picks a single canonical Unicode form (say the most compact
> one) and requires, where Unicode is used, that Scheme programs be
> prepared in that format ....
Such a program would not conform to the Unicode standard:
    C9. A process shall not assume that the interpretations of two
    canonical-equivalent character sequences are distinct.
The standard goes on to concede that
    Ideally, an implementation would always interpret two
    canonical-equivalent character sequences identically. There are
    practical circumstances under which implementations may reasonably
    distinguish them.
For example, a program may implement an earlier version of the standard,
and therefore not recognize that newer sequences are supposed to be
canonically equivalent. However, a program that implemented Unicode in
the way you suggest would be perversely ignorant, much like Bear's
example of a Scheme reader that only case-folded the letters from A..Z.
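(For comparison, a rough sketch of case folding that is not limited to
A..Z. It assumes an R6RS-style (rnrs unicode) library with char-foldcase
and string-foldcase, which portable Scheme does not guarantee:

    (import (rnrs))

    ;; An A..Z-only reader would leave the accented letter untouched;
    ;; a Unicode-aware fold handles it like any other cased letter.
    (display (char-foldcase #\x00C9))     ; É (U+00C9) folds to é (U+00E9)
    (newline)
    (display (string-foldcase "SCHÉMA"))  ; => schéma
    (newline)
)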
In other words, recognizing canonically-equivalent character sequences
*is* the responsibility of the reader, if it claims to implement the
Unicode character set. If you view the combined converter and reader as a single
"program," then you might technically conform to the standard, but it
would be a perverse conformance, and therefore undesirable.
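To make C9 concrete, here is a small sketch: precomposed é and e plus a
combining acute differ as code-point sequences, but they are canonically
equivalent, so a conforming reader must not treat them as distinct. The
sketch assumes an R6RS-style (rnrs unicode) library with
string-normalize-nfc, which is not something portable Scheme guarantees:

    (import (rnrs))

    (define precomposed "\x00E9;")   ; U+00E9, e with acute, one code point
    (define decomposed  "e\x0301;")  ; U+0065 followed by U+0301 combining acute

    ;; The raw sequences differ ...
    (display (string=? precomposed decomposed)) (newline)      ; #f
    ;; ... but they are canonically equivalent, so they must be
    ;; interpreted identically; normalizing first makes that easy.
    (display (string=? (string-normalize-nfc precomposed)
                       (string-normalize-nfc decomposed)))     ; #t
    (newline)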
--
Bradd W. Szonye
http://www.szonye.com/bradd