Re: Encodings. Ken Dickey 13 Feb 2004 06:51 UTC

On Thursday 12 February 2004 11:07 pm, Bradd W. Szonye wrote:
> On Thu, Feb 12, 2004 at 02:10:18PM +0100, Ken Dickey wrote:
> > Ah!  So a broken language (huge tables and complex processing) must be
> > defined to deal with broken tools which do not write out Unicode data
> > in a canonical format.
>
> ..., there's more
> than one canonical form. The "C" forms compose characters into the
> smallest number of code-points possible. The "D" forms decompose them
> into fully-general base+combining forms. Programs which disagree on the
> form of the I/O will need to translate between the two.
>
> > What about creating a tool which reads bizarre Unicode and writes it
> > out in a canonical format?  Then requiring portable Scheme programs to
> > pass through it?
>
> That wouldn't help unless they agree to write the *same* canonical
> format. Besides, this is just separating part of the reader's job into
> an external program, and in an error-prone way.

I think there is again confusion between processing Unicode data and reading
Scheme programs.

Let's say that there is a Scheme SRFI (or even, *GASP*, a standard) which
picks a single canonical Unicode form (say the most compact one) and
requires, where Unicode is used, that Scheme programs be prepared in that
format.  [And perhaps specify 'ascii/latin1/utf-8/ucs2/... parameters to open
the appropriate kind of input port].

This has essentially nothing to do with normalization and other processing of
Unicode data.

This means that a Scheme reader can use a fairly simple case-folding algorithm
(compared to "slice-em-dice-em kitchen knife" normalization algorithms) whose
table is compact [871 case-fold "exceptions" in Unicode 4] and hence leaves
implementations reasonably small.
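
As a rough illustration of what "simple case-folding plus a small exception
table" could look like, here is a minimal sketch in Python.  The names
FOLD_EXCEPTIONS, fold_char and fold_identifier are hypothetical, the table
holds only three illustrative entries, and treating an "exception" as "any
character whose fold differs from its plain lowercase mapping" is my
assumption, not something taken from the Unicode data files.

```python
# Hypothetical exception table: characters whose case fold is NOT the
# plain one-to-one lowercase mapping.  A real table would be generated
# from the Unicode case-folding data; these entries are illustrative only.
FOLD_EXCEPTIONS = {
    "\u00DF": "ss",      # LATIN SMALL LETTER SHARP S folds to "ss"
    "\u03C2": "\u03C3",  # GREEK SMALL LETTER FINAL SIGMA folds to SIGMA
    "\u017F": "s",       # LATIN SMALL LETTER LONG S folds to "s"
}

def fold_char(ch):
    """Fold one character: exception table first, else plain lowercase."""
    hit = FOLD_EXCEPTIONS.get(ch)
    return hit if hit is not None else ch.lower()

def fold_identifier(name):
    """Fold an identifier character by character, with no normalization."""
    return "".join(fold_char(ch) for ch in name)
```

With this, fold_identifier("Lambda") and fold_identifier("LAMBDA") give the
same string, and the reader never has to compose or decompose anything; it
only assumes its input already arrives in the agreed canonical form.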

I do not buy the argument that "this is just separating part of the reader's
job into an external program, and in an error-prone way."  I think that this
is keeping the reader manageable.  Saying you have to swallow the ocean to
process a stream is silly (and dangerous!).
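
For concreteness, the external pass could be as small as the sketch below,
assuming the agreed form is NFC (my reading of "the most compact one"; the
choice is an assumption).  The function names to_canonical and is_canonical
are hypothetical.  The big tables live in Python's standard unicodedata
module, i.e. in the tool, not in the Scheme reader.

```python
import unicodedata

def to_canonical(text, form="NFC"):
    """Rewrite text into the agreed canonical form (NFC assumed here)."""
    return unicodedata.normalize(form, text)

def is_canonical(text, form="NFC"):
    """True if text is already in the agreed form, so a reader could
    accept it as-is (or reject it otherwise) without normalizing."""
    return unicodedata.normalize(form, text) == text
```

A source file would be run through to_canonical once when it is prepared;
is_canonical is the cheap-to-state check a portability lint could apply.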

$0.02,
-KenD