Re: Encodings. Ken Dickey 13 Feb 2004 16:21 UTC

On Friday 13 February 2004 11:07 pm, Bradd W. Szonye wrote:
>> Ken Dickey wrote:
> > Scheme does not IMPLEMENT Unicode.

> Bradd wrote:
> *Any* program that handles Unicode data implements Unicode! That
> includes Scheme compilers that support Unicode sources.

Ok.  Pick an example.  According to the docs, Gambit 3.0 supports Unicode.

But..

> (define great (string-ref "\x5927" 0)) ;; "(U+5927)"
> great
#\*** ERROR -- IO error on #<output-port (stdout)>

> >> In other words, recognizing canonically-equivalent characters *is*
> >> the responsibility of the reader, if it claims to implement the
> >> Unicode character set.
..
> > Who cares?
>
> Anybody who wants to claim that his compiler supports Unicode. It's a
> licensing issue. Unicode is a trademark, and you can't claim that you
> "support" Unicode unless you actually conform to the standard.

So does Gambit support Unicode or is the consortium going after somebody for
non-compliance?

While Gambit reads Unicode files, I don't believe it does normalization.

It does allow kanji identifiers:

(だ-bゅ 5) => 120

Does Gambit conform?
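To make the normalization question concrete, here is a minimal sketch of what a reader would have to do for two canonically equivalent spellings to name the same identifier. It is written in Python rather than Scheme because Python ships the Unicode tables in its standard `unicodedata` module. The helper name `normalize_identifier` is hypothetical, and the choice of NFC is an assumption, not something this thread has settled.

```python
import unicodedata

def normalize_identifier(name):
    """Map an identifier to a canonical form (NFC is assumed here)."""
    return unicodedata.normalize("NFC", name)

# The same visible identifier spelled two ways:
composed = "caf\u00e9"      # e-acute as a single code point, U+00E9
decomposed = "cafe\u0301"   # plain "e" followed by U+0301 COMBINING ACUTE ACCENT

# Without normalization a reader sees two different symbols.
print(composed == decomposed)

# After normalization both spellings intern to the same symbol.
print(normalize_identifier(composed) == normalize_identifier(decomposed))
```

A reader that skips this step accepts both spellings but treats them as distinct identifiers, which is the behaviour being questioned above.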

> > It is desirable that a Scheme with support for extended identifiers
> > should not be large or expensive to implement.
>
> Normalization is not difficult or expensive in a batch program like a
> compiler.

Huh?  There are plenty of small Scheme interpreters out there.  The binary for
TinyScheme is ~100KB.

There are plenty of interactive compilers out there.  I almost never use a
Scheme compiler in a batch mode unless I am (re)building a runtime system.

[Bad choice of words?]

> In particular, if you're carrying around the data for "Is this
> a letter or a number?" it's trivial to also provide the canonical
> compositions and decompositions. I don't know where you got the idea
> that it's expensive.

I think it is the "if you're carrying around the data for" part that I am
worried about.  Blocks are one thing, but I see that the Unihan.txt file is
25 MB and I am worried that large tables could double or triple the size of a
small Scheme implementation.
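One way to put a number on the table-size worry is to measure how far a per-character property compresses. The sketch below, again in Python for its bundled `unicodedata` tables, stores "is this a letter or a number?" as a sorted list of code points where the answer flips, then answers queries by binary search. The names `is_alnum`, `build_runs`, and `lookup` are hypothetical, and the run count depends on the Unicode version of the Python in use, so no figure is claimed here.

```python
import sys
import unicodedata
from bisect import bisect_right

def is_alnum(cp):
    # General categories beginning with L are letters, with N are numbers.
    return unicodedata.category(chr(cp))[0] in "LN"

def build_runs(limit=sys.maxunicode + 1):
    """Return the sorted code points at which the property changes value.

    The property is taken to be False below the first entry, so an odd
    number of flips at or before a code point means it is True there.
    """
    starts = []
    prev = False
    for cp in range(limit):
        cur = is_alnum(cp)
        if cur != prev:
            starts.append(cp)
            prev = cur
    return starts

def lookup(starts, cp):
    # Binary search instead of one table entry per code point.
    return bisect_right(starts, cp) % 2 == 1

starts = build_runs()
print(len(starts), "boundaries cover", sys.maxunicode + 1, "code points")
print(lookup(starts, ord("A")), lookup(starts, ord(" ")), lookup(starts, 0x5927))
```

The same range encoding applies to any property that is constant over long stretches of code points. Decomposition mappings need a different structure, so this covers only the classification half of the concern.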

> I suspect that you simply don't understand what a "Unicode
> implementation" is.

You are probably right.

I am currently hacking up some code (part time) to do this.  I should
understand after I have written the code.

So far I am only up to processing CaseFold.txt and generating things like:
(case-fold (integer->uni-char #x00DF)) ;=> (#\U+0073 #\U+0073)
(uni-char=? #\A (integer->uni-char (char->integer #\A))) ;=> #t
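For comparison, the CaseFold.txt step can be sketched in a few lines of Python. Each data line in the file has the form `code; status; mapping; # name`, where the status is C (common), F (full), S (simple), or T (Turkic-specific). The sample lines are a small excerpt in that format, and the names `parse_case_folding` and `case_fold` are hypothetical, not the procedures from the code described above.

```python
SAMPLE = """\
# A few lines in the CaseFold.txt format
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
FB01; F; 0066 0069; # LATIN SMALL LIGATURE FI
"""

def parse_case_folding(text, statuses=("C", "F")):
    """Build {code point: tuple of folded code points} for the given statuses.

    C plus F together give full case folding; T entries are skipped by default.
    """
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        if status in statuses:
            table[int(code, 16)] = tuple(int(cp, 16) for cp in mapping.split())
    return table

def case_fold(table, cp):
    # Code points without an entry fold to themselves.
    return table.get(cp, (cp,))

table = parse_case_folding(SAMPLE)
print([hex(cp) for cp in case_fold(table, 0x00DF)])
```

Folding U+00DF yields two code points, matching the `(#\U+0073 #\U+0073)` result shown above, which is why the fold has to return a sequence rather than a single character.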

Hey.  I am a slow learner.  I learn by doing.

Cheers,
-KenD