Re: Encodings. Bradd W. Szonye 14 Feb 2004 02:54 UTC

Bradd wrote:
>> *Any* program that handles Unicode data implements Unicode! That
>> includes Scheme compilers that support Unicode sources.

Ken Dickey wrote:
> Ok.  Pick an example.

Why? Any process that claims to support Unicode must conform to the
Unicode standard.

> According to the docs, Gambit 3.0 supports Unicode.
> But..
>
> > (define great (string-ref "\x5927" 0)) ;; "(U+5927)"
> > great
> #\*** ERROR -- IO error on #<output-port (stdout)>

I have no idea whether that indicates conformance or not. Is "\x5927"
valid Gambit syntax for the Unicode codepoint U+5927? Does the output
port use a Unicode encoding by default? If the answer to either question
is no, the example is meaningless.

>>> Who cares?

>> Anybody who wants to claim that his compiler supports Unicode. It's a
>> licensing issue. Unicode is a trademark, and you can't claim that you
>> "support" Unicode unless you actually conform to the standard.

> So does Gambit support Unicode or is the consortium going after
> somebody for non-compliance?

They might. I don't know what their enforcement policy is. I don't even
know for certain whether they have one (although that's usually how it
works when you trademark the name of the standard).

> While Gambit reads unicode files, I don't believe it does normalization.

I don't think normalization is required, but "reading Unicode files"
does demand that it recognize when graphemes are canonically identical.
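To make "canonically identical" concrete (a Python sketch, since the
Scheme examples in this thread are Gambit-specific): two byte-for-byte
different code-point sequences can be canonically equivalent, and a
conforming process has to treat them as the same text:

```python
import unicodedata

# U+00E9 (precomposed "e with acute") versus U+0065 U+0301
# ("e" followed by a combining acute accent): two different
# code-point sequences for the same grapheme.
composed = "\u00e9"
decomposed = "e\u0301"

# As raw code-point sequences they compare unequal...
print(composed == decomposed)  # False

# ...but normalizing both (to NFC here; NFD works equally well)
# shows they are canonically equivalent, which is exactly what a
# conforming Unicode process must recognize.
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))  # True
```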

> It does allow kanji identifiers
> ([kanji] 5) => 120
> Does Gambit conform?

That isn't nearly enough information to judge. And I don't know what
point you're trying to make here, but you're being extremely rude about
it. C'mon, you just asked a completely ridiculous question. You can't
judge conformance to a large standard from a small example like this,
unless the example demonstrates obvious *non*conformance. Why are you
being so antagonistic?

>> Normalization is not difficult or expensive in a batch program like a
>> compiler.

> Huh?  There are plenty of small Scheme interpreters out there.  The
> binary for TinyScheme is ~100KB.

Interpreters *are* compilers. They just target a software VM instead of
a hardware machine. See EOPL.

> There are plenty of interactive compilers out there.

"Batch" was a bad choice of words, perhaps. Anyway, processing Unicode
isn't any more difficult or expensive in an interactive process.

>> In particular, if you're carrying around the data for "Is this a
>> letter or a number?" it's trivial to also provide the canonical
>> compositions and decompositions. I don't know where you got the idea
>> that it's expensive.

> I think it is the "if you're carrying around the data for" part that I
> am worried about.  Blocks are one thing, but I see that the UniHan.txt
> file is 25 MB and I am worried that large tables could double or
> triple the size of a small Scheme implementation.

On many systems, the Scheme implementation doesn't need to carry the
data around. It's part of the operating system interface. If it isn't,
and that's a problem, then *don't implement Unicode.* But don't make a
half-assed implementation and claim that you "support" it.
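To give a sense of scale (again a Python sketch, not a claim about how
Gambit is built): the "is this a letter or a number?" data and the
canonical decompositions both come from the same Unicode Character
Database, so a runtime that already answers the first question has the
second nearby, without shipping anything like the 25 MB Unihan file:

```python
import unicodedata

ch = "\u5927"  # U+5927, the kanji from the example above

# "Is this a letter or a number?" -- the general category
# comes from the Unicode Character Database.
print(unicodedata.category(ch))  # 'Lo' (Letter, other)

# The canonical decomposition comes from the same database:
# U+5927 has none, while U+00E9 decomposes to U+0065 U+0301.
print(unicodedata.decomposition(ch))        # ''
print(unicodedata.decomposition("\u00e9"))  # '0065 0301'
```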

Look, if a terminal claimed to support ANSI X3.64, but it didn't honor
the clear-screen function, you'd call it a crappy, non-conforming
implementation, wouldn't you? It's exactly the same with a compiler that
claims to support Unicode but doesn't recognize when two encodings are
canonically equivalent. I don't know whether you're upset that I
pooh-poohed your idea or what, but you're being unreasonable and rude.
Please stop.
--
Bradd W. Szonye
http://www.szonye.com/bradd