Re: text processes vs. string procedures
shivers@xxxxxx 25 Jan 2000 01:19 UTC
From: "Sergei Egorov" <xxxxxx@informaxinc.com>
I understand your concern; many people do use ASCII and Latin-1 case
mapping and are happy with what they get from the good old char-upcase and
char-downcase. And I am not against char-upcase and char-downcase as long
as their definition is limited to ASCII; otherwise you will have to ignore
three problems mentioned in the Unicode book: uppercase I may map to either
i or dotless i (in Turkish), two uppercase letters SS may map to a single
lowercase sharp s in German, and this thing with French \'e. We are lucky
that there are just three problems with case folding, but collation is
*much* worse. My suggestion would be to restrict char-upcase,
char-downcase, and their derivatives to ASCII and explicitly specify that
string>? and other comparisons are based on mechanical code-point
comparison that might not correspond to any 'natural' comparison in a real
language. This approach makes the library reasonably useful, simple to
implement, and really fast. I believe that attempting to define
a language-dependent interface to collation based on strings is wrong:
collation works best when it deals with language-specific units larger than
one character, and the 'text' abstraction suits this task much better.
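
A minimal sketch of what "mechanical code-point comparison" means in practice. It is written in Python rather than Scheme purely for illustration, and the helper name code_point_less is hypothetical, not part of any SRFI. Python's built-in string ordering happens to compare code points as well, so sorted() shows the same behaviour.

```python
def code_point_less(a: str, b: str) -> bool:
    """Hypothetical helper: lexicographic order on raw code points."""
    return [ord(c) for c in a] < [ord(c) for c in b]

# "\u00c4pfel" is "Äpfel"; the escape keeps the source ASCII-only.
words = ["zebra", "Apfel", "\u00c4pfel", "apple"]

# Python compares str values code point by code point, so this is the
# same order that code_point_less would produce.
by_code_point = sorted(words)

# Uppercase ASCII sorts before lowercase ASCII (Z is 90, a is 97).
upper_before_lower = code_point_less("Z", "a")

# U+00C4 (196) is greater than "z" (122), so the A-umlaut word lands
# after "zebra", unlike in a German dictionary.
umlaut_after_z = code_point_less("zebra", "\u00c4pfel")
```

The result is cheap and deterministic, which is the trade-off being proposed: speed and simplicity in exchange for an order that matches no natural language.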

Wait wait wait -- I am *not* proposing CHAR-UPCASE and CHAR-DOWNCASE.
These procedures are *not* part of SRFI-13. You are quite right -- they have
real problems with non-ASCII char encodings. What is in SRFI-13 is
    STRING-UPCASE
    STRING-DOWNCASE
    STRING-TITLECASE
These can handle the various issues involved in case-mapping text (e.g.,
upcasing German eszett, which expands to 2 chars; Greek sigma downcasing in a
context-dependent way; titlecasing compound chars like "fi" or "dz"). No
problem. Unicode TR 21 explains clearly and carefully how to do it for
Unicode.

Note also that I punted the side-effecting STRING-UPCASE! et al. because
of the one-char->two-char case mapping issues.
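
The cases above can be sketched as follows. This is an illustration in Python, not SRFI-13 code: it assumes Python's str.upper, str.lower, and str.title (which apply Unicode full case mappings) as stand-ins for STRING-UPCASE, STRING-DOWNCASE, and STRING-TITLECASE. Non-ASCII characters are written as escapes so the source stays ASCII-only.

```python
# German eszett: one char upcases to two, so the result is longer
# than the input and cannot be written back in place.
german = "stra\u00dfe"                 # "strasse" spelled with eszett
german_up = german.upper()
length_changed = len(german_up) != len(german)

# Greek capital sigma (U+03A3) downcases by context: U+03C2 at the
# end of a word, U+03C3 elsewhere.
greek = "\u039f\u0394\u039f\u03a3"     # capital omicron, delta, omicron, sigma
greek_down = greek.lower()
sigma_medial = "\u03a3\u039f\u03a3".lower()

# The "dz with caron" digraph has a distinct titlecase form:
# U+01C6 (lowercase) maps to U+01C5 (titlecase), not U+01C4 (uppercase).
digraph_title = "\u01c6".title()
```

Because the upcased string may not fit in the storage of the original, a char-by-char or in-place interface cannot express these mappings; a string->string one can.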

Your general point about these operations no longer being simply
char->char, but being string->string or text->text is right on the money.

However, I have nothing intelligent to say about collation and string
comparison in the wide Unicode world today. If I can't come up with something
reasonable that works in ASCII, Latin-1 *and* a Unicode setting, I'll punt the
string-comparison functions, which I think would be a huge blow to the
library.
-Olin