Re: text processes vs. string procedures shivers@xxxxxx 25 Jan 2000 01:19 UTC
    From: "Sergei Egorov" <xxxxxx@informaxinc.com>

    I understand your concern; many people do use ASCII and Latin-1
    case mapping and are happy with what they get from the good old
    char-upcase and char-downcase. And I am not against char-upcase and
    char-downcase as long as their definition is limited to ASCII;
    otherwise you will have to ignore three problems mentioned in the
    Unicode book: uppercase I may map to either i or dotless i (in
    Turkish), two uppercase letters SS may map to a single lowercase
    sharp s in German, and this thing with French \'e. We are lucky
    that there are just three problems with case folding, but collation
    is *much* worse.

    My suggestion would be to restrict char-upcase, char-downcase, and
    their derivatives to ASCII and explicitly specify that string>? and
    other comparisons are based on mechanical code-point comparison
    that might not correspond to any 'natural' comparison in a real
    language. This approach makes the library reasonably useful, simple
    to implement, and really fast.

    I believe that attempting to define a language-dependent interface
    to collation based on strings is wrong: collation works best when
    it deals with language-specific units larger than one character,
    and the 'text' abstraction suits this task much better.

Wait wait wait -- I am *not* proposing CHAR-UPCASE and CHAR-DOWNCASE.
These procedures are *not* part of SRFI-13. You are quite right -- they
have real problems with non-ASCII char encodings.

What is in SRFI-13 is

    STRING-UPCASE
    STRING-DOWNCASE
    STRING-TITLECASE

These can handle the various issues involved in case-mapping text (e.g.,
upcasing German eszett expanding to 2 chars, Greek sigma downcasing in a
context-dependent way, titlecasing compound chars like "fi" or "dz").
No problem. Unicode TR 21 explains clearly and carefully how to do it
for Unicode.

Note also that I punted the side-effecting STRING-UPCASE! et al. because
of the one-char->two-char case mapping issues.
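To make the one-char->two-char and context-dependent cases concrete, here is a small sketch. It is written in Python rather than Scheme purely as an assumption of convenience: Python's built-in `str.upper` and `str.lower` apply the full Unicode case mappings, so they stand in for what STRING-UPCASE and STRING-DOWNCASE would have to do. It is an illustration, not SRFI-13 code.

```python
# Illustration only: Python's str methods stand in for STRING-UPCASE and
# STRING-DOWNCASE. Case mapping here is string -> string, not char -> char.

# German sharp s: one character upcases to two, so the result is longer
# than the input and cannot be written back into the same string in place.
strasse = "straße"
strasse_up = strasse.upper()
print(strasse, len(strasse), "->", strasse_up, len(strasse_up))

# Greek sigma: the lowercase form depends on position in the word.
# A capital sigma inside a word becomes σ, at the end of a word it becomes ς.
odysseus = "ΟΔΥΣΣΕΥΣ"
odysseus_down = odysseus.lower()
print(odysseus, "->", odysseus_down)

# A naive per-character mapping sees each sigma in isolation, so it cannot
# pick the word-final form.
per_char_down = "".join(ch.lower() for ch in odysseus)
print("per-character:", per_char_down)
```

The length change in the first example is the reason a side-effecting, fixed-length STRING-UPCASE! is awkward to specify; the third example shows why the mapping cannot be built out of a char->char procedure.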
Your general point about these operations no longer being simply
char->char, but being string->string or text->text is right on the
money.

However, I have nothing intelligent to say about collation and string
comparison in the wide Unicode world today. If I can't come up with
something reasonable that works in ASCII, Latin-1 *and* a Unicode
setting, I'll punt the string-comparison functions, which I think would
be a huge blow to the library.
    -Olin
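As a rough illustration of the "mechanical code-point comparison" suggested in the quoted message, the sketch below sorts a few words by raw code point and then by a case-folded key. It is in Python as an assumption of convenience, and the word list is made up for the example. Neither ordering is a real language-specific collation; the point is only that code-point order is cheap and well defined but need not match what a reader of any particular language expects.

```python
# Illustration only: code-point ordering versus a case-insensitive ordering.
# The word list is hypothetical; neither result is a true locale collation.
words = ["Zebra", "apple", "Äpfel"]

# Plain comparison orders strings by code point: every ASCII uppercase
# letter sorts before every ASCII lowercase letter, and Ä sorts after both.
by_code_point = sorted(words)
print(by_code_point)

# Folding case first removes the upper/lower split, but Ä still sorts
# after z, which is not where a German dictionary would place it.
by_casefold = sorted(words, key=str.casefold)
print(by_casefold)
```

Placing "Äpfel" next to "apple" would need language-specific collation rules, which is the part of the problem left open above.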