Case-mapping, Unicode & internationalisation shivers@xxxxxx (24 Jan 2000 13:37 UTC)
Re: Case-mapping, Unicode & internationalisation Sergei Egorov (24 Jan 2000 17:09 UTC)
text processes vs. string procedures shivers@xxxxxx (24 Jan 2000 21:52 UTC)
Re: text processes vs. string procedures Sergei Egorov (24 Jan 2000 22:39 UTC)
Re: text processes vs. string procedures shivers@xxxxxx (25 Jan 2000 01:19 UTC)
Re: text processes vs. string procedures Sergei Egorov 24 Jan 2000 22:41 UTC
Olin Shivers writes:
[...]
> - However, I think case-mapping and string-comparison are basic things, and
>   they can be given a generic, portable definition independent of the
>   underlying character encoding. Case-mapping does *not* require strings to be
>   well-formed text. ASCII, Latin-1 and Unicode all provide a clear,
>   language-independent definitions of this operation.
>
> I don't want the string library to be minimal. I want it to be useful.
> People -- many of whom currently program with Latin-1 or ASCII Schemes --
> case-map and compare strings frequently. These operations can be provided
> with an API which is portable across ASCII, Latin-1 and Unicode. So there's
> no barrier here.

I understand your concern; many people do use ASCII and Latin-1 case mapping and are happy with what they get from the good old char-upcase and char-downcase. I am not against char-upcase and char-downcase as long as their definition is limited to ASCII. Otherwise you will have to ignore three problems mentioned in the Unicode book:

- uppercase I may map to either i or dotless i (ı, in Turkish);
- the two uppercase letters SS may map to a single lowercase sharp s (ß) in German;
- and this thing with French é.

We are lucky that there are just three problems with case folding, but collation is *much* worse.

My suggestion would be to restrict char-upcase, char-downcase, and their derivatives to ASCII, and to specify explicitly that string>? and the other comparisons are based on mechanical code-point comparison that might not correspond to any 'natural' comparison in a real language. This approach makes the library reasonably useful, simple to implement, and really fast.

I believe that attempting to define a language-dependent interface to collation based on strings is wrong: collation works best when it deals with language-specific units larger than one character, and the 'text' abstraction suits this task much better.
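To make the suggestion concrete, here is a minimal sketch of ASCII-restricted case mapping and mechanical code-point comparison. It is written in Python rather than Scheme purely for illustration, and the names ascii_upcase, ascii_downcase and codepoint_less are hypothetical; they are not part of any proposed string library.

```python
def ascii_upcase(s: str) -> str:
    """Upcase only a-z; every other code point passes through unchanged."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


def ascii_downcase(s: str) -> str:
    """Downcase only A-Z; every other code point passes through unchanged."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def codepoint_less(a: str, b: str) -> bool:
    """Mechanical comparison: lexicographic order on code-point sequences.

    No language-specific collation is attempted (hypothetical helper).
    """
    return [ord(c) for c in a] < [ord(c) for c in b]
```

Under these definitions the result always has the same length as the input: ascii_upcase("straße") leaves the sharp s alone and gives "STRAßE", whereas a full Unicode mapping such as Python's own "straße".upper() expands it to "STRASSE". Likewise ascii_downcase("I") is always "i", never the Turkish dotless form. The comparison is equally mechanical: codepoint_less("Zebra", "apple") is true because "Z" (code point 90) precedes "a" (code point 97), which is not how a dictionary in any real language would order the two words.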