Re: text processes vs. string procedures
Sergei Egorov 24 Jan 2000 22:41 UTC
Olin Shivers writes:
[...]
> - However, I think case-mapping and string-comparison are basic things, and
> they can be given a generic, portable definition independent of the
> underlying character encoding. Case-mapping does *not* require strings to be
> well-formed text. ASCII, Latin-1 and Unicode all provide clear,
> language-independent definitions of this operation.
>
> I don't want the string library to be minimal. I want it to be useful.
> People -- many of whom currently program with Latin-1 or ASCII Schemes --
> case-map and compare strings frequently. These operations can be provided
> with an API which is portable across ASCII, Latin-1 and Unicode. So there's
> no barrier here.
I understand your concern; many people do use ASCII and Latin-1 case mapping
and are happy with what they get from the good old char-upcase and char-downcase.
And I am not against char-upcase and char-downcase as long as their definition
is limited to ASCII; otherwise you will have to ignore three problems
mentioned in the Unicode book: uppercase I may map to either i or dotless i
(in Turkish), two uppercase letters SS may map to a single lowercase
sharp s in German, and French é, whose uppercase form may or may not
keep its accent. We are lucky that
there are just three problems with case folding, but collation is
*much* worse. My suggestion would be to restrict char-upcase,
char-downcase, and their derivatives to ASCII and explicitly
specify that string>? and other comparisons are based on
mechanical code-point comparison that might not correspond
to any 'natural' comparison in a real language. This approach
makes the library reasonably useful, simple to implement, and
really fast. I believe that attempting to define a language-dependent
interface to collation based on strings is wrong: collation works
best when it deals with language-specific units larger than one
character, and the 'text' abstraction suits this task much better.
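
To make the suggestion concrete, here is a rough sketch of what I mean
(the names ascii-char-upcase and string-cp<? are invented just for
illustration, and I assume the usual code points where #\a is 97 and
#\A is 65); only standard R5RS procedures are used:

  ;; ASCII-only upcase: characters outside a-z are returned unchanged,
  ;; so no language-dependent decision ever has to be made.
  ;; (Illustrative name; assumes ASCII/Latin-1/Unicode code points.)
  (define (ascii-char-upcase c)
    (if (and (char<=? #\a c) (char<=? c #\z))
        (integer->char (- (char->integer c) 32))  ; #\a..#\z -> #\A..#\Z
        c))

  ;; Lexicographic comparison by raw code points: simple to implement
  ;; and really fast, but not a 'natural' ordering for any real language.
  (define (string-cp<? s1 s2)
    (let ((n1 (string-length s1))
          (n2 (string-length s2)))
      (let loop ((i 0))
        (cond ((= i n1) (< n1 n2))   ; s1 ran out first: s1 < s2 iff shorter
              ((= i n2) #f)          ; s2 is a proper prefix of s1
              (else
               (let ((c1 (char->integer (string-ref s1 i)))
                     (c2 (char->integer (string-ref s2 i))))
                 (cond ((< c1 c2) #t)
                       ((> c1 c2) #f)
                       (else (loop (+ i 1))))))))))

string>?, string<=? and the rest would be specified in the same
mechanical way; none of this pretends to give a natural ordering,
which is exactly the point.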