Re: text processes vs. string procedures
Sergei Egorov 24 Jan 2000 22:41 UTC
Olin Shivers writes:
[...]
> - However, I think case-mapping and string-comparison are basic things, and
> they can be given a generic, portable definition independent of the
> underlying character encoding. Case-mapping does *not* require strings to be
> well-formed text. ASCII, Latin-1 and Unicode all provide clear,
> language-independent definitions of this operation.
>
> I don't want the string library to be minimal. I want it to be useful.
> People -- many of whom currently program with Latin-1 or ASCII Schemes --
> case-map and compare strings frequently. These operations can be provided
> with an API which is portable across ASCII, Latin-1 and Unicode. So there's
> no barrier here.
I understand your concern; many people do use ASCII and Latin-1 case mapping
and are happy with what they get from the good old char-upcase and char-downcase.
And I am not against char-upcase and char-downcase as long as their definition
is limited to ASCII; otherwise you will have to ignore three problems
mentioned in the Unicode book: uppercase I may map to either i or dotless i
(in Turkish), two uppercase letters SS may map to a single lowercase
sharp s in German, and French é, whose uppercase form may or may not
keep its accent. We are lucky that
there are just three problems with case folding, but collation is
*much* worse. My suggestion would be to restrict char-upcase,
char-downcase, and their derivatives to ASCII and explicitly
specify that string>? and other comparisons are based on
mechanical code-point comparison that might not correspond
to any 'natural' comparison in a real language. This approach
makes the library reasonably useful, simple to implement, and
really fast. I believe that attempting to define a language-dependent
interface to collation based on strings is wrong: collation works
best when it deals with language-specific units larger than one
character, and the 'text' abstraction suits this task much better.
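
To make the suggestion concrete, here is a rough sketch of what I mean
(the names ascii-char-upcase and string-cp<? are invented just for
illustration, and I assume the usual code points where #\a is 97 and
#\A is 65); only standard R5RS procedures are used:

  ;; ASCII-only upcase: characters outside a-z are returned unchanged,
  ;; so no language-dependent decision ever has to be made.
  ;; (Illustrative name; assumes ASCII/Latin-1/Unicode code points.)
  (define (ascii-char-upcase c)
    (if (and (char<=? #\a c) (char<=? c #\z))
        (integer->char (- (char->integer c) 32))  ; #\a..#\z -> #\A..#\Z
        c))

  ;; Lexicographic comparison by raw code points: simple to implement
  ;; and really fast, but not a 'natural' ordering for any real language.
  (define (string-cp<? s1 s2)
    (let ((n1 (string-length s1))
          (n2 (string-length s2)))
      (let loop ((i 0))
        (cond ((= i n1) (< n1 n2))   ; s1 ran out first: s1 < s2 iff shorter
              ((= i n2) #f)          ; s2 is a proper prefix of s1
              (else
               (let ((c1 (char->integer (string-ref s1 i)))
                     (c2 (char->integer (string-ref s2 i))))
                 (cond ((< c1 c2) #t)
                       ((> c1 c2) #f)
                       (else (loop (+ i 1))))))))))

string>?, string<=? and the rest would be specified in the same
mechanical way; none of this pretends to give a natural ordering,
which is exactly the point.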