String comparison under Latin-1 and Unicode
Ben Goetter 10 Mar 2000 18:26 UTC
> ... collation and string comparison in the wide Unicode world today.
> If I can't come up with something reasonable that works in ASCII,
> Latin-1 *and* a Unicode setting
The STRING>? problem under Unicode differs from the problem under Latin-1
only in degree. (Finns and Swedes use a different collation sequence from
Danes and Norwegians. "AE" is a ligature in English, but a letter in its
own right in Danish. Modern Spanish, Traditional Spanish, and French each
sort differently. And much, much more.)
Hence even under Latin-1, STRING>? must take the language of the text into
account. Unicode merely makes more scripts - and so more languages -
convenient.
Proposal:
The string comparators take an optional final argument, not a string but a
value of a new type, language-specifier (abbrev. langid), which specifies
the language of a block of text. The procedure CURRENT-LANGUAGE returns the
langid for whatever language Scheme uses for string comparators lacking this
optional final argument. Scheme initially uses some default langid that it
inherits from its host environment; the procedure DEFAULT-LANGUAGE returns
the langid for this default. The procedures CALL-WITH-LANGUAGE langid proc
and WITH-LANGUAGE langid thunk change the value returned by
CURRENT-LANGUAGE. Finally, the procedure LANGUAGE takes an ISO 639
language code, specified as a string, and returns the corresponding langid.
LANGUAGE may be extended to take other values (perhaps a numeric language
code from the host OS).
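
A minimal sketch of how these pieces might fit together, written against
R7RS. Everything beyond the procedure names in the proposal is an assumption
made for illustration: the langid record layout, the comparator name
string<?/lang (standing in for STRING<? with the optional argument), the use
of make-parameter and parameterize to hold the current language, "sv" as a
stand-in for the host default, and above all the toy table, which only ranks
a few letters that sort after "z". A real implementation would consult full
collation data. CALL-WITH-LANGUAGE is left out because the proposal does not
say what its proc receives.

```scheme
(import (scheme base) (scheme char))

;; Hypothetical langid: an ISO 639 code plus a toy ordering of the
;; letters that this language sorts after "z".
(define-record-type <langid>
  (make-langid code tail)
  langid?
  (code langid-code)
  (tail langid-tail))

;; Toy data only, not real collation tables.
(define language-table
  (list (make-langid "sv" "åäö")
        (make-langid "da" "æøå")))

;; LANGUAGE: ISO 639 code (a string) -> langid.
(define (language code)
  (let loop ((entries language-table))
    (cond ((null? entries) (error "unknown language code" code))
          ((string=? code (langid-code (car entries))) (car entries))
          (else (loop (cdr entries))))))

;; Stand-in for "whatever the host environment says".
(define (default-language) (language "sv"))

;; CURRENT-LANGUAGE is a parameter object, so calling it with no
;; arguments returns the current langid.
(define current-language (make-parameter (default-language)))

;; WITH-LANGUAGE: run thunk with CURRENT-LANGUAGE rebound to lang.
(define (with-language lang thunk)
  (parameterize ((current-language lang)) (thunk)))

;; Letters in the language's tail rank above every code point, in tail
;; order; all other characters rank by code point.
(define (char-rank ch lang)
  (let loop ((i 0) (tail (string->list (langid-tail lang))))
    (cond ((null? tail) (char->integer ch))
          ((char=? ch (car tail)) (+ #x110000 i))
          (else (loop (+ i 1) (cdr tail))))))

;; Comparator with the optional final langid argument.
(define (string<?/lang a b . opt)
  (let ((lang (if (null? opt) (current-language) (car opt))))
    (let loop ((xs (string->list (string-downcase a)))
               (ys (string->list (string-downcase b))))
      (cond ((null? ys) #f)
            ((null? xs) #t)
            (else
             (let ((rx (char-rank (car xs) lang))
                   (ry (char-rank (car ys) lang)))
               (cond ((< rx ry) #t)
                     ((> rx ry) #f)
                     (else (loop (cdr xs) (cdr ys))))))))))
```

With the toy table, (string<?/lang "å" "ä") is #t under the "sv" default,
while (string<?/lang "å" "ä" (language "da")) is #f, as is the same
comparison run inside (with-language (language "da") (lambda () ...)).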
This would allow correct collation of text using the current Scheme notion
of "string." Building a higher-level "text" abstraction from this is purely
mechanical.
Ben