collation algorithm bear (16 Jul 2005 08:52 UTC)
Re: collation algorithm John.Cowan (16 Jul 2005 16:04 UTC)

bear scripsit:

> The proposed semantics for collation of strings
> (using string>? & friends) by pointwise comparison
> is in direct conflict with the unicode standard
> for locale-independent collation of strings, as
> expressed in
>
> http://www.unicode.org/reports/tr10/

Note that the Unicode Collation Algorithm is not, strictly speaking,
part of the Unicode Standard; it is a separate Unicode Technical
Standard (UTS #10), and its ISO counterpart even has its own number
(ISO/IEC 14651 rather than 10646).  Conformance to the Unicode Standard
neither requires nor forbids conformance to the UCA.

> The unicode collation algorithm abstracts over
> representation issues such as how characters are
> rendered as sequences of individual codepoints,
> making the test for canonical (glyph) equivalence
> rather than codepoint equivalence.

(You're misusing the term "glyph"; see the Unicode Glossary.
I assume you mean something close to "grapheme".)
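
To make the distinction between codepoint equivalence and canonical
equivalence concrete, here is a minimal Java sketch.  The class and
method names are hypothetical, and comparing NFD-normalized forms is
only one simple way to test canonical equivalence; it is not the UCA
itself, which does much more.

```java
import java.text.Normalizer;

public class CanonicalEquivalenceDemo {
    // Codepoint-by-codepoint comparison (what String.equals does).
    static boolean codepointEqual(String a, String b) {
        return a.equals(b);
    }

    // Canonical equivalence: equal after normalizing both to NFD.
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFD);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFD);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u00e9";   // e-acute as a single codepoint
        String decomposed  = "e\u0301";  // e followed by combining acute
        System.out.println(codepointEqual(precomposed, decomposed));   // false
        System.out.println(canonicallyEqual(precomposed, decomposed)); // true
    }
}
```

The two strings render identically, yet only the second test treats
them as the same text.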

> Since I figure most language implementors will ignore
> it (and *are* ignoring it, in Java and C#) this part
> of the Unicode standard will probably eventually be
> abandoned.

That turns out not to be the case.  :-)

For Java, you can use either fast (binary) or smart (UCA) comparison
routines: the former are provided in the java.lang.String class, the
latter by java.text.Collator and related classes.  (The latter include the
UCA's provisions for tailoring collation order for specific locales: for
example, to make ä sort after z, as Swedes expect, rather than with a,
its normal place.)  UCA collation is also readily available for C and C++
programs via IBM's open-source ICU library.
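
As a rough illustration of the two Java routes, the sketch below sorts
the same words once with String's binary ordering and once with a
java.text.Collator.  The class name, method names, and word list are
made up for the example, and the Swedish result assumes the runtime
ships Swedish collation data, which not every installation does.  The
JDK collator is also rule-based rather than a certified UCA
implementation, so treat this as an approximation.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationDemo {
    // Fast (binary) ordering: String.compareTo compares UTF-16 code units.
    static String[] sortBinary(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Smart ordering: a Collator applies locale-tailored collation rules.
    static String[] sortWithCollator(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // "\u00e4" is a-umlaut, escaped to avoid source-encoding problems.
        String[] words = {"zebra", "\u00e4pple", "apple"};

        // Binary: a-umlaut (U+00E4) has a higher code unit than 'z'.
        System.out.println(Arrays.toString(sortBinary(words)));

        // English: a-umlaut sorts with 'a', just after the unaccented form.
        System.out.println(Arrays.toString(sortWithCollator(words, Locale.ENGLISH)));

        // Swedish tailoring, where available, is expected to put a-umlaut after 'z'.
        Locale swedish = Locale.forLanguageTag("sv-SE");
        System.out.println(Arrays.toString(sortWithCollator(words, swedish)));
    }
}
```

Note that the binary sort and the Swedish sort can agree on this
particular list for entirely different reasons: one by accident of
codepoint value, the other by deliberate tailoring.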

> At the same time, I want to leave it legal for
> scheme implementors who are actually doing unicode
> support to conform to it if they want to.

That can be done by leaving the *-ci? procedures alone and allowing
implementers to provide their own UCA-compliant procedures.

--
John Cowan      http://www.ccil.org/~cowan      xxxxxx@reutershealth.com
Be yourself.  Especially do not feign a working knowledge of RDF where
no such knowledge exists.  Neither be cynical about RELAX NG; for in
the face of all aridity and disenchantment in the world of markup,
James Clark is as perennial as the grass.  --DeXiderata, Sean McGrath