Re: the discussion so far John.Cowan 19 Jul 2005 19:07 UTC

bear scripsit:

> [...] specifying only the Right Thing [...]

Unicode is not about the Right Thing; it's about doing the best thing
possible in the circumstances.

> This problem is much less severe if your characters are grapheme
> clusters.  But thanks to Ligatures (which cannot appear in canonical
> strings) and eszett (which, unfortunately, can) it is not completely
> eliminated by moving to grapheme clusters.

There are four normalization forms (NFC, NFD, NFKC, and NFKD), and only
the two compatibility forms, NFKC and NFKD, which are also the two least
commonly used, remove ligatures.  The canonical forms do not remove
ligatures, with the sole exception of U+FB1F, HEBREW LIGATURE YIDDISH
YOD YOD PATAH.  The compatibility forms remove most ligatures, but not
all: some characters are called LIGATURE for historical reasons and do
not function as ligatures.
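
To make the distinction concrete, here is a minimal sketch in Python (a
language this thread does not otherwise use), assuming the standard
unicodedata module and whatever Unicode database version ships with the
interpreter.  The helper name normalize_all is made up for illustration.

```python
import unicodedata

FI_LIGATURE = "\uFB01"       # LATIN SMALL LIGATURE FI
YIDDISH_LIGATURE = "\uFB1F"  # HEBREW LIGATURE YIDDISH YOD YOD PATAH
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_all(text):
    """Return the text under each of the four normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def codepoints(text):
    """Render a string as a list of U+XXXX labels for printing."""
    return [f"U+{ord(c):04X}" for c in text]


fi = normalize_all(FI_LIGATURE)
yiddish = normalize_all(YIDDISH_LIGATURE)

for form in FORMS:
    # The fi ligature changes only under the two compatibility forms;
    # U+FB1F has a canonical decomposition, so NFD already splits it.
    print(form, codepoints(fi[form]), codepoints(yiddish[form]))
```

The canonical forms leave U+FB01 alone, the compatibility forms turn it
into the two letters "fi", and U+FB1F is taken apart even by NFD.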

> Right; substrings that aren't valid strings, or which combine into
> something that isn't the original string, can result when you split
> grapheme clusters; This happens when you take substrings on arbitrary
> codepoint boundaries, or do buffered operations on arbitrary codepoint
> boundaries, or any of a number of other things.

These things turn out not to be the case.  They are true if you split
strings on arbitrary *octet* or *code unit* boundaries, but if you
stick to *codepoint* boundaries, they are not true.  Any sequence of
codepoints is a valid string, and no amount of taking apart and putting
back together can change the validity or the interpretation of the string.
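
A small sketch of the difference, again in Python and under the
assumption that "valid" means "can be encoded and decoded without
error".  The helper names split_everywhere and decodes are hypothetical,
and the sample string is my own choice.

```python
def split_everywhere(units):
    """Yield every (head, tail) split of a sequence."""
    for i in range(len(units) + 1):
        yield units[:i], units[i:]


def decodes(octets, encoding="utf-8"):
    """Return True if the octets decode cleanly in the given encoding."""
    try:
        octets.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# e, COMBINING ACUTE ACCENT, MUSICAL SYMBOL G CLEF (outside the BMP)
text = "e\u0301\U0001D11E"

# Codepoint boundaries: each piece is itself encodable, and joining the
# pieces gives back exactly the original string.
codepoint_ok = all(
    decodes(head.encode("utf-8"))
    and decodes(tail.encode("utf-8"))
    and head + tail == text
    for head, tail in split_everywhere(text)
)

# Octet boundaries in UTF-8: some splits cut a multi-octet sequence.
utf8 = text.encode("utf-8")
bad_octet_splits = [
    len(head)
    for head, tail in split_everywhere(utf8)
    if not (decodes(head) and decodes(tail))
]

# Code unit boundaries in UTF-16: one split cuts the surrogate pair.
utf16 = text.encode("utf-16-le")
bad_unit_splits = [
    i // 2
    for i in range(0, len(utf16) + 1, 2)
    if not (decodes(utf16[:i], "utf-16-le") and decodes(utf16[i:], "utf-16-le"))
]

print(codepoint_ok, bad_octet_splits, bad_unit_splits)
```

Note that splitting between "e" and the combining accent does separate
a grapheme cluster, yet both halves remain well-formed and rejoin to the
same string.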

> But these are
> problems that go away if your characters are grapheme clusters.

The description of grapheme clusters in Unicode makes it clear that they
are neither correct nor complete in all circumstances; they are just one
more global definition that provides a fairly good approximation.

> I don't believe in this.  If you're going to limit it to ASCII,
> then 'ascii' ought to be in its name.

I agree.

> The thing is, if underspecified these operations will be
> nearly useless.  Portable code will be unable to rely on
> them doing any particular thing.

I agree with that too.

> It's my opinion that the only way to make normalization transparent to
> the programmer and user is to use grapheme-cluster characters instead
> of codepoint characters.  Normalization consists in altering codepoint
> sequences within grapheme clusters only; if this is your character
> unit, then it can be done without disrupting character indexes or
> counts, saving everyone a lot of headaches.

You do realize that there is a countable infinity of different grapheme
clusters?
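
One way to see this, sketched in Python: a base letter followed by any
number of combining marks still forms a single cluster, and each count
gives a different one.  Python's standard library has no grapheme
segmenter, so is_single_cluster_approx below is only a rough stand-in
(one non-mark codepoint followed by marks), not the full Unicode rule;
both helper names are made up for illustration.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"


def stacked(n):
    """Return the letter 'a' followed by n combining acute accents."""
    return "a" + COMBINING_ACUTE * n


def is_single_cluster_approx(text):
    """Rough check: one non-mark codepoint followed only by marks.

    This approximates, and does not implement, the Unicode grapheme
    cluster boundary rules.
    """
    if not text or unicodedata.category(text[0]).startswith("M"):
        return False
    return all(unicodedata.category(c).startswith("M") for c in text[1:])


clusters = [stacked(n) for n in range(50)]

# Normalizing does not collapse them: NFC composes at most the first
# accent with the base letter and leaves the rest in place.
normalized = {unicodedata.normalize("NFC", c) for c in clusters}

print(all(is_single_cluster_approx(c) for c in clusters), len(normalized))
```

The range of 50 is arbitrary; nothing stops n from growing without
bound, so a character type based on grapheme clusters cannot be a
finite enumeration.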

> One thing about normalization: Ligatures do not exist in normalized
> text, because they have canonical decompositions.

Not so; see above.

> As for conversion between different normalized forms, I think that the
> unicode normalization form is properly a property of the port through
> which data is read or written.  The port reads codepoints in some
> normalization form, and delivers _characters_ represented according to
> the abstraction you use internally.  Likewise, it accepts abstract
> characters and writes codepoints in some normalization form.

That's an interesting idea, but IMHO too radical at present.

> This introduces a distinction between text ports (which read and write
> characters, full-stop) and binary ports (which read and write octets).
> If you want to read or write characters on a binary port, you *SHOULD*
> have to state explicitly what encoding to use.

Indeed.  That, however, has to do with encodings, not normalization forms.
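
The two notions can be told apart with a short Python sketch (the sample
character is my own choice): a normalization form changes which
codepoints represent the text, while an encoding changes only which
octets represent a given codepoint sequence.

```python
import unicodedata

text = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE

# Normalization forms: same abstract text, different codepoint sequences.
nfc = unicodedata.normalize("NFC", text)  # one codepoint, U+00E9
nfd = unicodedata.normalize("NFD", text)  # two codepoints, U+0065 U+0301

# Encodings: same codepoint sequence, different octet sequences.
utf8 = nfc.encode("utf-8")
utf16 = nfc.encode("utf-16-be")

print([hex(ord(c)) for c in nfc], [hex(ord(c)) for c in nfd])
print(utf8.hex(), utf16.hex())

# Decoding undoes the encoding but never changes the normalization form.
same_after_decoding = utf8.decode("utf-8") == utf16.decode("utf-16-be") == nfc
print(same_after_decoding)
```

A port that is told only "UTF-8" therefore still has a free choice of
normalization form, and vice versa.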

--
John Cowan    http://www.ccil.org/~cowan   <xxxxxx@reutershealth.com>
    "Any legal document draws most of its meaning from context.  A telegram
    that says 'SELL HUNDRED THOUSAND SHARES IBM SHORT' (only 190 bits in
    5-bit Baudot code plus appropriate headers) is as good a legal document
    as any, even sans digital signature." --me