Re: the discussion so far bear 20 Jul 2005 02:45 UTC


On Tue, 19 Jul 2005, John.Cowan wrote:

>bear scripsit:

>> Right; substrings that aren't valid strings, or which combine into
>> something that isn't the original string, can result when you split
>> grapheme clusters.  This happens when you take substrings on arbitrary
>> codepoint boundaries, or do buffered operations on arbitrary codepoint
>> boundaries, or any of a number of other things.
>
>These things turn out not to be the case.  They are true if you split
>strings on arbitrary *octet* or *code unit* boundaries, but if you
>stick to *codepoint* boundaries, they are not true.  Any sequence of
>codepoints is a valid string, and no amount of taking apart and putting
>back together can change the validity or the interpretation of the string.

The particular example I'm thinking of is splitting a string
between a base codepoint and a combining codepoint.  You get two
substrings, and the second one is syntactically invalid: it
begins with a combining codepoint that has no base.  If you
print the first substring and then the second, the combining
codepoint is usually printed as though it modified a space
character that isn't actually there.  If something normalizes
the substrings first, the space may actually be added, although
it wasn't present in the original string.
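
To make the split concrete, here is a minimal Python sketch (Python is
used purely for illustration; the names `s`, `head`, `tail`, and
`starts_with_combining_mark` are hypothetical).  It only shows what the
standard `unicodedata` module reports: the raw halves concatenate back
to the original codepoint sequence, but the second half starts with a
combining mark, and normalizing the halves separately does not give the
same result as normalizing the whole string.  It does not demonstrate
the inserted-space behaviour mentioned above, which depends on the
particular renderer or normalizer.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT (two codepoints, one visible "character")
s = "e\u0301"

# Split on a codepoint boundary that falls between base and combining mark.
head, tail = s[:1], s[1:]

def starts_with_combining_mark(text):
    """True if the first codepoint has a non-zero canonical combining class."""
    return bool(text) and unicodedata.combining(text[0]) != 0

# Concatenating the raw halves restores the original codepoint sequence.
rejoined_raw = head + tail

# Normalizing the whole string composes the pair into U+00E9 ...
whole_nfc = unicodedata.normalize("NFC", s)

# ... but normalizing each half on its own cannot compose across the split.
pieces_nfc = unicodedata.normalize("NFC", head) + unicodedata.normalize("NFC", tail)

print(starts_with_combining_mark(tail))
print(rejoined_raw == s)
print(whole_nfc == pieces_nfc)
```

The design point is that "split, process each piece, rejoin" is only
safe if the per-piece processing never looks at neighbouring
codepoints; normalization is one operation that does.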

>The description of grapheme clusters in Unicode makes it clear that they
>are neither correct nor complete in all circumstances, just yet another
>global definition that provides a fairly good approximation.

In my opinion, they provide a *MUCH* better approximation to
what a "character" actually is than codepoints do.
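
A rough way to see the difference in counts, sketched in Python.  The
helper `crude_clusters` is hypothetical and deliberately simplistic: it
attaches every codepoint with a non-zero combining class to the
preceding codepoint.  That is only an approximation for illustration,
not the grapheme cluster rules Unicode actually defines.

```python
import unicodedata

def crude_clusters(text):
    """Group each base codepoint with the combining marks that follow it.

    Assumption: "combining" means a non-zero canonical combining class.
    Real grapheme cluster segmentation has more rules than this.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

# "résumé" written with decomposed accents: e + U+0301 twice
word = "re\u0301sume\u0301"

codepoint_count = len(word)
cluster_count = len(crude_clusters(word))

print(codepoint_count, cluster_count)
```

A reader would count six "characters" in this word; the codepoint count
is higher because each accent is stored as its own codepoint.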

> You do realize that there is a countable infinity of different grapheme
> clusters?

Yes.  They're like integers that way - a useful type.

>> This introduces a distinction between text ports (which read and write
>> characters, full-stop) and binary ports (which read and write octets).
>> If you want to read or write characters on a binary port, you *SHOULD*
>> have to state explicitly what encoding to use.
>
>Indeed.  That, however, has to do with encodings, not normalization forms.

Gah.  Encodings, normalization forms, endianness, and all the
rest of it.  When you want to write a "character", any of a dozen
things can happen.
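
As a small illustration of how many outcomes there are, the Python
sketch below writes the single character U+00E9 under two normalization
forms and three encodings.  The variable names are hypothetical, and
the choice of forms and encodings is just a sample, not an exhaustive
list.

```python
import unicodedata

ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

forms = ("NFC", "NFD")
encodings = ("utf-8", "utf-16-le", "utf-16-be")

# Map (normalization form, encoding) -> the octets that would be written.
outputs = {
    (form, enc): unicodedata.normalize(form, ch).encode(enc)
    for form in forms
    for enc in encodings
}

for (form, enc), octets in outputs.items():
    print(form, enc, octets.hex(" "))

# Every combination in this sample yields a different octet sequence.
distinct_outputs = len(set(outputs.values()))
```

This is why writing characters to a binary port needs the encoding (and
byte order) stated explicitly: the same "character" has no single
octet representation.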

				Bear