Re: the discussion so far bear (20 Jul 2005 02:45 UTC)
On Tue, 19 Jul 2005, John.Cowan wrote:

>bear scripsit:
>> Right; substrings that aren't valid strings, or which combine into
>> something that isn't the original string, can result when you split
>> grapheme clusters; This happens when you take substrings on arbitrary
>> codepoint boundaries, or do buffered operations on arbitrary codepoint
>> boundaries, or any of a number of other things.
>
>These things turn out not to be the case. They are true if you split
>strings on arbitrary *octet* or *code unit* boundaries, but if you
>stick to *codepoint* boundaries, they are not true. Any sequence of
>codepoints is a valid string, and no amount of taking apart and putting
>back together can change the validity or the interpretation of the string.

The particular example I'm thinking of is splitting strings between
base codepoint and combining codepoint. You get two substrings, and
the second one is syntactically invalid. If you print the first
substring and then the second, the combining codepoint is usually
printed as though it modified a space character that isn't actually
there. If something normalizes the substrings first, the space may
actually be added, although it wasn't present in the original string.

>The description of grapheme clusters in Unicode makes it clear that they
>are neither correct nor complete in all circumstances, just yet another
>global definition that provides a fairly good approximation.

In my opinion, they provide a *MUCH* better approximation to what
a "character" actually is than codepoints do.

> You do realize that there is a countable infinity of different grapheme
> clusters?

Yes. They're like integers that way - a useful type.

>> This introduces a distinction between text ports (which read and write
>> characters, full-stop) and binary ports (which read and write octets).
>> If you want to read or write characters on a binary port, you *SHOULD*
>> have to state explicitly what encoding to use.
>
>Indeed. That, however, has to do with encodings, not normalization forms.

Gah. Encodings, normalization forms, endianness, and all the rest
of it. When you want to write a "character" any of a dozen things
can happen.

				Bear
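
To make the base-plus-combining split discussed above concrete, here is a minimal sketch in Python (chosen only for illustration; the thread itself is about Scheme). It assumes Python 3.9 or later, where str is indexed by codepoint, and uses the standard unicodedata module. The helper names split_at and starts_with_combining_mark are made up for this example. It checks three things: whether rejoining the two pieces gives back the original codepoint sequence, whether the second piece begins with an orphaned combining mark, and whether normalizing the pieces separately gives the same result as normalizing the whole string.

```python
import unicodedata

def split_at(s: str, i: int) -> tuple[str, str]:
    """Split s at codepoint index i (Python str is indexed by codepoint)."""
    return s[:i], s[i:]

def starts_with_combining_mark(s: str) -> bool:
    """True if s begins with a codepoint whose canonical combining class is nonzero."""
    return bool(s) and unicodedata.combining(s[0]) != 0

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: five codepoints in total.
word = "cafe\u0301"

# Split between the base codepoint 'e' and its combining mark.
head, tail = split_at(word, 4)

# Rejoining on codepoint boundaries restores the original sequence.
rejoined_ok = (head + tail == word)

# The second piece starts with a combining mark that has no base.
tail_is_orphan = starts_with_combining_mark(tail)

# NFC of the whole string composes 'e' + U+0301 into U+00E9;
# NFC applied to each piece separately has nothing to compose.
whole_nfc = unicodedata.normalize("NFC", word)
piecewise_nfc = (unicodedata.normalize("NFC", head)
                 + unicodedata.normalize("NFC", tail))

print(rejoined_ok, tail_is_orphan, whole_nfc == piecewise_nfc)
```

Run as written, this prints True True False: the codepoints survive the round trip, but the second substring is a combining mark with no base, and normalizing before rejoining does not give the same string as normalizing afterwards. How an orphaned mark is displayed depends on the renderer and is not something this sketch can show.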