Re: Surrogates and character representation

Re: Surrogates and character representation Tom Emerson 24 Jul 2005 20:18 UTC

Alan Watson writes:
> Using UTF-8 internally for a Scheme on a Plan 9 system is not obviously
> a bad idea. Sure, you don't have direct indexing, but you avoid
> conversion when you talk to the C library and OS.

True enough.

> Using UTF-16 internally doesn't give you direct indexing because of
> characters outside the BMP, but it might make sense on Windows boxes for
> precisely the same reason.

This is a valid point. Python took the view that, because UTF-16 is
used internally by default, direct indexing into a string may yield
one half of a surrogate pair. The feeling (as I remember, I may be
wrong) was that astral plane characters are rare enough that the
common case (i.e., BMP) should not be penalized.
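
To make the indexing problem concrete, here is a minimal sketch. It
assumes a current Python 3, where str is indexed by code point, so the
UTF-16 view is simulated by encoding to UTF-16 and unpacking the 16-bit
code units. The helper names utf16_code_units and is_surrogate are
made up for this illustration.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def is_surrogate(unit):
    """True if a 16-bit code unit is half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF

# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), 'b'
s = "a\U0001D11Eb"
units = utf16_code_units(s)

# Three characters, but four UTF-16 code units: index 1 lands on the
# high surrogate alone, which is not a character by itself.
print(len(s), len(units))
print([hex(u) for u in units])
print(is_surrogate(units[1]), is_surrogate(units[2]))
```

An implementation that indexes by code unit returns units[1] for
"character 1", which is only the first half of the G clef.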

> Using UCS-32 internally in these cases would involve translation to talk
> to the library and OS and would further make my emacs use about four
> times as much memory as it does now (which brings us back to the
> representation for infinity).

Yes, though the glibc folks decided that wchar_t should be a 4-byte
Unicode value. Python gives you the option of building with a 4-byte
or 2-byte "Unicode" character. (In Python, Unicode and "narrow"
strings are separate types.)
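
The memory trade-off Alan mentions can be sketched roughly as follows.
This only compares encoded byte counts using Python's standard codecs;
it says nothing about how any particular Scheme, emacs, or Python build
lays out strings in memory. The function name encoded_sizes and the
sample strings are invented for the example.

```python
def encoded_sizes(text):
    """Byte length of text in each encoding (no byte-order mark)."""
    return {
        name: len(text.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }

# Mostly-ASCII source text, the common case for an editor buffer.
ascii_text = "(define (square x) (* x x))"
ascii_sizes = encoded_sizes(ascii_text)

# A single character outside the BMP (U+1D11E).
astral_sizes = encoded_sizes("\U0001D11E")

print(ascii_sizes)
print(astral_sizes)
```

For pure ASCII text the 4-byte form is four times the size of UTF-8 and
twice the size of UTF-16, while a character outside the BMP costs four
bytes in all three.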

> In general, any single representation is a bad idea in some circumstances.

Absolutely.

--
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"