Re: Surrogates and character representation Tom Emerson 24 Jul 2005 20:18 UTC
Alan Watson writes:
> Using UTF-8 internally for a Scheme on a Plan 9 system is not obviously
> a bad idea. Sure, you don't have direct indexing, but you avoid
> conversion when you talk to the C library and OS.

True enough.

> Using UTF-16 internally doesn't give you direct indexing because of
> characters outside the BMP, but it might make sense on Windows boxes for
> precisely the same reason.

This is a valid point. Python took the view that, since UTF-16 is used
internally by default, direct indexing into a string may yield one half
of a surrogate pair. The feeling (as I remember it; I may be wrong) was
that astral plane characters are rare enough that the common case (i.e.,
the BMP) should not be penalized.

> Using UCS-32 internally in these cases would involve translation to talk
> to the library and OS and would further make my emacs use about four
> times as much memory as it does now (which brings us back to the
> representation for infinity).

Yes, though the glibc folks decided that the wchar_t type should be a
4-byte Unicode value. Python gives you the option of building with a
4-byte or a 2-byte "Unicode" character. (In Python, Unicode strings and
"narrow" strings are separate types.)

> In general, any single representation is a bad idea in some circumstances.

Absolutely.

--
Tom Emerson                                   Basis Technology Corp.
Software Architect                          http://www.basistech.com
 "Beware the lollipop of mediocrity: lick it once and you suck forever"
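
To make the surrogate-pair point concrete, here is a minimal sketch that is not taken from the thread. It assumes a current Python 3, whose strings index by code point, so the 2-byte-build behaviour described above is emulated by encoding to UTF-16 and indexing the 16-bit code units. The helper names utf16_code_units and is_surrogate are made up for illustration.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    raw = text.encode("utf-16-le")
    # Each UTF-16 code unit is one unsigned 16-bit little-endian value.
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def is_surrogate(unit):
    """True if a 16-bit code unit is one half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF


# 'a', then U+1D11E (MUSICAL SYMBOL G CLEF, outside the BMP), then 'b'.
s = "a\U0001D11Eb"
units = utf16_code_units(s)

# Indexing by code unit, as a UTF-16 representation would, lands on
# surrogate halves at positions 1 and 2 instead of on a whole character.
halves = [i for i, u in enumerate(units) if is_surrogate(u)]

# Storage cost of the same three characters in each encoding, in bytes.
sizes = {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
```

The sizes dictionary illustrates the memory trade-off raised in the quoted text: the fixed-width 4-byte form is the largest of the three for this string, while the two variable-width forms give up direct indexing by character.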