Re: Why are byte ports "ports" as such? John Cowan 23 May 2006 20:43 UTC

Per Bothner scripsit:

> A little knowledge is a dangerous thing ...

	A little Learning is a dang'rous Thing;
	Drink deep, or taste not the Pierian Spring:
	There shallow Draughts intoxicate the Brain,
	And drinking largely sobers us again.

		--Pope, "Essay on Criticism"

> You're contradicting yourself: I asked about a use-case for *character*
> as a separate *data type*.

The advantage of mapping the "character" datatype to Unicode default
grapheme clusters is that it insulates the programmer from the issues
around Unicode normalization.  The disadvantage is that there is a
countable infinity of possible DGCs, so no fixed-width value can
represent them all.
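
To make the normalization issue concrete, here is a minimal sketch in
Python (chosen only for illustration; it is not from the original thread).
It assumes the standard `unicodedata` module and uses "é" as the example:
the same grapheme cluster can be one codepoint or two, and a
codepoint-level comparison sees them as different.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one codepoint
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two codepoints

# Compared codepoint by codepoint, the two spellings differ.
raw_equal = composed == decomposed

# After normalizing both to NFC, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

print(len(composed), len(decomposed), raw_equal, nfc_equal)
```

A character datatype defined as a DGC would treat both spellings as one
character of length 1, which is the insulation described above.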

Not all languages have a distinct character datatype, however, and this
has real advantages in a Unicode world: you do not have to think about
just how strings are represented, any more than you have to think about
how bignums are.

> We know that.  However, there is still no need for "character" [in the
> Unicode sense] as a separate data type:

As I noted in my previous posting, "characters in the Unicode sense" is
not a well-defined notion.

> Code that works on compound characters *as a unit* can and should use a
> string type.  Code that needs to look *inside* a compound character,
> needs to work with codepoints.
> In Java, "character" is actually a Unicode code-point.  This is how it
> should be in Scheme, though we might want to replace the 16-bit size
> by a 20-bit size to avoid the complexities of surrogate characters.

Java uses 16-bit code units (not code points), not because the architects
didn't foresee the eventual use of the Astral Planes, but because they
judged the benefits of uniform width to outweigh the cost of handling
surrogate pairs by hand.  Java now has some standard library routines
that hide surrogates.
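
The difference between code units and code points can be sketched as
follows.  This is an illustration in Python, not Java; the helper names
`utf16_code_units` and `to_surrogates` are hypothetical, and U+1D11E
(MUSICAL SYMBOL G CLEF) is just a convenient Astral Plane example.

```python
def utf16_code_units(s: str) -> int:
    """Count the 16-bit code units needed to encode s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2

def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a codepoint above U+FFFF into its (high, low) surrogate pair."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

s = "a\U0001D11E"              # two codepoints, one of them outside the BMP
codepoints = len(s)            # Python strings count codepoints
units = utf16_code_units(s)    # a 16-bit representation needs one more unit
high, low = to_surrogates(0x1D11E)
print(codepoints, units, hex(high), hex(low))
```

Code that indexes by 16-bit unit sees three positions in this string,
two of which are halves of one codepoint.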

However, there are ways to keep uniform-width strings without sacrificing
the codepoint view, provided you are willing to give up on string
mutability (which Java does not have).  One well-known approach is to
store 8-bit code units for strings that contain no codepoint above U+00FF,
16-bit code units for strings that contain no codepoint above U+FFFF,
and 32-bit code units for all other strings.
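
A rough sketch of that width selection, again in Python and purely
illustrative: the names `code_unit_width`, `pack`, and `codepoint_at` are
my own, and the little-endian byte layout is an assumption, not something
any particular implementation is known to use.

```python
def code_unit_width(s: str) -> int:
    """Return 8, 16, or 32: the narrowest width that fits every codepoint."""
    top = max(map(ord, s), default=0)   # an empty string fits in 8 bits
    if top <= 0xFF:
        return 8
    if top <= 0xFFFF:
        return 16
    return 32

def pack(s: str) -> tuple[int, bytes]:
    """Store s as fixed-width code units; returns (bytes per unit, buffer)."""
    nbytes = code_unit_width(s) // 8
    return nbytes, b"".join(ord(c).to_bytes(nbytes, "little") for c in s)

def codepoint_at(packed: tuple[int, bytes], i: int) -> int:
    """Fetch the i-th codepoint with a single slice, no scanning."""
    nbytes, buf = packed
    return int.from_bytes(buf[i * nbytes:(i + 1) * nbytes], "little")

packed = pack("a\U0001D11E")
print(packed[0], len(packed[1]), hex(codepoint_at(packed, 1)))
```

Because the width is fixed when the string is created, storing a wider
codepoint into an existing string would force a re-encoding of the whole
buffer, which is why the scheme goes naturally with immutable strings.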

                Si hoc legere scis, nimium eruditionis habes.