Re: Why are byte ports "ports" as such? Per Bothner 23 May 2006 21:05 UTC

John Cowan wrote:
> Per Bothner scripsit:
>> A little knowledge is a dangerous thing ...
> 	A little Learning is a dang'rous Thing;
> 		--Pope, "Essay on Criticism"

Touché ...

>> We know that.  However, there is still no need for "character" [in the
>> Unicode sense] as a separate data type:
> As I noted in my previous posting, "characters in the Unicode sense" is
> not a well-defined notion.

Yes - and that's why I'm arguing against trying to model anything except
codepoints in Scheme.

> Java uses 16-bit code units (not code points), not because the architects
> didn't foresee the eventual use of the Astral Planes, but because the
> benefits of uniform width were deemed by them to outweigh the necessity
> of dealing with surrogate characters by hand.  Java now has some standard
> library routines that hide surrogate characters.

Unfortunately, the end result is somewhat complex, especially since 99%
of the time programmers can and will get away with ignoring non-basic-
plane characters.

> However, there are ways to keep uniform-width strings without sacrificing
> the codepoint view, provided you are willing to give up on string
> mutability (which Java does not have).  One well-known approach is to
> store 8-bit code units for strings that contain no codepoint above U+00FF,
> 16-bit code units for strings that contain no codepoint above U+FFFF,
> and 32-bit code units for all other strings.
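
For concreteness, the approach described above might be sketched like
this in Python: pick the narrowest uniform code unit that fits every
codepoint in the string.  The names pack_codepoints and codepoint_at
are made up for illustration, and the little-endian layout is an
arbitrary assumption, not something taken from any real implementation.

```python
def pack_codepoints(text):
    """Return (width, data): the text stored with uniform-width code units.

    width is 1, 2, or 4 bytes, chosen from the highest codepoint present.
    """
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        width = 1
    elif highest <= 0xFFFF:
        width = 2
    else:
        width = 4
    # Assumption: little-endian code units; any fixed byte order would do.
    data = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
    return width, data


def codepoint_at(width, data, index):
    """Constant-time codepoint lookup, possible because width is uniform."""
    if not 0 <= index < len(data) // width:
        raise IndexError("codepoint index out of range")
    start = index * width
    return int.from_bytes(data[start:start + width], "little")
```

Since the width depends on the highest codepoint in the whole string,
replacing one character could force the entire string to be re-encoded,
which is why the approach goes together with immutable strings.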

Personally, if I didn't have any compatibility constraints, I would just
store everything as a UTF-8 string, and allow indexing by code unit
(byte).  How often does non-library code need to deal with characters?
Instead, the data types should be (immutable) "string" and "buffer".
The latter allows insertions and deletions in addition to replacement.
How often are strings in the sense of mutable fixed-length character
arrays useful to application programmers, except as a low-level
"chunk of memory" to implement other data types?  Basically never,
or as close to never as to render them unsuitable for Scheme.
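
A rough sketch of what I mean by the two types, written in Python only
for illustration.  The class and method names (Utf8String, Buffer,
unit_at, and so on) are hypothetical and not a proposed API; the only
points are that the string is immutable and indexed by byte, and that
the buffer supports insertion and deletion as well as replacement.

```python
class Utf8String:
    """Immutable text stored as UTF-8 and indexed by code unit (byte)."""

    def __init__(self, text):
        self._units = text.encode("utf-8")  # bytes objects are immutable

    def __len__(self):
        return len(self._units)  # length in code units, not characters

    def unit_at(self, index):
        return self._units[index]  # constant-time access to one byte

    def substring(self, start, end):
        # Raises UnicodeDecodeError if an offset falls inside a character.
        return self._units[start:end].decode("utf-8")

    def text(self):
        return self._units.decode("utf-8")


class Buffer:
    """Mutable text: insertion, deletion, and replacement at byte offsets."""

    def __init__(self, text=""):
        self._units = bytearray(text.encode("utf-8"))

    def insert(self, pos, text):
        self._units[pos:pos] = text.encode("utf-8")

    def delete(self, start, end):
        del self._units[start:end]

    def replace(self, start, end, text):
        self._units[start:end] = text.encode("utf-8")

    def to_string(self):
        return Utf8String(self._units.decode("utf-8"))
```

Application code would normally get its offsets from searches or
matches rather than by counting characters, so byte offsets that always
land on character boundaries come for free.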

(Even parsers don't need to deal with characters, if you have
regular-expression lexing.  I.e. try to match a regular expression
at the current input position.  On success, return the matched
string, and move the position forwards.)
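
A minimal sketch of that lexing style, using Python's standard re
module.  The Lexer class and its match method are hypothetical names
chosen for this example; the only library feature relied on is that a
compiled pattern's match() accepts a starting position.

```python
import re


class Lexer:
    """Matches regular expressions at a current position in the input."""

    def __init__(self, source):
        self.source = source
        self.pos = 0  # current input position

    def match(self, pattern):
        """Try pattern at the current position.

        On success, return the matched string and advance the position.
        On failure, return None and leave the position unchanged.
        """
        m = re.compile(pattern).match(self.source, self.pos)
        if m is None:
            return None
        self.pos = m.end()
        return m.group(0)
```

For example, with the input "foo = 42", matching an identifier pattern
returns "foo", a following attempt to match digits fails without moving
the position, and the caller never looks at an individual character.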
	--Per Bothner