Re: Unicode and Scheme Paul Schlie (08 Feb 2004 20:57 UTC)
Re: Unicode and Scheme (tweaked) Paul Schlie (08 Feb 2004 21:27 UTC)
Re: Unicode and Scheme (tweaked) Tom Lord (08 Feb 2004 22:23 UTC)
Re: Unicode and Scheme (tweaked) Paul Schlie (09 Feb 2004 02:13 UTC)

Re: Unicode and Scheme (tweaked) Tom Lord 08 Feb 2004 22:39 UTC


Hopefully this coming week the editors will agree to promote my
submission to draft status, at which point this discussion should
probably move there.  Nevertheless, the CHAR? and STRING? types _do_
play a significant role in FFIs, so I don't feel _too_ bad about
continuing this here for now.

    > -- refined earlier post --

    > Re: "This SRFI is based in part on the presumption that one should be able
    > to write a portable Scheme program which can accurately read and manipulate
    > source texts in any implementation, even if those source texts contain
    > characters specific to that implementation."

    > Personally, I believe it is a mistake to extend the interpretation of
    > Scheme's standard character type and its associated functions more
    > abstractly.  They are presently specified in such a way that their
    > implementation can be based on the host platform's native 8-bit byte
    > character encoding, which may be synonymous with the platform's raw
    > octet data interfaces.  That enables various Scheme implementations'
    > historical ability to manipulate raw data byte streams as character
    > sequences, which may actually encode whatever one needs them to; these
    > proposals begin to indirectly break that by prohibiting the ability to
    > maintain that equivalence, without offering an alternative.

My view:

Scheme does need a clear, clean way to manipulate octets.  SRFI-4
("Homogeneous numeric vector datatypes") helps in that regard.  An
SRFI of standard procedures for signed and unsigned octet arithmetic
would help.  I/O procedures would help.
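As a minimal sketch of what octet manipulation already looks like in an
implementation that provides SRFI-4 (the `buf` name and values are just
illustrative):

```scheme
;; SRFI-4 u8vectors: octet storage separate from STRING?.
(define buf (make-u8vector 4 0))   ; four octets, initialized to zero
(u8vector-set! buf 0 #xde)
(u8vector-set! buf 1 #xad)
(u8vector-ref buf 0)               ; => 222 (that is, #xde)
(u8vector-length buf)              ; => 4
```

Nothing here touches CHAR? or STRING? at all, which is the point.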

Scheme also needs a clear, clean way to manipulate source texts and
symbol names.  The requirements of "global programming" suggest to me
at least that these cannot possibly be octet-oriented.

Historically, and briefly (a few decades), we have lived through a
period built on a pun: characters fit in octets, and there was no
practical reason to distinguish the two.  But realistically, this is
just an accident -- not something fundamental that ought to be carved
into Scheme standards.

So there is a choice of which direction to take the CHAR? type: it
might take one fork and become "an octet" or another direction and
become "a character".  I don't see any reason why "character" isn't
the better choice.   We don't need a symbolic syntax for octets.   We
don't need case-independence for octets.   We do need characters for
reflective/metacircular programming.   CHAR? should be characters, and
characters, we should now recognize, are not octets.

    > However, it is likely true that scheme's character set and associated
    > function specification should be tightened up a little bit even in this
    > regard; so as feedback on this aspect of the proposals:

    > - character-set and lexical ordering could be improved along these lines:

    >   digit:        0 .. 9

    >   letter:       A a .. Z z           ;; where A a .. F f also hexdigits

That ordering of letters would be upward-incompatible with R5RS and
R4RS (at least); inconsistent with ASCII; inconsistent with Unicode;
and redundant with the -ci ordering predicates.
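Concretely, in an ASCII-ordered implementation (which R5RS permits and
most implementations use), the existing predicates already behave as
follows, and the proposed interleaving would contradict them:

```scheme
;; ASCII/Unicode codepoint order: #\A = 65 ... #\Z = 90, #\a = 97 ...
(char<? #\A #\Z)      ; => #t
(char<? #\Z #\a)      ; => #t in ASCII-ordered implementations
(char<? #\a #\B)      ; => #f here, but would be #t under the
                      ;    proposed interleaved A a B b ... ordering
(char-ci<? #\a #\B)   ; => #t -- the -ci predicates already provide
                      ;    case-independent ordering
```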

    >   symbol:       ( ) # ' ` , @ . "    ;; for consistency lexical ordering
    >                 ; $ % & * / : + -    ;; could/should be defined/improved
    >                 ^ _ ~ \ < = > ?
    >                 { } [ ] | !          ;; which should also be included

That would, again, be incompatible with R5RS, R4RS, ASCII, and Unicode.

    >   space:        space tab newline    ;; as well as other
    >   tab:        <unmapped-values>    ;; unspecified character/byte codes

I don't quite follow what you are suggesting or why it would improve anything.

    > - lexical ordering should be refined as above to be more typically useful:
    >   (char<? #\A #\a ... #\Z #\z) -> #t
    >   (char<? <digit> <letter> <symbol> <space> <other>) -> #t

Why would that be useful?  It would prevent Schemes from using the
"natural" orderings of ASCII, Unicode, and other character sets.

    > - only <letter> characters have different upper/lower case representations;
    >   all other character encodings, including those unspecified, are unaltered
    >   by upper-case, lower-case, and read/write-port functions:

    >   (char-upper-case? <digit> #\A..#\Z <symbol> <space> <other>) -> #t
    >   (char-lower-case? <digit> #\a..#\z <symbol> <space> <other>) -> #t

It would be a strange fiction to say that digits, punctuation, space,
or other non-letter characters are both upper and lower case, although
I don't think anything in my proposals precludes an implementation
from doing so.
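For comparison, here is how the existing predicates typically answer;
note that R5RS's signature names the argument "letter", so behavior on
non-letters is arguably left to the implementation (the digit and space
results below are what most implementations do, not a guarantee):

```scheme
(char-upper-case? #\A)      ; => #t
(char-lower-case? #\a)      ; => #t
(char-upper-case? #\5)      ; typically #f -- a digit is neither case
(char-lower-case? #\5)      ; typically #f
(char-upper-case? #\space)  ; typically #f
```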

    >   (char-upper-case? #\a..#\z) -> #f
    >   (char-lower-case? #\A..#\Z) -> #f

    >   (char=? (char-upper-case (char-lower-case x)) (char-upper-case x)) -> #t
    >   (char=? (char-lower-case (char-upper-case x)) (char-lower-case x)) -> #t

    >   for all x <letter> characters:
    >   (char=? (char-upper-case x) (char-lower-case x)) -> #f

    >   for all x non <letter> characters:
    >   (char=? x (char-upper-case x) (char-lower-case x)) -> #t

    >   for all x characters:
    >   (char-ci=? x (char-upper-case x) (char-lower-case x)) -> #t

There is sort of a general design choice here.  You can:

   1) define the case-mapping procedures of Scheme to have a
      certain highly regular "algebraic" structure, as you are
      proposing; or

   2) define the case-mapping procedures of Scheme so that they
      can reflect the structure of character sets people have
      designed to deal with the actual complexities of human
      writing.

I tend to think that the old "as simple as possible but no simpler"
line applies here and that, generally, (2) is preferred.  R5RS looks
to me as though, for CHAR-ALPHABETIC? characters, the authors intended
(2) but mistakenly assumed that (1) and (2) are mutually consistent.

Therefore, I've proposed that CHAR-ALPHABETIC? be retained as the
class of characters that (roughly) satisfies the kind of (1) you are
after, and that CHAR-LETTER? be introduced as a class consistent with
(2).  That is the only upward-compatible way forward that I see.
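A hedged sketch of how that split might behave (CHAR-LETTER? is the
proposed, not-yet-standard name; the #\ß notation and the exact
classification shown are illustrative, not specified anywhere):

```scheme
;; CHAR-ALPHABETIC?: letters whose case mappings are one-to-one,
;; so the regular (1)-style algebra holds:
(char-alphabetic? #\a)   ; => #t   (a <-> A, round-trips exactly)

;; CHAR-LETTER?: any letter in the full character set, per (2).
;; German sharp s is a letter, but Unicode upcases it to the
;; two-character sequence "SS", so it cannot satisfy the (1) algebra:
(char-letter? #\ß)       ; => #t   (proposed)
(char-alphabetic? #\ß)   ; => #f   (proposed)
```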

    > - all characters are assumed to be encoded as bytes using the host's
    >   native encoding representation, thereby enabling equivalence between
    >   the host's native raw byte data I/O and storage, and an implementation's
    >   character-set encoding.

I think a byte is simply an INTEGER? which is EXACT?, either in the
range -128..127 or 0..255.  What's missing are procedures to perform
suitable modular and 2's-complement binary operations on such integers.
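Such operations are easy to sketch over exact integers (the u8 helper
names below are illustrative, not from any SRFI):

```scheme
(define (u8+ a b) (modulo (+ a b) 256))   ; modular octet addition
(define (u8-not a) (- 255 a))             ; bitwise complement of an octet
(define (u8->s8 a)                        ; reinterpret 0..255 as -128..127
  (if (>= a 128) (- a 256) a))

(u8+ 200 100)    ; => 44
(u8-not #x0f)    ; => 240 (#xf0)
(u8->s8 #xff)    ; => -1
```

A real SRFI would presumably also want shifts, AND/OR/XOR, and the
signed variants, but the point is that none of this needs CHAR?.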

    > - portability of the native platform's encoded text is the responsibility
    >   of the host platform and/or other external utilities aware of the
    >   transliteration requirements between the various encoding
    >   formats.

Don't confuse encoding with character set.  There are, for example,
many ways to encode a given Unicode codepoint stream as a sequence of
octets -- but at the endpoints of two such streams, two Schemes should
agree about whether or not READ has returned EQ? symbols.
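To make the distinction concrete: one codepoint, several octet
encodings (the byte values below are the standard UTF-8, UTF-16, and
Latin-1 forms of that codepoint):

```scheme
;; The single character U+00E9 (e with acute accent) as octets:
;;   Latin-1  : #xE9
;;   UTF-8    : #xC3 #xA9
;;   UTF-16BE : #x00 #xE9
;; Whatever the wire encoding, two Schemes reading the same symbol
;; name at either end should agree:
;;   (eq? (read port-a) (read port-b))  ; ought to be #t
```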

    > - implementations which desire to support specific character set encoding
    >   which may require I/O port transliteration between scheme's presumed
    >   platform-neutral character/byte encodings and those of its native host,
    >   may do so by defining a collection of functions which map an arbitrary
    >   specific character set encoding into scheme's neutral character/byte
    >   sequences as required; and/or may extend the definition of standard
    >   function definitions as long as they do not alter the presumed neutrality
    >   and binary equivalence between scheme's character/byte data sequence
    >   representation and that of its host.

I don't follow you there.

    > (lastly, the notion of enabling scheme symbols to be composed of arbitrary
    >  extended character set characters, which may not be portably displayed
    >  or easily manipulated on arbitrary platforms, is clearly antithetical to
    >  achieving portability; so the suggestion should just be dropped.)

On the contrary.  Display and manipulation support in other tools is
catching up and will continue to.  There should be some Unicode SRFIs
precisely because that is the best bet for portability.

    > Although I know these views may not be shared by many, I don't believe
    > that scheme should be indirectly restricted to interfacing only with a
    > text-only world (regardless of its encoding).  I hope some will
    > recognize that these proposals begin to restrict the applicability of
    > scheme in just that way, without providing an alternative mechanism for
    > scheme to access and manipulate raw binary, which all truly flexible
    > programming languages with any legs must do; the computing world is a
    > tad larger than assuming that everything to be processed and interfaced
    > with is text encoded in some specific way.

I think you'll find that almost everyone agrees that Scheme needs
facilities for "raw binary" manipulation.   I think the mistake you
make is assuming that these facilities must be made the job of the
CHAR? and STRING? types.    Octets are integers, not characters.

-t