

Re: Unicode and Scheme (tweaked) Paul Schlie 09 Feb 2004 02:13 UTC

> Tom Lord <xxxxxx@emf.net> wrote:
>
> Hopefully this coming week the editors will agree to promote my
> submission to draft status, at which point this discussion should
> probably move there.  Nevertheless, the CHAR? and STRING? types _do_
> play a significant role in FFIs so I don't feel _too_ bad about
> continuing this here for now.
>
>> -- refined earlier post --
>
>> Re: "This SRFI is based in part on the presumption that one should be able
>> to write a portable Scheme program which can accurately read and manipulate
>> source texts in any implementation, even if those source texts contain
>> characters specific to that implementation."
>
>> Personally, I believe that it's a mistake to attempt to more abstractly
>> extend the interpretation of Scheme's standard character type and
>> associated functions, which are presently specified in such a way as to
>> enable their implementation to be based on the host platform's native
>> 8-bit byte character encoding, which may be synonymous with the
>> platform's raw octet data interfaces (thereby enabling various Scheme
>> implementations' historical ability to manipulate raw data byte streams
>> as character sequences, which may actually encode whatever one needs
>> them to; these proposals begin to indirectly break that by prohibiting
>> the ability to maintain that equivalence, without offering an
>> alternative).
>
> My view:
>
> Scheme does need a clear, clean way to manipulate octets.  SRFI-4
> ("Homogeneous numeric vector datatypes") helps in that regard.   An SRFI
> of standard procedures for signed and unsigned octet arithmetic might
> help.   I/O procedures would help.

- Homogeneous numeric vectors are a step in that direction, but are useless
  for generalized raw data stream processing without the ability to read or
  write their raw values through an I/O port unaltered, which the SRFI
  doesn't seem to address; ironically, this is presently only achievable by
  assuming that character I/O can serve as the conduit for the binary data
  intermediately stored in homogeneous vectors for processing.

  (This could be largely remedied by extending the semantics of ports to
   understand the notion of raw data streams, which could then become the
   basis upon which all I/O operations are built, including but not limited
   to encoded Unicode code points; that, for good or bad, is what byte
   oriented characters and strings presently provide, and therefore they
   can't be morphed into something more abstract without first defining a
   new basis for binary data I/O.)
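
  For illustration, a minimal sketch of the character-I/O conduit described
  above, assuming an implementation in which characters are single bytes
  and (char->integer c) yields the raw octet value -- exactly the pun these
  proposals would break:

  ;; read up to N raw octets from a file into a vector, using character
  ;; I/O as the conduit (a SRFI-4 u8vector would serve equally well)
  (define (read-octets filename n)
    (let ((port (open-input-file filename))
          (vec  (make-vector n 0)))
      (let loop ((i 0))
        (if (< i n)
            (let ((c (read-char port)))
              (if (eof-object? c)
                  (begin (close-input-port port) vec)
                  (begin (vector-set! vec i (char->integer c))
                         (loop (+ i 1)))))
            (begin (close-input-port port) vec)))))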

> Scheme also needs a clear, clean way to manipulate source texts and
> symbol names.  The requirements of "global programming" suggest to me
> at least that these can not possibly be octet-oriented.
>
> Historically, briefly (a few decades), we've lived through a period in
> which there was a pun in which characters fit in octets and there was
> no practical reason to distinguish them.   But realistically, this is
> just an accident -- not something fundamental that ought to be carved
> into Scheme standards.
>
> So there is a choice of which direction to take the CHAR? type: it
> might take one fork and become "an octet" or another direction and
> become "a character".  I don't see any reason why "character" isn't
> the better choice.   We don't need a symbolic syntax for octets.   We
> don't need case-independence for octets.   We do need characters for
> reflective/metacircular programming.   CHAR? should be characters, and
> characters, we should now recognize, are not octets.

- My basic point is that because byte oriented character I/O is all that
  Scheme has (which isn't terrible, as everything can be encoded within
  native character oriented byte streams, including Unicode code points),
  a standardized alternative basis for arbitrary native/raw data I/O,
  storage, and access must be defined prior to (or minimally coincident
  with) the further abstraction of Scheme characters beyond simple bytes.

>> However, it is likely true that Scheme's character set and associated
>> function specification should be tightened up a little bit even in this
>> regard; so as feedback on this aspect of the proposals:
>
>> - character-set and lexical ordering could be improved along these lines:
>
>>   digit:        0 .. 9
>
>>   letter:       A a .. Z z           ;; where A a .. F f also hexdigits
>
> That ordering of letters would be upwards incompatible with R5RS and
> R4RS (at least); inconsistent with ASCII; inconsistent with Unicode;
> and redundant with the -ci ordering predicates.

- Irrelevant; Scheme's character set is an abstraction, and adopting a more
  useful lexical ordering shouldn't be a function of a particular encoding.

>
>>   symbol:       ( ) # ' ` , @ . "    ;; for consistency lexical ordering
>>                 ; $ % & * / : + -    ;; could/should be defined/improved
>>                 ^ _ ~ \ < = > ?
>>                 { } [ ] | !          ;; which should also be included

> That would, again, be incompatible with R5RS, R4RS, ASCII, and Unicode.

- I presume you're referring to ordering, which should be independent of
  encoding; it would have been nicer if they were more similar, but I don't
  suspect most folks would ever expect "Zebra" to sort before "apple", for
  example. (Nor do I suspect that the folks who defined ASCII meant to
  redefine the lexical ordering of the Roman alphabet.)
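
  For illustration, a sketch of the case-interleaved "A a .. Z z" ordering
  argued for above, built on the standard predicates rather than on any
  particular encoding (though it assumes the uppercase letters map to
  contiguous char->integer values, as in ASCII; alpha-key and alpha<? are
  hypothetical names, not proposed standards):

  (define (alpha-key c)
    (if (char-alphabetic? c)
        (+ (* 2 (- (char->integer (char-upcase c)) (char->integer #\A)))
           (if (char-upper-case? c) 0 1))
        (+ 52 (char->integer c))))      ;; non-letters sort after letters

  (define (alpha<? a b)
    (< (alpha-key a) (alpha-key b)))

  ;; (alpha<? #\a #\Z) -> #t, so "apple" now sorts before "Zebra"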

>>   space:        space tab newline   ;; as well as other
>>   tab:        <unmapped-values>     ;; unspecified character/byte codes
>
> I don't quite follow what you are suggesting or why it would improve anything.

- if you're referring to the last line, which was originally:

     other:        <unmapped-values>    ;; unspecified character/byte codes

  it was an attempt to enable the specification of the behavior of functions
  on unspecified character values (i.e. Scheme only specifies ~96 of the 256
  possible characters which could be encoded within a byte), so for example:

  for all x non <letter> characters:
  (char=? x (char-upper-case x) (char-lower-case x)) -> #t

  specifies that case conversion functions don't alter any encoded character
  values, other than possibly for <letter> characters, by making <other>
  unmapped/unspecified characters inclusive in the definition.
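
  Restated with the R5RS names char-upcase and char-downcase (standing in
  for the char-upper-case/char-lower-case conversions used above), the
  property is that case conversion is the identity on every non-<letter>
  character:

  (define (case-neutral? c)
    (and (char=? c (char-upcase c))
         (char=? c (char-downcase c))))

  ;; (case-neutral? #\7) -> #t    (case-neutral? #\a) -> #f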

>> - lexical ordering should be refined as above to be more typically useful:
>>   (char<? #\A #\a ... #\Z #\z) -> #t
>>   (char<? <digit> <letter> <symbol> <space> <other>) -> #t
>
> Why would that be useful?  It would prevent Schemes from using the
> "natural" orderings of ASCII, Unicode, and other character sets.

- It's only useful if one actually wanted to alphabetize; alternatively,
  how is it useful to have (char<? #\Z #\a) -> #t ?

>> - only <letter> characters have different upper/lower case representations;
>>   all other character encodings, including those unspecified, are unaltered
>>   by upper-case, lower-case, and read/write-port functions:
>
>>   (char-upper-case? <digit> #\A..#\Z <symbol> <space> <other>) -> #t
>>   (char-lower-case? <digit> #\a..#\z <symbol> <space> <other>) -> #t
>
> It would be a strange fiction to say that digits, punctuation, space,
> or other non-letter characters are both upper and lower case, although
> I don't think anything in my proposals precludes an implementation
> from doing so.

- I guess if they're considered case neutral, then alternatively:

  (char-upper-case? <digit>|<symbol>|<space>|<other>) -> #f
  (char-lower-case? <digit>|<symbol>|<space>|<other>) -> #f

  as they're neither upper nor lower case; either way I don't suspect it
  makes much difference, as case functions are only useful on <letter>s.

>>   (char-upper-case? #\a..#\z) -> #f
>>   (char-lower-case? #\A..#\Z) -> #f
>
>>   (char=? (char-upper-case (char-lower-case x)) (char-upper-case x)) -> #t
>>   (char=? (char-lower-case (char-upper-case x)) (char-lower-case x)) -> #t
>
>>   for all x <letter> characters:
>>   (char=? (char-upper-case x) (char-lower-case x)) -> #f
>
>>   for all x non <letter> characters:
>>   (char=? x (char-upper-case x) (char-lower-case x)) -> #t
>
>>   for all x characters:
>>   (char-ci=? x (char-upper-case x) (char-lower-case x)) -> #t
>
> There is sort of a general design choice.   You can:
>
>       1) define the case-mapping procedures of Scheme to have a
>          certain highly regular "algebraic" structure, as you
>          are proposing.
>
>       2) define the case-mapping procedures of Scheme so that they
>          can reflect the structure of character sets people have
>          designed to deal with the actual complexities of human
>          writing.
>
> I tend to think that the old "as simple as possible but no simpler"
> line applies here and that, generally, (2) is preferred.   R5RS looks
> to me like, for CHAR-ALPHABETIC? characters, the authors intended (2)
> but made the mistake of assuming that (2) and (1) are mutually
> consistent.
>
> Therefore, I've proposed that CHAR-ALPHABETIC? be retained as the
> class of characters that (roughly) satisfy the kind of (1) you are
> after;  and CHAR-LETTER? be introduced that is consistent with (2).
> That's the only upward compatible way I see forward.

- I admit I don't see that (1) and (2) are mutually exclusive; if one
  wanted to define a more complete set of character categories as they
  are relevant to Scheme language parsing (which, in all honesty, is what
  I believe Scheme's character set & functions are predominantly useful
  for, beyond basic English plain text processing), then I'd guess that a
  few more convenient, non-exclusive character categories would be useful
  to define, such as those below; but for English text processing, being
  able to differentiate between letters, digits, spaces, and symbols seems
  sufficient for most things.

  Slightly more useful character classifications for Scheme parsing:

  <digit-hex>   0 1 2 3 4 5 6 7  ;; which, by the way, represents
                8 9 A a B b C c  ;; my thoughts on the lexical
                D d E e F f      ;; ordering of the first 22 chars

  <symbol-id>   ~ @ # $ % ^ & *  ;; as <symbol> characters which may
                _ - + = | \ : /  ;; also be used in scheme identifiers.
                < > ?
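
  For illustration, both classifications are expressible as simple
  membership predicates over the character lists above, independent of
  any encoding (digit-hex? and symbol-id? are hypothetical names):

  (define (digit-hex? c)
    (and (memv c (string->list "0123456789AaBbCcDdEeFf")) #t))

  (define (symbol-id? c)
    (and (memv c (string->list "~@#$%^&*_-+=|\\:/<>?")) #t))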

>> - all characters are assumed to be encoded as bytes using the host's
>>   native encoding representation, thereby enabling equivalence between
>>   the host's native raw byte data I/O and storage, and an implementation's
>>   character-set encoding.
>
> I think a byte is simply an INTEGER? which is EXACT? --- either in the
> range -128..127 or 0..255.   What's missing are procedures to perform
> suitable modular and 2s-complement binary operations on such integers.

- Although a nit, I tend to think of bytes as being unsigned 0..255; but
  regardless, although (char->integer) and vice versa are defined, I
  personally subscribe to the notion that arithmetic operations should
  have been defined such that (+ #\203 #\002) -> #\205 modulo 256, for
  example. (And of course Scheme's lack of binary bit operations isn't
  great either.)
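
  A minimal sketch of the modular octet arithmetic wished for above, on
  the unsigned 0..255 reading of bytes (byte+ is a hypothetical name, and
  characters go through char->integer and back rather than the #\203-style
  literals used above):

  (define (byte+ a b)
    (modulo (+ a b) 256))

  ;; (byte+ 203 2) -> 205
  ;; (byte+ 254 3) -> 1, wrapping modulo 256
  ;; character form: (integer->char (byte+ (char->integer c) 2))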

>> - portability of the native platform's encoded text is the responsibility
>>   of the host platform and/or other external utilities aware of the
>>   transliteration requirements between the various encoding
>>   formats.
>
> Don't confuse encoding with character-set.  There are, for example,
> many ways to encode a given Unicode codepoint stream as a sequence of
> octets -- but at the endpoints of two such streams, two Schemes should
> agree about whether or not READ has returned EQ? symbols.

- I don't think I have; correspondingly, don't confuse encoding with
  lexical ordering, or assume that by picking a standard encoding you've
  got portable code. (And I appreciate that if one were to build Unicode
  support on top of plain old character byte sequences/strings, a number
  of Unicode specific support functions would need to be defined. I
  actually suspect plain old vectors would work just fine for the storage
  and manipulation of Unicode code points, which could then be
  transliterated to/from plain old character byte I/O sequences through
  plain old ports. :)
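
  A sketch of that suggestion: code points held in a plain vector,
  transliterated to UTF-8 and written out through ordinary character I/O.
  It assumes byte-sized characters, and covers only code points below
  U+10000 to keep the sketch short:

  (define (write-code-point cp port)
    (define (put b) (write-char (integer->char b) port))
    (cond ((< cp #x80)                        ;; 1-byte form
           (put cp))
          ((< cp #x800)                       ;; 2-byte form
           (put (+ #xC0 (quotient cp 64)))
           (put (+ #x80 (remainder cp 64))))
          (else                               ;; 3-byte form
           (put (+ #xE0 (quotient cp 4096)))
           (put (+ #x80 (remainder (quotient cp 64) 64)))
           (put (+ #x80 (remainder cp 64))))))

  (define (write-code-points vec port)
    (do ((i 0 (+ i 1)))
        ((= i (vector-length vec)))
      (write-code-point (vector-ref vec i) port)))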

>> - implementations which desire to support a specific character set
>>   encoding, which may require I/O port transliteration between Scheme's
>>   presumed platform neutral character/byte encodings and that of its
>>   native host, may do so by defining a collection of functions which map
>>   an arbitrary specific character set encoding into Scheme's neutral
>>   character/byte sequences as required; and/or may extend the standard
>>   function definitions, as long as they do not alter the presumed
>>   neutrality and binary equivalence between Scheme's character/byte data
>>   sequence representation and that of its host.
>
> I don't follow you there.

- If an implementation presumes any specific encoding which is different
  from the host's presumed character encoding, then data will need to be
  transliterated between the two encoding formats, most likely by Scheme's
  own internal read/write procedures, which then prevents data which is
  not encoded in that format from being passed unmodified to the host.
  (For example, a 0xFC 0x31 UTF-8 byte sequence may then need to be
  automatically translated into a single 0xD2 character in the host's
  native character set, and vice versa; thereby preventing a simple raw
  0xFC 0x31 character sequence from being passed to the host unscathed.)

>> (lastly, the notion of enabling Scheme symbols to be composed of arbitrary
>>  extended character set characters which may not be portably displayed
>>  or easily manipulated on arbitrary platforms, is clearly antithetical to
>>  achieving portability; so the suggestion should just be dropped.)
>
> On the contrary. Display and manipulation by other tools is and will
> continue to catch up.  There should be some Unicode SRFIs precisely
> because that is the best bet for portability.

- "continue to catch up", So you're operating under the premise that its a
  good idea to define a language which is capable of using un-displayable
  characters within it's code, because eventually most computers should
  be able to display them? (the best bet for portability, to restrict the
  characters in which the program is written, to characters which are
  represent-able in most presently known and likely future platforms, which
  is what scheme presently does; which is also why I suspect that the
  character set and corresponding facilities used to manipulate program code
  should be distinct from those used to manipulate generalized and/or
  specifically encoded data and/or text; so that the twain won't meet.)

>> Although I know that these views may not be shared by many, I don't
>> believe that Scheme should be indirectly restricted to only being able
>> to interface to a text-only world (regardless of its encoding); and I
>> hope that some recognize that these proposals begin to restrict the
>> applicability of Scheme in just that way, without providing an
>> alternative mechanism to facilitate Scheme's ability to access and
>> manipulate raw binary, which is something all truly flexible programming
>> languages with any legs must do; as the computing world is a tad larger
>> than assuming that all that needs to be processed and interfaced with is
>> text encoded in some specific way.
>
> I think you'll find that almost everyone agrees that Scheme needs
> facilities for "raw binary" manipulation.   I think the mistake you
> make is assuming that these facilities must be made the job of the
> CHAR? and STRING? types.    Octets are integers, not characters.

- I don't disagree; again, however, as Scheme's present character I/O
  functions represent the only standard facilities available to Scheme for
  accessing and manipulating binary data, I don't see how those facilities
  can be refined in any non-strictly-backward-compatible way unless
  alternative facilities to do the same are previously, or simultaneously,
  provisioned. (Therefore, short of an existing supported binary data I/O
  proposal, it would seem to be the obligation of any proposal of
  non-strictly-backward-compatible refinements to Scheme's character and
  string types and/or facilities to also propose alternative binary data
  type and I/O facility standards to replace any of the lower-level
  functionality which may otherwise be lost.)

> -t

Tom, thanks for your time, and feedback/observations/criticism/thoughts,

-paul-