Re: Unicode and Scheme
Paul Schlie
(08 Feb 2004 20:57 UTC)
To be slightly more rigorous, character/byte values not representing members of the standard Scheme character set should be specified as <other>, giving:

-- refined earlier post --

Re: "This SRFI is based in part on the presumption that one should be able to write a portable Scheme program which can accurately read and manipulate source texts in any implementation, even if those source texts contain characters specific to that implementation."

Personally, I believe it is a mistake to attempt to extend the interpretation of Scheme's standard character type and associated functions more abstractly. They are presently specified in such a way that an implementation may base them on the host platform's native 8-bit byte character encoding, which may be synonymous with the platform's raw octet data interfaces. This has historically enabled various Scheme implementations to manipulate raw byte data streams as character sequences, which may actually encode whatever one needs them to. These proposals begin to break that indirectly, by prohibiting the ability to maintain that equivalence without offering an alternative.

However, it is likely true that Scheme's character set and associated function specification should be tightened up a little even in this regard; so, as feedback on this aspect of the proposals:

- The character set and lexical ordering could be improved along these lines:

    digit:  0 .. 9
    letter: A a .. Z z               ;; where A a .. F f are also hex digits
    symbol: ( ) # ' ` , @ . "        ;; for consistency, lexical ordering
            ; $ % & * / : + -        ;; could/should be defined/improved
            ^ _ ~ \ < = > ? { } [ ] | !  ;; which should also be included
    space:  space tab newline        ;; as well as tab
    other:  <unmapped-values>        ;; unspecified character/byte codes

- Lexical ordering should be refined as above to be more typically useful:

    (char<? #\A #\a ... #\Z #\z) -> #t
    (char<? <digit> <letter> <symbol> <space> <other>) -> #t

- Only <letter> characters have distinct upper/lower case representations; all other character encodings, including those unspecified, are unaltered by the upper-case, lower-case, and read/write-port functions:

    (char-upper-case? <digit> #\A..#\Z <symbol> <space> <other>) -> #t
    (char-lower-case? <digit> #\a..#\z <symbol> <space> <other>) -> #t
    (char-upper-case? #\a..#\z) -> #f
    (char-lower-case? #\A..#\Z) -> #f

    (char=? (char-upper-case (char-lower-case x)) (char-upper-case x)) -> #t
    (char=? (char-lower-case (char-upper-case x)) (char-lower-case x)) -> #t

    for all <letter> characters x:
      (char=? (char-upper-case x) (char-lower-case x)) -> #f
    for all non-<letter> characters x:
      (char=? x (char-upper-case x) (char-lower-case x)) -> #t
    for all characters x:
      (char-ci=? x (char-upper-case x) (char-lower-case x)) -> #t

- All characters are assumed to be encoded as bytes using the host's native encoding representation, thereby enabling equivalence between the host's native raw byte data I/O and storage and an implementation's character-set encoding.

- Portability of the native platform's encoded text is the responsibility of the host platform and/or other external utilities aware of the transliteration requirements between the various encoding formats.

- Implementations which desire to support a specific character set encoding that may require I/O port transliteration between Scheme's presumed platform-neutral character/byte encoding and that of its native host may do so by defining a collection of functions which map that character set encoding into Scheme's neutral character/byte sequences as required; and/or they may extend the standard function definitions, so long as they do not alter the presumed neutrality and binary equivalence between Scheme's character/byte data sequence representation and that of its host.
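To make the proposed classification and ordering concrete, here is a minimal sketch in portable Scheme. The names proposed-char-class, proposed-char<?, letter-rank, and class-rank are invented for illustration (they are not part of any standard or of the proposal itself), and it assumes the host's letters are ASCII-contiguous; within the symbol class it simply falls back on the host's char<? rather than the refined symbol order suggested above.

```scheme
;; Hypothetical sketch of the classification proposed above.
;; Assumes an ASCII-contiguous host encoding; names are illustrative only.

(define (proposed-char-class c)
  (cond ((char-numeric? c) 'digit)
        ((char-alphabetic? c) 'letter)
        ((memv c '(#\( #\) #\# #\' #\` #\, #\@ #\. #\"
                   #\; #\$ #\% #\& #\* #\/ #\: #\+ #\-
                   #\^ #\_ #\~ #\\ #\< #\= #\> #\? #\{ #\} #\[ #\] #\| #\!))
         'symbol)
        ((memv c '(#\space #\tab #\newline)) 'space)
        (else 'other)))   ; unmapped character/byte codes

;; Rank letters in the interleaved order A a B b .. Z z, so that
;; (proposed-char<? #\A #\a) and (proposed-char<? #\a #\B) both hold.
(define (letter-rank c)
  (if (char-upper-case? c)
      (* 2 (- (char->integer c) (char->integer #\A)))
      (+ 1 (* 2 (- (char->integer c) (char->integer #\a))))))

;; digit < letter < symbol < space < other, per the proposed ordering.
(define (class-rank cls)
  (case cls ((digit) 0) ((letter) 1) ((symbol) 2) ((space) 3) (else 4)))

(define (proposed-char<? a b)
  (let ((ca (proposed-char-class a)) (cb (proposed-char-class b)))
    (cond ((not (eq? ca cb)) (< (class-rank ca) (class-rank cb)))
          ((eq? ca 'letter) (< (letter-rank a) (letter-rank b)))
          (else (char<? a b)))))  ; simplification: host order within a class
```

For example, (proposed-char<? #\9 #\A), (proposed-char<? #\A #\a), and (proposed-char<? #\z #\() would all yield #t under this ordering.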
(Lastly, the notion of enabling Scheme symbols to be composed of arbitrary extended-character-set characters, which may not be portably displayed or easily manipulated on arbitrary platforms, is clearly antithetical to achieving portability; the suggestion should simply be dropped.)

Although I know these views may not be shared by many, I don't believe that Scheme should be indirectly restricted to interfacing only with a text-only world (regardless of its encoding); and I hope some will recognize that these proposals begin to restrict the applicability of Scheme in just that way, without providing an alternative mechanism for Scheme to access and manipulate raw binary data, which is something all truly flexible programming languages with any legs must do. The computing world is a tad larger than assuming that everything to be processed and interfaced with is text encoded in some specific way.

Thanks for your patience, and hopeful consideration,

-paul-

------ End of Forwarded Message