Re: Parsing Scheme [was Re: strings draft] Tom Lord (24 Jan 2004 18:41 UTC)
> From: xxxxxx@becket.net (Thomas Bushnell, BSG)
> There should be string-id=? (or some other name) which implements the
> Scheme identifier matching rules, which should be specified for the
> required character set, and left unspecified for all other
> characters.
> None of this requires or even implicitly uses a case mapping function.
>> The standard would still need to specify CHAR-DOWNCASE.
> Why? Is there some government bureau that will shut us down if the
> next RnRS eliminates it?
> I don't mind STRING-DOWNCASE, of course, which should have a locale
> argument and be specified to permit the Correct Unicode Thing.
Ok -- I think we can agree on some things. You're roughly right, I
think.
We should also point readers in general to:
http://www.unicode.org/reports/tr15/#Programming_Language_Identifiers
which is Annex 7 ("Programming Language Identifiers") of Unicode
Technical Report 15 ("Unicode Normalization Forms").
Enclosed is a more fleshed-out and improved description of the
approach you're advocating, plus its reconciliation with my
suggestions for R6RS (which, frankly, don't need to change very much
-- mostly this just involves adding new material).
For SRFI-50 list relevance: let me point out that this doesn't change
the proposed char/string FFI at all. On the other hand, the fact that the
recommendations for R6RS continue to work out nicely is confirmation
that the analysis that leads to those FFI recommendations is sound.
So far we've more or less made peace with R5RS, my recommendations for
R6RS, Thomas Bushnell's thoughts on supporting linguistically sane
Scheme identifiers, Shiro's concerns about implementations using
character sets other than Unicode and its subsets/extensions, Bear's
work on infinite character sets, and the emerging design of Pika.
I think what Thomas B. is suggesting is better provided by this:
* (identifier? s) => <bool>
Return #f unless `s' is a legal identifier name.
It is required that:
(identifier? (symbol->string s)) => #t
for all symbols s.
* (fold-identifier name) => folded
Where NAME is a string containing an identifier
name and FOLDED is a string containing an equivalent
identifier name.
Two identifiers are equivalent if and only if:
(string=? (fold-identifier a)
(fold-identifier b))
FOLD-IDENTIFIER is required to be idempotent:
(string=? (fold-identifier a)
(fold-identifier (fold-identifier a)))
=> #t ; for all identifiers a
and, of course, IDENTIFIER? is closed under FOLD-IDENTIFIER:
(or (not (identifier? s))
(identifier? (fold-identifier s)))
=> #t ; for all strings s
The definition of FOLD-IDENTIFIER must be consistent with the
recommendations of Annex 7 ("Programming Language Identifiers") of
Unicode Technical Report 15 for identifier names composed
entirely of Unicode characters. For this purpose, the characters
of the portable Scheme character set are considered to be Unicode
characters. (A short summary of the implications of this
requirement for portable identifiers is that given a portable
identifier, FOLD-IDENTIFIER must map #\A..#\Z to #\a..#\z.)
(FOLD-IDENTIFIER is preferable to STRING-ID=? because it
produces a canonical form of each identifier explicitly
rather than implicitly. The canonical form is useful because
it can be hashed, stored in a trie, etc. It would be
impractical to implement, for example, a symbol table in a
compiler given only STRING-ID=?.)
* (concatenate-identifiers s0 s1 ...) => id
Return a string ID, containing an identifier name which
is the concatenation of the arguments which must themselves
be identifier names.
If all of the arguments are portable Scheme identifiers, then
this function must behave identically to STRING-APPEND.
(As nearly as I can tell, CONCATENATE-IDENTIFIERS is needed
because IDENTIFIER? won't be closed under STRING-APPEND -- but
I could be mistaken about that. More research is needed.)
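To make the laws above concrete, here is a minimal sketch of
FOLD-IDENTIFIER restricted to portable identifiers. It assumes only
R5RS procedures and an ASCII-compatible portable character set. The
helper names IDENTIFIER-EQUIVALENT?, FOLD-IDEMPOTENT? and TABLE-LOOKUP
are hypothetical and exist only for illustration. It says nothing
about what a Unicode-aware implementation would have to do.

```scheme
;; Sketch only: a portable-character-set FOLD-IDENTIFIER.
;; For portable identifiers this maps #\A..#\Z to #\a..#\z and
;; leaves every other portable character unchanged.
(define (fold-identifier name)
  (list->string (map char-downcase (string->list name))))

;; Identifier equivalence, defined through the folded form
;; (hypothetical helper name).
(define (identifier-equivalent? a b)
  (string=? (fold-identifier a) (fold-identifier b)))

;; The idempotence law from the text, as a checkable predicate
;; (hypothetical helper name).
(define (fold-idempotent? a)
  (string=? (fold-identifier a)
            (fold-identifier (fold-identifier a))))

;; Because the folded form is an ordinary string, it can be used
;; directly as a key. An association list stands in for a real
;; symbol table here (hypothetical helper name).
(define (table-lookup table name)
  (let ((entry (assoc (fold-identifier name) table)))
    (and entry (cdr entry))))
```

With this sketch, (table-lookup (list (cons "car" 1)) "CAR") finds the
entry stored under "car". That lookup is the kind of thing
STRING-ID=? alone would make awkward.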
Now, what becomes of the character class procedures such as
CHAR-NUMERIC? I think that these should be retained and corrected so
that one can write a portable Scheme lexical analyzer which can accept
as input programs using the character set extensions of its host
implementation. From what I can tell, that would require the new
procedures:
* (char-id-start? c) => <bool>
Return #t if C is a valid first character in an identifier.
* (char-id-extend? c) => <bool>
Return #t if C is a valid non-first character in an identifier.
* (canonicalize-identifier s) => ID | #f
Given a string S comprised of at least one CHAR-ID-START? character
followed by any number of CHAR-ID-EXTEND? characters, return a
valid identifier name (in the sense of IDENTIFIER?) corresponding
to S or #f if no such identifier name can be constructed.
If S consists only of portable Scheme characters, the result must
be STRING=? to S and not EQ? to S.
* (string->parsed-symbol s) => <symbol>
S must be an IDENTIFIER? string. Return the symbol that identifier
would denote if it were used in a quoted context in a Scheme expression.
(Note how this differs from STRING->SYMBOL.)
* (string->parsed-character s) => <char> | #f
Given a string S whose contents are syntactically a character
constant, return the character that constant denotes or #f if
there is no such character.
If we want to permit extended string syntaxes, at least this is
needed:
* (string->parsed-string s) => <string> | #f
Given a string S whose contents are syntactically a string
constant, return the string that constant denotes or #f if there
is no such string.
Perhaps we'd also want similar procedures for other areas of syntactic
extensibility.
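For the portable character set only, the lexical predicates proposed
above could be sketched as follows. This is an illustration under
stated assumptions, not a specification: the character lists come
from the R5RS identifier grammar, the helpers CHAR-IN? and ALL? are
hypothetical, the peculiar identifiers (+, - and ...) are not
handled, and returning #f for a malformed S is my choice rather than
something the text above requires.

```scheme
;; Sketch only: portable (R5RS) identifier characters.
(define portable-letters
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
(define portable-digits "0123456789")
(define special-initials "!$%&*/:<=>?^_~")   ; R5RS <special initial>
(define special-subsequents "+-.@")          ; R5RS <special subsequent>

;; Hypothetical helper: is character C one of the characters in STR?
(define (char-in? c str)
  (if (memv c (string->list str)) #t #f))

;; Hypothetical helper: does PRED hold for every element of LST?
(define (all? pred lst)
  (or (null? lst)
      (and (pred (car lst)) (all? pred (cdr lst)))))

(define (char-id-start? c)
  (or (char-in? c portable-letters)
      (char-in? c special-initials)))

(define (char-id-extend? c)
  (or (char-id-start? c)
      (char-in? c portable-digits)
      (char-in? c special-subsequents)))

;; For portable input the result is a fresh copy: STRING=? to S
;; but not EQ? to S. Malformed input yields #f in this sketch.
(define (canonicalize-identifier s)
  (let ((chars (string->list s)))
    (and (pair? chars)
         (char-id-start? (car chars))
         (all? char-id-extend? (cdr chars))
         (string-copy s))))
```

A portable lexer could scan while CHAR-ID-EXTEND? holds and then hand
the accumulated string to CANONICALIZE-IDENTIFIER.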
Now, what about the character ordering procedures (e.g. CHAR<?,
STRING<? etc.)? I think these should remain unchanged -- they should
relate to the integer mappings of characters. (Implementations or
future standards are free to add locale parameters or introduce
alternative procedures which are linguistically sensitive.)
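As a sketch of what "relate to the integer mappings" means, the
ordering procedures can be written in terms of CHAR->INTEGER alone.
The names CODE-CHAR<? and CODE-STRING<? are hypothetical, chosen so
that they do not shadow the standard procedures.

```scheme
;; Sketch only: character ordering as plain integer comparison.
(define (code-char<? a b)
  (< (char->integer a) (char->integer b)))

;; String ordering as the lexicographic extension of CODE-CHAR<?.
(define (code-string<? s t)
  (let loop ((i 0))
    (cond ((= i (string-length t)) #f)   ; T exhausted: S is not less
          ((= i (string-length s)) #t)   ; S is a proper prefix of T
          ((code-char<? (string-ref s i) (string-ref t i)) #t)
          ((code-char<? (string-ref t i) (string-ref s i)) #f)
          (else (loop (+ i 1))))))
```

Nothing linguistic is involved. On an implementation whose integer
mapping is ASCII-compatible, for instance, every upper case portable
letter sorts before every lower case one.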
What about case independent character ordering (e.g., CHAR-CI<? and
STRING-CI<?)? I see no compelling reason to eliminate them at this
stage -- they're still useful. I think they should be specified to be
consistent with the single-character default case foldings of Unicode,
where the portable character set is considered to consist of Unicode
characters. This will allow portable Scheme programs to use these
procedures to write programs which accurately manipulate Scheme
programs that use nothing but the portable character set. It would,
for example, allow a portable-character-set implementation of
FOLD-IDENTIFIER. It also reifies into Scheme a sanctioned (even if
non-preferred) sense of Unicode character case -- while Scheme should
_also_ evolve facilities for linguistically preferable case handling,
these facilities will be useful for Scheme programs communicating with
other systems that use only the single-character case mappings.
(Again, implementations and future standards are not precluded from
adding additional parameters or new procedures for default or
locale-specific case handling).
What about case mappings (CHAR-UPCASE and CHAR-DOWNCASE)? Again:
retain them; specify them as using the Unicode single character
mappings; permit implementations to add parameters or new procedures
-- the result allows portable Scheme programs to handle portable
Scheme program texts and captures a useful Unicode text process.
In terms of my "strings draft" -- there is one R6RS recommendation
that should change more substantially than the tweaks suggested above.
I wanted to modify 6.3.4 to say:
These procedures [the character classes] return #t if their
arguments are alphabetic, numeric, whitespace, upper case, or
lower case characters, respectively, otherwise they return #f.
These procedures _must_ be consistent with the procedure READ
provided by the implementation. For example, if a character is
CHAR-ALPHABETIC?, then it must also be suitable for use as the
first character of an identifier.
`a..z' and `A..Z' _must_ be alphabetic and _must_ be respectively
lower and upper case.
#\space, #\tab, and #\formfeed _must_ be CHAR-WHITESPACE?.
`0..9' _must_ be CHAR-NUMERIC?.
No character may cause more than one of the procedures
CHAR-ALPHABETIC?, CHAR-NUMERIC? and CHAR-WHITESPACE? to return
#t.
No character may cause more than one of the procedures
CHAR-UPPER-CASE? and CHAR-LOWER-CASE? to return #t.
Programmers are advised that these procedures are unlikely to be
suitable for linguistic programming in portable code while
implementors are strongly encouraged to define them in ways that
make them a reasonable approximation of their linguistic
counterparts.
It should say:
These procedures [the character classes] return #t if their
arguments are valid identifier start characters, valid identifier
extension characters, alphabetic, numeric, whitespace, upper
case, or lower case characters, respectively, otherwise they
return #f. These procedures _must_ be consistent with the
procedure READ provided by the implementation. For example, if a
character is CHAR-ID-START?, then it must also be suitable for
use as the first character of an identifier.
`a..z' and `A..Z' _must_ be id-start and id-extend characters and
_must_ be respectively lower and upper case.
`a..z' and `A..Z' _must_ be alphabetic. If the argument to
CHAR-ALPHABETIC? is a Unicode character, then it must return #t
if and only if the character is in one of the Unicode general
categories
Lu Ll Lt Lm Lo Nl
#\space, #\tab, and #\formfeed _must_ be CHAR-WHITESPACE?.
`0..9' _must_ be CHAR-NUMERIC?.
No character may cause more than one of the procedures
CHAR-ID-START?, CHAR-NUMERIC? and CHAR-WHITESPACE? to return
#t.
No character may cause more than one of the procedures
CHAR-UPPER-CASE? and CHAR-LOWER-CASE? to return #t.
Programmers are advised that these procedures are unlikely to be
suitable for linguistic programming in portable code while
implementors are strongly encouraged to define them in ways that
make them a reasonable approximation of their linguistic
counterparts.
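The two mutual-exclusion requirements in that wording are easy to
test mechanically. Below is a rough sketch of such a check. The names
COUNT-TRUE, CLASS-EXCLUSIVE? and CHECK-CLASSES are hypothetical, and
the identifier-start predicate is passed in as an argument because
CHAR-ID-START? is only a proposal and no implementation can be
assumed to provide it.

```scheme
;; Hypothetical helper: how many of FLAGS are true?
(define (count-true . flags)
  (let loop ((fs flags) (n 0))
    (cond ((null? fs) n)
          ((car fs) (loop (cdr fs) (+ n 1)))
          (else (loop (cdr fs) n)))))

;; Does character C satisfy both exclusion rules from the text?
;; ID-START? stands in for the proposed CHAR-ID-START?.
(define (class-exclusive? id-start? c)
  (and (<= (count-true (id-start? c)
                       (char-numeric? c)
                       (char-whitespace? c))
           1)
       (<= (count-true (char-upper-case? c)
                       (char-lower-case? c))
           1)))

;; Check every character in the list CHARS.
(define (check-classes id-start? chars)
  (or (null? chars)
      (and (class-exclusive? id-start? (car chars))
           (check-classes id-start? (cdr chars)))))
```

An implementation could run CHECK-CLASSES over every character it
supports. A portable program can at least run it over the portable
character set, for example with CHAR-ALPHABETIC? as a stand-in
predicate.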
A final note: the desirability of the -CI, -UPCASE, and -DOWNCASE
procedures hinges on the assumption that the portable Scheme character
set is a proper subset of Unicode. One can imagine a Scheme standard
that insisted on Unicode and required a much larger set of valid
identifier characters. Though abstractly attractive, such
requirements would preclude tiny implementations of Scheme. Having a
small and simply structured portable character set, and then adding on
to that a level of _optional_ conformance for all of Unicode, is a far
more practical idea.
-t