(Hopefully) final changes to SRFI-14 (character sets)
shivers@xxxxxx 30 Apr 2000 00:45 UTC
As I prepare to conclude work on the SRFI-13 string library, I have reworked
the SRFI-14 character-set spec, principally to get it synced up with the
Unicode world. Mike S will presumably have the new draft available at
http://srfi.schemers.org/srfi-14/srfi-14.txt
(It is also available at
ftp://ftp.ai.mit.edu/people/shivers/srfi/14/srfi-14.txt)
A summary of the changes appears below. I have no further changes I wish
to make to this library. If review does not reveal any problems, we can
put this to bed.
-Olin
-------------------------------------------------------------------------------
- Added a function for hashing character sets.
- Uniformly extended the char-set constructor procedures to take an optional
  BASE-CS argument; in this case, the procedure adds the requested characters
  to the characters already in BASE-CS. This allows convenient incremental
  construction of heterogeneous character sets, e.g.
      (predicate->char-set vowel?
                           (list->char-set '(#\+ #\-)
                                           (string->char-set "13579")))
  or, more efficiently
      (predicate->char-set! vowel?
                            (list->char-set! '(#\+ #\-)
                                             (string->char-set "13579")))
- I removed the seventeen predicates
      char-lower-case?    char-upper-case?    char-title-case?
      char-letter?        char-digit?         char-letter+digit?
      char-graphic?       char-printing?      char-whitespace?
      char-iso-control?   char-punctuation?   char-symbol?
      char-hex-digit?     char-blank?         char-ascii?
      char-empty?         char-full?
  They belong in a *character* library, not a char-set library.
- I have made pervasive changes to the SRFI to bring it into alignment with
  Unicode concepts:
    - Changed the name ASCII-RANGE->CHAR-SET to the more modern
      UCS-RANGE->CHAR-SET, and provided a full specification in terms
      of UCS/Unicode.
    - Changed "alphabetic" and "numeric" to Unicode terms "letter" and "digit."
    - Split "symbols" out from "punctuation" characters, in conformance with
      Unicode.
    - Renamed CHAR-SET:CONTROL to CHAR-SET:ISO-CONTROL, to make clear that
      weirdo Unicode control codes are excluded. (This is in alignment with
      Java.)
    - Added CHAR-SET:TITLECASE to accompany CHAR-SET:LOWERCASE &
      CHAR-SET:UPPERCASE.
    - Specified what the standard character sets are in Unicode, Latin-1
      and ASCII implementations. These definitions are almost completely
      compatible with Java's. (The only real incompatibility is the definition
      of whitespace.) The ASCII/Latin-1/Unicode specs are compatible, so
      that code written using these sets has a good chance of being portable
      across implementations with different underlying character representations.
      Being compatible with Java is occasionally challenging, as the Java
      definitions are not internally consistent. There is discussion of the
      specifics where relevant.
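
To make the BASE-CS convention from the summary above concrete, here is a
small sketch. It assumes an implementation of SRFI-14 in which
LIST->CHAR-SET and STRING->CHAR-SET accept the optional base set and in which
CHAR-SET=, CHAR-SET-UNION and CHAR-SET-CONTAINS? are available; the
R7RS-style import line and all the defined names are illustrative
assumptions, not part of the draft.

```scheme
;; Sketch only: the import form is an assumption and varies by implementation.
(import (scheme base) (srfi 14))

;; Without a base set: just the listed characters.
(define signs (list->char-set '(#\+ #\-)))

;; With a base set: the listed characters are added to the odd digits.
(define odd-digits (string->char-set "13579"))
(define signs+odd-digits (list->char-set '(#\+ #\-) odd-digits))

;; The result should have the same members as the union of the two sets.
(define same-as-union?
  (char-set= signs+odd-digits (char-set-union signs odd-digits)))

;; The non-destructive constructor should leave the base set untouched.
(define base-unchanged?
  (not (char-set-contains? odd-digits #\+)))
```

If the assumptions hold, both SAME-AS-UNION? and BASE-UNCHANGED? should be
true. The "!" variants may instead reuse the storage of the base set, which
is why they are the more efficient choice when the base set is a fresh
intermediate value.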