I have been reviewing the character-set SRFI in light of my recent study
of Unicode and internationalisation. The main change is that I have
killed ASCII-RANGE->CHAR-SET and replaced it with UNICODE-RANGE->CHAR-SET.
The general design principles are:
- I don't want to *require* conformant Schemes to use Unicode.
  I specifically want to allow "small character" implementations
  such as ASCII or Latin-1.
- However, I do want code to be portable across conformant implementations.
  So, the one routine in SRFI-14 that exposes encodings commits to a
  Unicode interface as the uber-spec for character encodings. This is
  *independent* of how chars are stored/represented "under the hood,"
  and the API lets the caller choose what happens when a program
  requests, via its Unicode code, a character that the implementation
  does not provide.
More elaborate hackery would need a *character* SRFI, with routines for
encoding and decoding characters; that's beyond the scope of SRFI-14.
I append the spec for UNICODE-RANGE->CHAR-SET below. Comments?
-Olin
unicode-range->char-set lower upper [error? base-cs] -> char-set
unicode-range->char-set! lower upper error? base-cs -> char-set
Returns a character set containing every character whose Unicode
code lies in the half-open range [LOWER,UPPER).
The [LOWER,UPPER) range must lie completely within the general Unicode
space: 0 <= LOWER <= UPPER <= 2^32 - 1. If the requested range includes
unassigned Unicode values, these are silently ignored (the current Unicode
specification has "holes" in the space of assigned codes). If the
requested range includes "private" or "user space" codes, these are
handled in an implementation-specific manner; however, a Unicode-based
Scheme implementation should pass them through transparently.
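To make the three kinds of code concrete, here is a rough sketch in Python (used only as a modelling language; it is not part of the proposal). The helper name classify_code is hypothetical. It relies on the Unicode tables bundled with the Python interpreter, and it assumes that surrogate codes should be treated like holes, which the spec text above does not address.

```python
import unicodedata

def classify_code(code):
    """Classify a code as 'unassigned', 'private', or 'assigned'.

    Results for recently assigned codes depend on the Unicode version
    shipped with the Python interpreter running this sketch.
    """
    if code > 0x10FFFF:
        # Nothing is assigned beyond the current Unicode code space.
        return "unassigned"
    category = unicodedata.category(chr(code))
    if category in ("Cn", "Cs"):
        # Cn = not assigned; Cs = surrogate (assumption: treat as a hole).
        return "unassigned"
    if category == "Co":
        # Private-use code: handling is implementation-specific.
        return "private"
    return "assigned"
```

Under this model, a range request would skip "unassigned" codes silently and hand "private" codes to whatever policy the implementation chooses.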
If any code from the requested range specifies a valid, assigned Unicode
character but has no corresponding representative in the implementation's
character type, then (1) an error is raised if ERROR? is true, and (2) the
code is ignored if ERROR? is false (the default). This might happen, for
example, if the implementation uses ASCII characters, and the requested
range includes non-ASCII characters.
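The ERROR? behaviour can be modelled as follows. This is a Python sketch of the semantics described above, not an implementation of the SRFI: the names unicode_range_to_char_set, ascii_native and is_assigned_character are made up for illustration, the "implementation" is assumed to be ASCII-only, and private-use and surrogate codes are simply dropped, which is one possible implementation-specific choice.

```python
import unicodedata

MAX_CODE = 2**32 - 1  # upper bound of the code space given in the spec text

def is_assigned_character(code):
    """True for codes that name an assigned, non-private Unicode character."""
    if code > 0x10FFFF:
        return False
    # Cn = unassigned, Cs = surrogate, Co = private use (dropped in this model).
    return unicodedata.category(chr(code)) not in ("Cn", "Cs", "Co")

def ascii_native(code):
    """Hypothetical 'small character' implementation: only ASCII exists."""
    return chr(code) if code < 128 else None

def unicode_range_to_char_set(lower, upper, error=False, native=ascii_native):
    """Model of UNICODE-RANGE->CHAR-SET over the half-open range [lower, upper)."""
    if not (0 <= lower <= upper <= MAX_CODE):
        raise ValueError("range lies outside the Unicode space")
    result = set()
    # No code above 0x10FFFF is assigned, so there is no need to scan further.
    for code in range(lower, min(upper, 0x110000)):
        if not is_assigned_character(code):
            continue  # holes are silently ignored
        ch = native(code)
        if ch is None:
            if error:
                raise LookupError("no native character for code %#x" % code)
            continue  # ERROR? false: ignore the unrepresentable code
        result.add(ch)
    return frozenset(result)
```

For example, unicode_range_to_char_set(0x7E, 0x82) keeps only the two ASCII characters in the range, while the same call with error=True raises because code 0x80 has no ASCII counterpart.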
If character set BASE-CS is provided, the characters specified by the
range are added to it. UNICODE-RANGE->CHAR-SET! is allowed, but not
required, to side-effect and reuse the storage in BASE-CS;
UNICODE-RANGE->CHAR-SET produces a fresh character set.
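The difference between the two variants can be sketched in Python, modelling character sets as plain sets and ignoring the ERROR? and unassigned-code rules for brevity. The names add_range and add_range_in_place are hypothetical stand-ins for UNICODE-RANGE->CHAR-SET and UNICODE-RANGE->CHAR-SET!, and the ranges used are assumed to stay within what chr accepts.

```python
def add_range(lower, upper, base_cs=frozenset()):
    """Model of UNICODE-RANGE->CHAR-SET: always builds a fresh set."""
    return frozenset(base_cs) | {chr(code) for code in range(lower, upper)}

def add_range_in_place(lower, upper, base_cs):
    """Model of UNICODE-RANGE->CHAR-SET!: may reuse base_cs's storage.

    This model always mutates, but the spec only *allows* that, so
    callers should rely on the return value, not on the side effect.
    """
    base_cs.update(chr(code) for code in range(lower, upper))
    return base_cs

# Usage: extend the decimal digits with A..F to get upper-case hex digits.
digits = frozenset("0123456789")
hex_digits = add_range(0x41, 0x47, digits)  # digits itself is unchanged
```

Portable client code would therefore write the "!" call in the same style as the pure one, always using the returned set.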
Note that ASCII codes are a subset of the Latin-1 codes, which are in turn
a subset of the 16-bit Unicode codes, which are themselves a subset of the
32-bit Unicode codes. We commit to a specific encoding in this routine,
regardless of the underlying representation of characters, so that client
code using this library will be portable. I.e., a conformant Scheme
implementation may use EBCDIC or SHIFT-JIS or even 6BIT to encode
characters; it must simply map the Unicode characters from the given range
into the native representation (when possible).
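As an illustration of "map the Unicode characters into the native representation (when possible)", here is a Python sketch that uses Python's codec machinery to stand in for a non-Unicode implementation. The function native_char_set is hypothetical, and the choice of the cp037 codec (an EBCDIC code page) as the "native" encoding is an assumption made for the example. Codes the codec cannot encode are ignored, as with ERROR? false.

```python
import unicodedata

def native_char_set(lower, upper, codec):
    """Collect native encodings of the assigned codes in [lower, upper).

    Each member of the result is the byte string that `codec` uses for
    one character; unrepresentable codes are silently skipped.
    """
    result = set()
    for code in range(lower, min(upper, 0x110000)):
        ch = chr(code)
        if unicodedata.category(ch) == "Cn":
            continue  # unassigned code: a hole in the Unicode space
        try:
            result.add(ch.encode(codec))
        except UnicodeEncodeError:
            continue  # no counterpart in the native representation
    return frozenset(result)

# The caller names characters by Unicode code; the storage is EBCDIC.
ebcdic_abc = native_char_set(0x41, 0x44, "cp037")
```

The client asks for codes 0x41 through 0x43 regardless of the fact that this "implementation" stores A, B and C as bytes 0xC1, 0xC2 and 0xC3; a request for Greek letters against the same codec would just yield an empty set.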