On 7/16/05, Matthew Flatt <xxxxxx@cs.utah.edu> wrote:
>
> I am personally convinced (by this discussion and by past experience) that `string-ci=?' as defined in the SRFI is not what you really want under most circumstances.  But it's often a good approximation.  I think that Scheme needs at least an operation like `string-ci=?' for portable programs, something like it will exist in most implementations, it's simple to implement, and it's consistent with the rest of the proposal -- so it still seems right to me to put it in the SRFI, despite its many flaws.

This seems to me a very disturbing precedent, if R6RS will deliberately specify "the wrong thing" just because it cannot figure out "the right thing."  In a case like this, I think it would be better not to specify anything.

So far, the following conflicts exist between R5RS (ASCII) and Unicode:

1) Case operations.

In ASCII, CHAR-*CASE and STRING-*CASE can be defined in terms of one another, and either can be used to define CHAR-CI=? or STRING-CI=?, or these can be provided separately.  In ASCII, at the very least STRING-CI=? is required for basic parsing of text formats (though in reality it is possible that such parsing will be done directly on octet-arrays rather than native strings).  CHAR-CI=? and CHAR-*CASE can be used to improve efficiency in low-level libraries such as parsers and regex matchers.

In Unicode, the character-level operations have no meaning.  It is in fact far, far worse to provide character-level operations that "do as much as they can" given the Unicode data for single characters, because then you encourage people to use this feature, but by definition any program that performs case operations at the character level is broken and can never be fixed or extended.  Also, STRING-CI=? cannot be defined in terms of STRING-*CASE, and both are locale-specific procedures.

There are two ways to reconcile these differences: one is to unify the procedures, the other is to create separate procedures.  Unified should be something like:

  CHAR-*CASE, CHAR-CI=?
    - as in R5RS
    - folds ASCII *only* (please don't encourage bad code)

  STRING-*CASE, STRING-CI=?
    - takes an optional locale argument (which may be ignored)
    - guaranteed to work at least on ASCII characters as in R5RS
    - optionally Unicode aware

The alternative would be to provide the above procedures as ASCII-centric only, and to provide new procedures for linguistic case operations, such as:

  TEXT-*CASE str [locale]
  TEXT-FOLD-CASE str [locale]     ; faster for multiple comparisons
  TEXT-CI=? str1 str2 [locale]

2) Collation.

This is insanely complicated.  Apart from being highly locale-dependent, there are an infinite number of variations one may want, such as ignoring a leading "The" in English, sorting sequences of digits as numbers, and so on.  In Japanese, collation is an AI-complete problem, and attempts to do it properly could require options such as which dictionary to use, among others.

To unify these one would expect something like

  STRING<? a b [keyword-arguments ...]

or, as separate procedures,

  TEXT<? a b [keyword-arguments ...]

and/or something like

  STRING-COLLATE list-of-strings [keyword-arguments ...]

Here it is very important to be able to sort lexically on code-points for use in efficient and portable search-tree and database algorithms.  Therefore, if STRING<? is overloaded, one should expect a keyword argument such as

  (STRING<? a b 'lexical: #t)

to force this standardized sorting.
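As a rough illustration only, a TEXT<? overloaded this way might dispatch as in the following sketch.  The %CODE-POINT<? and %LOCALE-COMPARE helpers are hypothetical (no standard provides them), as is the convention of scanning the rest-argument list for the keyword:

  ;; Minimal sketch of a TEXT<? accepting a 'lexical: keyword.
  ;; %CODE-POINT<? is assumed to be strict code-point comparison
  ;; (suitable for search trees and databases); %LOCALE-COMPARE is
  ;; assumed to be a locale-sensitive comparison returning a
  ;; negative, zero, or positive number.
  (define (text<? a b . args)
    (let ((lexical? (cond ((memq 'lexical: args) => cadr)
                          (else #f))))
      (if lexical?
          (%code-point<? a b)                   ; portable, stable order
          (negative? (%locale-compare a b)))))  ; linguistic order

  ;; (text<? "apple" "Banana" 'lexical: #t)  => #f  (#\B < #\a in ASCII)
  ;; (text<? "apple" "Banana")               => #t  in an English locale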
3) Normalization.

Implementations should be allowed to keep all strings internally normalized, and/or to perform normalized comparisons for the likes of STRING=?.  Unlike subjective collation, one should generally want to treat strings as identical to their normalized forms; however, in implementing certain low-level libraries (such as encoding converters) one may want explicit control of code-point equivalence.  Furthermore, one may want library procedures to convert between the different normalization forms; however, if the implementation always maintains a specific internal normalization, conversions to other normalizations by definition cannot return native strings.  There is a lot left to discuss and experiment with in this area, but for the time being separating STRING=? from TEXT=? may be worth considering.

4) Definition of a character.

Rather than pure code-point iterators, one may wish to define operations on sequences of code-points representing higher-level linguistic components (e.g. joining accent marks) or even glyphs (returning ligatures, or consonant-plus-vowel clusters for scripts like Thai).  Fortunately these can be defined in terms of Unicode code-points, so they can be relegated to future libraries.

It should be kept in mind that we will likely always have Schemes implementing the full range of Unicode support, from nothing to complete (and then some), just as we have variation in the range of number-tower support.  However, unlike in the number tower, the ASCII vs. Unicode distinctions are semantically different, and even in an implementation supporting all of Unicode, you may wish to use ASCII-specific procedures which ignore non-ASCII data or possibly even throw an error in its presence.  Moreover, there is a very significant overhead in the Unicode versions, both in speed and in memory usage.

Because using an ASCII-level operation makes clear what you are doing (non-linguistic parsing), and because it can be significantly more efficient, I'm leaning towards providing separate procedures.

--
Alex
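P.S.  To make the STRING=? vs. TEXT=? split in 3) concrete, here is a minimal sketch.  The %STRING-NORMALIZE procedure is hypothetical (no standard Scheme provides it); assume it returns a new string in the requested normalization form:

  ;; STRING=? stays strict code-point equality (what an encoding
  ;; converter needs); TEXT=? compares normalized forms, so a
  ;; precomposed character and its decomposed base-plus-combining-mark
  ;; sequence compare as equal.
  (define (text=? a b)
    (string=? (%string-normalize a 'nfc)
              (%string-normalize b 'nfc)))

  ;; e.g. LATIN SMALL LETTER E WITH ACUTE (U+00E9) vs. "e" followed
  ;; by COMBINING ACUTE ACCENT (U+0301):
  ;;   STRING=?  => #f  (different code-point sequences)
  ;;   TEXT=?    => #t  (identical NFC forms)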