string-* and char-set:*
Duy Nguyen
(19 Sep 2019 10:58 UTC)
|
Re: string-* and char-set:* Lassi Kortela (19 Sep 2019 13:16 UTC)
|
Re: string-* and char-set:*
Arthur A. Gleckler
(19 Sep 2019 21:55 UTC)
|
Re: string-* and char-set:* Lassi Kortela 19 Sep 2019 13:16 UTC
Thank you for reading and commenting!

> The "Character class constants" probably should be built on top of
> srfi-14. Which could be just very simple wrapper to provide
> standardized names for predefined ascii charsets, all these charsets
> are just
>
> (char-set-intersection char-set:ascii <some other charset>)

Damn, I completely missed the existence of SRFI 14, and Arthur didn't tip me off to it either :) Sorry about that!

Indeed, the ASCII sets can be derived exactly as you write. According to the Gauche manual, SRFI 14 is being incorporated into R7RS-large as well, so it'd be doubly good to interoperate with it.

There are two main reasons to deal with ASCII only:

1) Behavior is easier to understand than full Unicode. This helps ensure correct implementations of file formats and the like.

2) ASCII operations are faster when implemented at the same level as Unicode (i.e. both in Scheme or both in C/native code).

For point 1, it doesn't make much difference whether we go with strings or char-set objects.

For point 2, the character class constants in SRFI 175 are meant for convenience rather than efficiency, so it doesn't make a difference there either. What's important is for the procedures (predicates and letter/number converters) to be efficient for use in low-level parsers and such.

The predicates could also mostly be implemented in terms of SRFI 14 char-sets, but it's not as obvious how to make fast implementations of general char-sets. By contrast, fast ASCII primitives are just fixnum arithmetic and some if's here and there. (If the current primitives are too slow for some real-world task, I'd love to hear about that.)

SRFI 14 is one of the most popular SRFIs (17 implementations!) but not as ubiquitous as strings. From a compatibility standpoint, all Schemes have strings, so having the character classes as strings is the simplest and most compatible option.

I don't know which is better. I'm open to both approaches and would like to hear opinions from more people.
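For reference, the derivation quoted above can be sketched as follows. This is only an illustration, assuming an R7RS implementation that ships (srfi 14); the names ascii-letter-set, ascii-digit-set and ascii-letter-via-char-set? are hypothetical and are not defined by SRFI 14 or SRFI 175.

```scheme
(import (scheme base) (scheme write) (srfi 14))

;; Hypothetical names: derive ASCII-only sets by intersecting the
;; standard SRFI 14 sets with char-set:ascii.
(define ascii-letter-set
  (char-set-intersection char-set:ascii char-set:letter))
(define ascii-digit-set
  (char-set-intersection char-set:ascii char-set:digit))

;; A char-set can back a predicate through char-set-contains?.
(define (ascii-letter-via-char-set? c)
  (char-set-contains? ascii-letter-set c))

(display (ascii-letter-via-char-set? #\q)) (newline)                 ; #t
(display (ascii-letter-via-char-set? #\7)) (newline)                 ; #f
(display (ascii-letter-via-char-set? (integer->char 955))) (newline) ; #f, Greek lambda

;; A string constant converts to a char-set in one call, so the two
;; representations discussed above can interoperate.
(display (char-set= ascii-digit-set (string->char-set "0123456789")))
(newline)                                                            ; #t
```

The last expression suggests that keeping the class constants as strings does not rule out SRFI 14 interoperability, since string->char-set and char-set->string convert between the two where SRFI 14 is available.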
>> Should we have the ascii equivalent of many string-* procedures (e.g.
>> string-upcase)

That may be a good idea. I left them out since they can easily be built with string-map, available in R7RS or SRFI 13:

    > (import (srfi 175))
    > (string-map ascii-upcase "hello world")
    "HELLO WORLD"

Unlike Unicode characters, ASCII characters are never combined with neighboring characters, so string-map is always correct for jobs like this. It may also be fast enough for all practical jobs. If you have some real job (protocol parsing etc.) where you have to uppercase or lowercase some text and string-map is not fast enough, I'd love to hear about that.

The rest of the SRFI 13 (String Libraries) procedures (listed at https://srfi.schemers.org/srfi-13/srfi-13.html#ProcedureIndex) work out of the box for ASCII strings as long as the implementation's native string encoding is an ASCII superset such as Unicode. So if I've understood correctly, separate procedures are not needed. Apart from the case conversion ones, they just create, compare, slice, join and reverse strings.

Many SRFI 13 procedures also accept char-set objects, so maybe having ASCII char-sets would be useful from that standpoint too. Then again, they also accept predicate procedures, which we have.

In general, if some real job isn't fast enough when combining the current SRFI 175 stuff with SRFI 13, please don't hesitate to say so. But if things are fast enough as they are, I'd like to stick with a somewhat minimal set. Scheme has higher-order functions, so most jobs involving predicates are delightfully simple provided that speed is not an issue.
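As a rough illustration of the points above, here is a sketch in plain R7RS that does not need SRFI 175 or SRFI 13 installed. The procedures sketch-ascii-upcase, sketch-ascii-alphabetic? and string-count-if are hypothetical stand-ins written for this example; with SRFI 175 available, ascii-upcase and the library's own predicates would take their place.

```scheme
(import (scheme base) (scheme write))

;; Stand-in for an ASCII upcase on characters (hypothetical name).
;; Just integer comparison and subtraction; non-ASCII chars pass through.
(define (sketch-ascii-upcase c)
  (let ((n (char->integer c)))
    (if (<= 97 n 122)                 ; #\a .. #\z
        (integer->char (- n 32))      ; shift down to #\A .. #\Z
        c)))

;; Stand-in for an ASCII letter predicate (hypothetical name).
(define (sketch-ascii-alphabetic? c)
  (let ((n (char->integer c)))
    (or (<= 65 n 90) (<= 97 n 122))))

;; Higher-order helper: count the characters of s that satisfy pred.
(define (string-count-if pred s)
  (let loop ((i 0) (count 0))
    (if (= i (string-length s))
        count
        (loop (+ i 1)
              (if (pred (string-ref s i)) (+ count 1) count)))))

(display (string-map sketch-ascii-upcase "hello world"))
(newline)                                            ; HELLO WORLD
(display (string-count-if sketch-ascii-alphabetic? "GET /index.html"))
(newline)                                            ; 12
```

Because each ASCII character is handled independently of its neighbors, mapping a per-character procedure over the string is enough here; whether this is fast enough for a given parser is something only measurement on a real workload can tell.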