Re: string-* and char-set:*

string-* and char-set:* Duy Nguyen (19 Sep 2019 10:58 UTC)
Re: string-* and char-set:* Lassi Kortela (19 Sep 2019 13:16 UTC)
Re: string-* and char-set:* Arthur A. Gleckler (19 Sep 2019 21:55 UTC)
Re: string-* and char-set:* Lassi Kortela 19 Sep 2019 13:16 UTC
Thank you for reading and commenting!

> The "Character class constants" probably should be built on top of
> srfi-14. Which could be just very simple wrapper to provide
> standardized names for predefined ascii charsets, all these charsets
> are just
>
> (char-set-intersection char-set:ascii <some other charset>)

Damn, I completely missed the existence of SRFI 14, and Arthur didn't tip me
off to it either :) Sorry about that! Indeed, the ASCII sets can be
derived exactly as you write. According to the Gauche manual, SRFI 14 is
being incorporated into R7RS-large as well. So it'd be doubly good to
interoperate with it.
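
As a rough sketch of that derivation, assuming an R7RS implementation that ships (srfi 14) and supports characters beyond ASCII: the names ascii-letters and ascii-digits below are made up for illustration and are not part of SRFI 175 or SRFI 14.

```scheme
(import (scheme base) (srfi 14))

;; Hypothetical names; neither SRFI defines these.
;; Each set is a standard SRFI 14 set narrowed to the ASCII range.
(define ascii-letters
  (char-set-intersection char-set:ascii char-set:letter))
(define ascii-digits
  (char-set-intersection char-set:ascii char-set:digit))

(char-set-contains? ascii-letters #\a)                  ; => #t
(char-set-contains? ascii-digits #\7)                   ; => #t
(char-set-contains? ascii-letters #\1)                  ; => #f
;; U+00E9 is a letter but lies outside ASCII, so the
;; intersection excludes it.
(char-set-contains? ascii-letters (integer->char #xE9)) ; => #f
```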

There are two main reasons to deal with ASCII only:

1) Behavior is easier to understand than full Unicode. This helps ensure
correct implementations of file formats and the like.

2) ASCII operations are faster when implemented at the same level as
Unicode (i.e. both in Scheme or both in C/native code).

For point 1, it doesn't make much difference whether we go with strings
or char-set objects.

For point 2, the character class constants in SRFI 175 are meant for
convenience rather than efficiency, so it doesn't make a difference either.

What's important is for the procedures (predicates and letter/number
converters) to be efficient for use in low-level parsers and such. The
predicates could also mostly be implemented in terms of SRFI 14
char-sets, but it's not as obvious how to make fast implementations of
general char-sets. By contrast, fast ASCII primitives are just fixnum
arithmetic and some if's here and there. (If the current primitives are
too slow for some real-world task, I'd love to hear about that.)
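
To illustrate what "fixnum arithmetic and some if's" can look like, here is a minimal sketch. It is not the SRFI 175 sample implementation; the sketch-* names are hypothetical, and the helpers work on integer code points for simplicity.

```scheme
(import (scheme base))

;; Hypothetical helpers, written only to show the idea.
;; cc is an integer code point.
(define (sketch-upper-case? cc)
  (and (>= cc #x41) (<= cc #x5a)))   ; A..Z

(define (sketch-lower-case? cc)
  (and (>= cc #x61) (<= cc #x7a)))   ; a..z

;; Lower-case and upper-case ASCII letters differ by #x20.
(define (sketch-upcase cc)
  (if (sketch-lower-case? cc) (- cc #x20) cc))

;; Character-level wrapper around the integer version.
(define (sketch-char-upcase c)
  (integer->char (sketch-upcase (char->integer c))))

(sketch-char-upcase #\q)   ; => #\Q
(sketch-char-upcase #\?)   ; => #\?
```

Each predicate is two integer comparisons, and the conversion is one comparison pair plus a subtraction, with no table lookups or set objects involved.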

SRFI 14 is one of the most popular SRFIs (17 implementations!) but not
as ubiquitous as strings. From a compatibility standpoint, all Schemes
have strings, so having the character classes as strings is the simplest
and most compatible option.

I don't know which is better. I'm open to both approaches and would like
to hear opinions from more people.

>> Should we have the ascii equivalent of many string-* procedures (e.g.
>> string-upcase)

That may be a good idea. I left them out since they can be easily made
with string-map, available in R7RS or SRFI 13:

 > (import (srfi 175))
 > (string-map ascii-upcase "hello world")
"HELLO WORLD"

Unlike Unicode characters, ASCII characters are never combined with
neighboring characters, so string-map is always correct for jobs like
this. It may also be fast enough for all practical jobs. If you have
some real job (protocol parsing etc.) where you have to uppercase or
lowercase some text and string-map is not fast enough, I'd love to hear
about that.

The rest of the SRFI 13 (String Libraries) procedures (listed at
https://srfi.schemers.org/srfi-13/srfi-13.html#ProcedureIndex) work out
of the box for ASCII strings as long as the implementation's native
string encoding is an ASCII superset such as Unicode. So if I've
understood correctly, separate procedures are not needed. Apart from the
case conversion ones they just create, compare, slice, join and reverse
strings.

Many SRFI 13 procedures also accept char-set objects, so maybe having
ASCII char-sets would be useful from that standpoint too. Then again,
they also accept predicate procedures, which we have.
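
A small sketch of the predicate route, assuming an implementation that provides (srfi 13): the predicate digit? is defined locally as a stand-in so the example does not depend on any particular SRFI 175 draft name.

```scheme
(import (scheme base)
        (only (srfi 13) string-index string-skip))

;; Local stand-in for an ASCII digit predicate (hypothetical name).
(define (digit? c)
  (and (char>=? c #\0) (char<=? c #\9)))

;; string-index finds the first character satisfying the predicate;
;; string-skip finds the first character that does not.
(string-index "abc123" digit?)   ; => 3
(string-skip "123abc" digit?)    ; => 3
```

The same call sites would also accept a char-set in place of the predicate, so either representation plugs into SRFI 13.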

In general, if some real job isn't fast enough by combining the current
SRFI 175 stuff with SRFI 13, please don't hesitate to say. But if things
are fast enough as it is, I'd like to stick with a somewhat minimal set.
Scheme has higher-order functions, so most jobs involving predicates are
delightfully simple, provided that speed is not an issue.