SRFI-115 issues John Cowan (20 Oct 2013 21:37 UTC)
Alex Shinn scripsit:

> How to integrate with the PCRE regular expression library? The
> intention is to make this the primitive notation, and for POSIX
> require a separate wrapper such as (pcre->sre <str>). Alternately
> we could allow both in the same API, as in IrRegex, though this
> introduces an ambiguity. Finally, we could make this entirely separate
> from the PCRE API.

I think this is the best way: separate it from string-based REs, with
conversions to and from handled by some other library.

> From SCSH's SREs I've left out the dsm notation which doesn't seem
> as though it need be exposed to the user, the posix-string notation
> because it's better accomplished with pcre->sre, and uncase whose
> exact semantics and motivation I never quite understood. I also left
> out the blank character class since it's a GNU extension without an
> accepted Unicode definition.

+1 on all points.

> | and & are allowed, but the former must be escaped, which looks
> fairly ugly. For aesthetics they can also be written or and and,
> respectively.

Stick with just `and` and `or`, I think.

> I've kept most IrRegex extensions, but made many of the non-POSIX
> ones optional, designated by the regexp-extended feature, and backref
> specifically gets its own feature regexp-backrefs.

This troubles me. It leaves things too much up to the implementation,
and not enough flexibility for the user. These extensions work only if
you have a backtracking NFA, which is inherently less efficient. In
order to provide both efficiency and power, the implementation would
have to provide both an NFA (to be used in the general case) and a DFA
or Thompson-NFA (to be used if the extensions are not needed). This is
what Perl does, but Perl is a rag-bag by nature.

I'd say: leave these things out of the main library, but add another
library that provides them but using the same API. This way, the user
can load (srfi 115) or (srfi 115 extended) and get the most suitable
engine. Of course, they can be the same engine if the implementer
doesn't care that much about speed. If the user needs to load both,
using the R6RS/R7RS prefix feature makes both APIs available.

> I left out the common utility patterns integer, domain, url, etc.,
> which can easily enough be included in libraries and unquoted into
> SREs.

I think the large language should have these, either in this SRFI or in
another SRFI, but in any case in (srfi 115 patterns).

> The => shorthand for named matches used by IrRegex would perhaps have
> better been named <-, the more common choice to represent binding in
> parsers, leaving => open for the send-to-procedure idiom used in cond.

No opinion on this, except that if a change is to be made, this is the
time to make it.

> The API uses string indices for start, end and match positions, which
> is slow for a UTF8 implementation.

That is the Right Thing.

> Many Unicode properties as well as Unicode script names that are
> available in PCRE are not provided as char-sets here.

I'm working on a Unicode properties API.

> SREs with embedded SRFI 14 char-sets can't be written and read back
> in portably. R7RS WG2 is considering external syntax representations,
> and may include them for SRFI 14 char-sets as well, making this a
> non-issue.

Not quite a non-issue, because if we have those things it will probably
be as macros, not as lexical representations. So they will need to be
unquoted and won't work in data files.

> On the other hand SREs with embedded compiled regexps, as allowed in
> SCSH, are not supported, largely to preserve writeability. Instead you
> should embed other SREs.

+1

> regexp->sre is frequently requested in IrRegex. It is useful and the
> only argument against it is that it would require more memory for
> compiled regexps (linearly more for most implementations), but I'll
> wait to see if it's requested in the discussion.

I think it's an important thing to have. The wording should allow for
either caching within the regexp object or decompilation, and should
warn that caching may produce a space leak.

--
John Cowan          http://ccil.org/~cowan        xxxxxx@ccil.org
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash,
The day and hour soon are coming / When all the IT folks say "Gosh!"
It isn't from a clever lawsuit / That Windowsland will finally fall,
But thousands writing open source code / Like mice who nibble through a wall.
        --The Linux-nationale by Greg Baker