> As long as the text of the SRFIs doesn't change, the APIs won't
> change, so we can make the changes without bothering the authors.

This is the biggest relief to us :) If we are permitted to change markup at will, it's just a matter of hashing out the technical approach (and any approach will be an improvement).

> I'm okay if what we do breaks some authors' tools, but I'd like us
> to try to minimize that.

What if we had a script that uses one of those popular tagsoup libraries (lenient HTML -> strict HTML) and then does post-processing to generate something close enough to the standardized (X)HTML structure we want? If the script worked well enough, that could be low-friction enough for SRFI authors.

> I've settled on https://github.com/ericchiang/pup and exported the
> whole thing as JSON and moved on from there...

> I just checked, and the excellent tool Pandoc does indeed support
> conversion from HTML to XHTML, so that process could be automated.

Always nice to have more tools at our disposal!

> Given the current state of available Scheme developers, I doubt more
> than a few people would actually "review" the output of such tools...
> Most will just "use" the output as is.
>
> On the other side, the actual text of the HTML is reviewed at least by
> the author and the SRFI index editor.

I agree that it would definitely be better to have visible markup reviewed by the SRFI author in the final format, but in the meantime, for old SRFIs, the choice is between an index incompletely reviewed by third parties and no index :)

> I still think we can use HTML parsers instead of requiring strict
> XHTML parsing, though.

Like Ciprian, I would be in favor of XHTML eventually because it's easier to parse (XML libraries are absolutely everywhere and are more reliable than tagsoup). We may be able to use a tagsoup library to automatically convert HTML to XHTML.

> I am all in favor of having more volunteers! Unless the conversion
> is automated, I would prefer not to do it all myself.

I can empathize with this. I was thinking along the lines of there being, on average, one new SRFI per month to format, which ought to take 30-60 minutes if the source HTML is reasonable. And if no one can make it in a particular month, there would be no time pressure, since that SRFI can simply be formatted later.

In my mind, the big issue re: volunteers would be the huge backlog of existing SRFIs. But again, there's no time pressure. If it takes 3 years to convert the whole backlog, the result will still be useful.

Perhaps it would be best to first settle on the right format for new SRFIs. Then, once we convert the old ones, we'd stand a chance of getting the conversion right once and not having to go back and do a second round of fixups.

> We would have to agree on the format in which the index was stored,
> e.g. Ciprian's suggested format, so that more than one tool could
> make use of that data.

I would prefer to have the SRFI source documents in a standard format, and then have an API server built upon that foundation. The API could serve, index and query them in any number of ways, whatever people need. The API server would be open source, so hardcore users can also use it locally as a library. But I would strongly favor a "single point of truth" in the documents themselves instead of a separately maintained index (indexes auto-generated from the source documents are fine, in any format). This is based on observing a kind of Murphy's Law: "Any information that can go out of sync, will go out of sync" :)
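To make "auto-generated from the source documents" concrete, here's a toy sketch in Python. The sample markup and the `def proc` class convention are hypothetical -- just one way the annotations could look, not a proposal:

```python
# Sketch: regenerate an index straight from an annotated SRFI document.
# The sample markup and the "def proc" class convention are made up.
import xml.etree.ElementTree as ET

SAMPLE = """<dl>
  <dt><code class="def proc">vector-map</code></dt>
  <dd>Applies a procedure to each element...</dd>
  <dt><code class="def proc">vector-for-each</code></dt>
  <dd>Like vector-map, but for effect...</dd>
</dl>"""

def definitions(root):
    # Walk the whole tree; an element counts as a definition when its
    # class attribute contains both "def" and "proc", in any order and
    # with any other classes alongside.
    for elem in root.iter():
        classes = elem.get("class", "").split()
        if "def" in classes and "proc" in classes:
            yield elem.text

print(sorted(definitions(ET.fromstring(SAMPLE))))
# => ['vector-for-each', 'vector-map']
```

Since the index is recomputed from the document on every run, there is nothing that can drift out of sync.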
> In my view, if you just want to "index" the existing body of SRFI
> documents, then adding HTML classes to all the existing documents is
> more work than taking one of the already "crawled" lists (by you, by
> me, or by others), and just double-checking them to make sure they
> are "complete" and "correct".
>
> Afterwards, starting from this index one could write a small program
> that goes back to the SRFI, and based on some simple heuristics
> manages to back-reference the place where these definitions are
> "defined".

Not sure about this. I tried marking up a couple of SRFIs with classes, and it was quite swift and easy work once I got the hang of it. Once it becomes second nature, double-checking will be almost the same act as writing the extra tags/classes. (When I was writing the tags, I continually ran a script to regenerate the index, so I was looking at the source and the index side by side.) Particularly if the results go right into the source documents, one doesn't have to worry about which version of which index one is using (just auto-generate the index from the source).

This may just be some kind of bias from my personality and experiences, but I strongly feel that things are simpler if there is one officially blessed place where we keep the ground truth. I always get confused managing different versions of things in different places. I can't make an objective case for why I prefer this, apart from having many experiences where it has proven simpler.

> Namely when they are "ready-for-being-published", we just export
> them into HTML and annotate them ourselves.
>
> And this is also why the XHTML approach won't be too cumbersome. We
> can just "translate" even the HTML ones into XHTML, making sure they
> are conformant.

> My suggestion for XHTML is purely pragmatic: XHTML is XML, so one
> can just use any XML library to parse the document.
>
> Now I know that it "seems" that there are many HTML parsers out
> there; unfortunately this is not true... There are a few, at least
> for the most popular programming languages, however they are
> "bloated" and full of issues...

I fully agree with these points. In particular, it'd be very nice if the final format doesn't require a tagsoup parser library but can be read with a strict XML parser instead. IMHO it's fine if we use tagsoup to do the initial conversion from the author's HTML to the final format (there's a sketch of that step below), but once an SRFI is in the final format, it would be nice to let tool writers use simple and reliable parsers.

> Thus I think that there are two different problems:
> * "new" SRFI documents that have to be "structured", "annotated" and "indexed";
> * "old" SRFI documents that can be just "back-referenced" starting
>   from an existing "index";
> * (and a third problem) "transforming" the old SRFI documents so that
>   they are in line with the newly proposed format;

This is a good delineation :)
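As for the tagsoup-then-strict pipeline mentioned above, here is a minimal sketch of the conversion step. It assumes Python's html5lib; any lenient parser that produces a tree would do just as well:

```python
# Sketch: lenient HTML in, well-formed XHTML-ish markup out.
# Assumes the html5lib library; any tagsoup parser would do.
import sys
import xml.etree.ElementTree as ET
import html5lib

def soup_to_xhtml(text):
    # html5lib tolerates broken markup and builds the same tree a
    # browser would, wrapping content in html/head/body as needed.
    root = html5lib.parse(text, treebuilder="etree",
                          namespaceHTMLElements=False)
    # Post-processing toward the standardized SRFI structure would go
    # here, walking and rewriting the tree.
    return ET.tostring(root, encoding="unicode", method="xml")

if __name__ == "__main__":
    sys.stdout.write(soup_to_xhtml(sys.stdin.read()))
```

The point is that the lenient parser is only needed once, at conversion time; everything downstream gets well-formed XML.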
> However, thinking from a practical point of view, especially given
> how "rudimentary" some XML/HTML parsing libraries are for various
> languages, I think it's easier to search for just
> `^(.*[ ]+)?def-proc([ ]+.*)$` than trying to imagine in how many
> ways one can combine the classes `def` and `proc`.
>
> A good exercise for this "annotation" experiment I think would be:
> * take the whole HTML file, and replace newlines with spaces: `tr '\n' ' '`;
> * write a `grep`-based line to extract, say, all procedure names;

I would think it best not to worry about grep and regexps. In principle it's nice to be able to grep anything, but those line-oriented tools are so hopelessly ill-suited to XML's nesting, quoting, escaping and variations in whitespace that it's almost impossible to write anything reliable with them. I've done it every once in a while for over a decade out of laziness, and it always fails, even on a simple corpus of text. XML/HTML tree-walking libraries are so widespread and easy to use now that I would just start from the assumption that tools can access them. There are also some tools that convert XML to a line-oriented syntax for grepping, e.g. http://xmlstar.sourceforge.net/doc/UG/ch04s07.html

One point in favor of using single classes, as you suggest, is that XPath expressions don't support multiple classes easily. I don't know how big a deal that is, but it's good to keep in mind and explore (there's a small illustration below). On the other hand, XHTML documents do support multiple classes per element, and people commonly use them, so not permitting conventions like that is likely to confuse people. Then again, if we have a strict lint tool to warn people, such problems will be easier to avoid. Let's experiment until we find something palatable :p
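To illustrate the XPath point, here is a sketch assuming Python's lxml (the sample markup is made up). Matching one class among several takes a well-known contortion in XPath 1.0:

```python
# Sketch: the XPath contortion needed to match one class among many.
# Assumes the lxml library; the sample markup is hypothetical.
from lxml import etree

doc = etree.XML('<p><code class="def proc">vector-map</code></p>')

# Naive: matches only when the attribute is exactly "proc".
exact = doc.xpath('//*[@class="proc"]')

# Robust: pad the class list with spaces so "proc" matches anywhere.
robust = doc.xpath(
    '//*[contains(concat(" ", normalize-space(@class), " "), " proc ")]')

print(len(exact), len(robust))  # => 0 1
```

With single classes the naive query would suffice, which is Ciprian's point; with multiple classes every query needs the padded form.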
> in my view the editor takes the text and more "actively" formats it,
> just like the editor of a book or magazine would.

Arthur would be in the best position to estimate whether that workload is reasonable. Having a very involved SRFI editor would obviously produce the best documents, but volunteer work can also be pretty draining and thankless at times, and most people need breaks and lighter periods now and then to stay motivated.

> > Would you and Lassi be willing to prepare a proposal together and
> > submit it to <srfi-discuss> for public discussion?
>
> At the moment I think some more "brainstorming" would be better.

I agree with Ciprian -- it would probably be best to brainstorm and iterate on some prototype scripts and (X)HTML structures until we find something that works fairly well. There are too many unknowns to plan ahead in detail. But once we have something functional, a formal proposal would definitely be in order :) Should we have more beta testers, though? At least Per Bothner expressed some measure of interest in one of last year's threads.

> BTW, the IETF has just gone through a similar approach by moving
> RFCs from plain text to an XML-based format:
> https://xml2rfc.tools.ietf.org/
>
> Perhaps we could just reuse their tools? (They are / will be better
> maintained than anything we can come up with.)

That's great news :) Here's one example of their XML format: <https://tools.ietf.org/tools/templates/davies-template-bare-06.xml>

Personally, I'd rather start with XHTML and add/change only what we need. XHTML is familiar and has short tag names for everything. These custom XML documentation formats tend to be quite verbose and tailor-made for a specific type of document (I've long avoided DocBook for the same reason -- it just seems over-engineered to my taste).

In my experience, the quality of tools is strongly correlated with the simplicity of the format. People love to write lots of tools for simple formats because it's immediately rewarding. On the other hand, complex things tend to have poor tooling even with lots of industry backing. So IMHO the priority would be to make the format simple and familiar (hence similar to HTML). I'm not saying the IETF RFC XML is too complex, but it seems quite verbose and divergent from HTML.

I would suggest roughly the following design approach:

* Start with some version of XHTML
* Use only the XHTML tags we need, in a rigid structure
* Add class attributes to signify everything else
* Specify a standard CSS stylesheet using those tags and classes

Lassi
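P.S. To make the "rigid structure plus lint tool" idea concrete, here is a toy sketch. The tag and class whitelists are placeholders I made up purely to show the shape of such a tool, not a proposal for the actual sets:

```python
# P.S. sketch: a toy lint pass enforcing a rigid tag/class whitelist.
# The whitelists below are hypothetical placeholders.
import sys
import xml.etree.ElementTree as ET

ALLOWED_TAGS = {"html", "head", "title", "body", "h1", "h2", "p",
                "pre", "code", "a", "ul", "ol", "li", "dl", "dt", "dd"}
ALLOWED_CLASSES = {"def", "proc", "syntax", "example"}

def lint(path):
    for elem in ET.parse(path).iter():
        tag = elem.tag.split("}")[-1]  # drop any XML namespace prefix
        if tag not in ALLOWED_TAGS:
            print(f"{path}: disallowed tag <{tag}>")
        for cls in elem.get("class", "").split():
            if cls not in ALLOWED_CLASSES:
                print(f"{path}: disallowed class '{cls}'")

if __name__ == "__main__":
    for path in sys.argv[1:]:
        lint(path)
```

Something this small, run in CI over the repository, would catch stray tags and misspelled classes long before they confuse any index-generating tools.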