

Re: Proposal to add HTML class attributes to SRFIs to aid machine-parsing Lassi Kortela 05 Mar 2019 22:14 UTC

 > As long as the text of the SRFIs doesn't change, the APIs won't
 > change, so we can make the changes without bothering the authors.

This is the biggest relief to us :) If we are permitted to change
markup at will, it's just a matter of hashing out the technical
approach (and any approach will be an improvement).

 > I'm okay if what we do breaks some authors' tools, but I'd like us
 > to try to minimize that.

What if we had a script that uses one of those popular tagsoup
libraries (lenient HTML -> strict HTML) and then does post-processing
to generate something close enough to the standardized (X)HTML
structure we want? If the script worked well enough, that could be
low-friction enough for SRFI authors.
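
A minimal sketch of such a script, assuming Python and only its
standard library: the lenient html.parser module stands in for a
dedicated tagsoup library, and the names XhtmlWriter and to_xhtml are
made up for illustration. It closes unclosed tags, quotes attributes
and escapes text, but drops comments and doctypes and knows only one
implied-end-tag rule, so it is nowhere near a complete converter.

```python
from html import escape
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class XhtmlWriter(HTMLParser):
    """Re-emit leniently parsed HTML as well-formed XML (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []  # names of the elements that are currently open

    def _close_through(self, tag):
        # Close open elements, innermost first, up to and including `tag`.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_starttag(self, tag, attrs):
        # Simplification: a <p> or <li> directly inside an open element of
        # the same name closes that element first.
        if tag in ("p", "li") and self.stack and self.stack[-1] == tag:
            self._close_through(tag)
        rendered = "".join(
            ' %s="%s"' % (name, escape(name if value is None else value, quote=True))
            for name, value in dict(attrs).items()  # dict() drops duplicate names
        )
        if tag in VOID:
            self.out.append("<%s%s/>" % (tag, rendered))
        else:
            self.out.append("<%s%s>" % (tag, rendered))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Stray end tags with no matching open element are ignored.
        if tag in self.stack:
            self._close_through(tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def result(self):
        # Close whatever the author left open at the end of the input.
        while self.stack:
            self.out.append("</%s>" % self.stack.pop())
        return "".join(self.out)


def to_xhtml(html_text):
    writer = XhtmlWriter()
    writer.feed(html_text)
    writer.close()  # flush any buffered text
    return writer.result()


if __name__ == "__main__":
    xhtml = to_xhtml("<div><p class=intro>Tom & Jerry<br><p>Second <b>bold</div>")
    print(xhtml)
    ET.fromstring(xhtml)  # raises ParseError if the output is not well-formed
```

Parsing the output with a strict XML parser, as in the last line, is
a cheap way to check that the conversion did its job.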

 > I've settled on and exported the
 > whole thing as JSON and moved on from there...

 > I just checked, and the excellent tool Pandoc does indeed support
 > conversion from HTML to XHTML, so that process could be automated.

Always nice to have more tools at our disposal!

 > Given the current state of available Scheme developers, I doubt more
 > than a few people would actually "review" the output of such tools...
 > Most will just "use" the output as is.
 > On the other side, the actual text of the HTML is reviewed at least by
 > the author and the SRFI index editor.

I agree that it would definitely be better to have visible markup
reviewed by the SRFI author in the final format, but in the meantime
for old SRFIs, the choice is between an index incompletely reviewed by
third parties and no index :)

 > I still think we can use HTML parsers instead of requiring strict
 > XHTML parsing, though.

Like Ciprian, I would be in favor of XHTML eventually because it's
easier to parse (XML libraries are absolutely everywhere and are more
reliable than tagsoup). We may be able to use a tagsoup library to
automatically convert HTML to XHTML.

 > I am all in favor of having more volunteers! Unless the conversion
 > is automated, I would prefer not to do it all myself.

I can empathize with this. I was thinking along the lines of there
being on average one new SRFI per month to format, which ought to take
30-60 minutes if the source HTML is reasonable. And if no-one can make
it in a particular month, there would be no time pressure since that
SRFI can simply be formatted later.

In my mind, the big issue re: volunteers would be the huge backlog of
existing SRFIs. But again there's no time pressure. If it takes 3
years to convert the whole backlog the result will still be useful.

Perhaps it would be best to first settle on the right format for new
SRFIs. Then once we convert the old ones, we'd stand a chance to get
the conversion right once and not have to go back and do a second
round of fixups to them.

 > We would have to agree on the format in which the index was stored,
 > e.g. Ciprian's suggested format, so that more than one tool could
 > make use of that data.

I would prefer to have the SRFI source documents in a standard format,
and then have an API server built upon that foundation. The API could
serve, index and query them in a number of different ways, whatever
people need. The API server would be open source so hardcore users can
also use it locally as a library. But I would strongly favor a "single
point of truth" in the documents themselves instead of a separately
maintained index (indexes auto-generated from the source documents are
fine, in any format). This is based on observing a kind of Murphy's Law
that "Any information that can go out of sync, will go out of sync" :)

 > In my view, if you just want to "index" the existing body of SRFI
 > documents, then adding HTML classes to all the existing documents is
 > more work than taking one of the already "crawled" lists (by you, by
 > me, or by others), and just double-checking them to make sure they
 > are "complete" and "correct".
 > Afterwards, starting from this index one could write a small program
 > that goes back to the SRFI, and based on some simple heuristics
 > manages to back-reference the place where these definitions are
 > "defined".

Not sure about this. I tried marking up a couple SRFIs with classes
and it was quite swift and easy work once I got the hang of it. Once
it becomes second nature, double-checking will be almost the same act
as just writing the extra tags/classes. (When I was writing the tags,
I continually ran a script to regenerate the index, so I was looking
at the source and the index side by side.) Particularly if the results
go right into the source documents, so one doesn't have to worry about
which version of which index one is using (just auto-generate index
from source).
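
As a rough illustration of "auto-generate index from source", here is
a sketch in Python (standard library only). It is not the script
mentioned above. The class name def-proc comes from this thread;
def-syntax, def-var, the function name build_index and the sample
document are assumptions made up for the example. It also assumes the
source is already well-formed XHTML without named entities such as
&nbsp;, which a plain XML parser would reject.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical mapping from class name to the kind of definition it marks.
DEF_CLASSES = {
    "def-proc": "procedure",
    "def-syntax": "syntax",
    "def-var": "variable",
}


def build_index(xhtml_text):
    """Collect every element whose class attribute marks a definition."""
    root = ET.fromstring(xhtml_text)
    index = []
    for element in root.iter():
        kind = DEF_CLASSES.get(element.get("class", ""))
        if kind is not None:
            # The defined name is the element's text, nested tags included.
            name = "".join(element.itertext()).strip()
            index.append({"name": name, "kind": kind})
    return index


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p><code class="def-proc">vector-map</code> applies a procedure.</p>
<p><code class="def-syntax">let-values</code> binds values.</p>
<p><code>not-indexed</code> carries no class.</p>
</body></html>"""

if __name__ == "__main__":
    print(json.dumps(build_index(SAMPLE), indent=2))
```

Because the index is derived from the document on every run, there is
no second copy to keep in sync.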

This may just be some kind of bias in my personality and experiences
but I strongly feel that things are simpler if there is one officially
blessed place where we have the ground truth. I always get confused
managing different versions of things in different places. I can't
make an objective case for why I prefer this, apart from having many
experiences where it has proven simpler.

 > Namely when they are "ready-for-being-published", we just export
 > them into HTML and annotate them ourselves.
 > And this is also why the XHTML approach won't be too cumbersome. We
 > can just "translate" even the HTML ones in XHTML making sure they
 > are conformant.

 > My suggestion for XHTML is purely pragmatical: XHTML is XML, then
 > one can just use any XML library to parse the document.
 > Now I know that it "seems" that there are many HTML parsers out
 > there, unfortunately this is not true... There are a few, at least
 > for the most popular programming languages, however they are
 > "bloated" and full of issues...

I fully agree with these points. In particular, it'd be very nice if
the final format doesn't require a tagsoup parser library, but can use
a strict XML parser instead. IMHO it's fine if we use tagsoup to do
the initial conversion from the author's HTML to the final format, but
once an SRFI is in the final format, it would be nice to let tool
writers use simple and reliable parsers.

 > Thus I think that there are two different problems:
 > * "new" SRFI documents that have to be "structured", "annotated" and
 > * "old" SRFI documents that can be just "back-referenced" starting
 > from an existing "index";
 > * (and a third problem) "transforming" the old SRFI documents so that
 > they are in-line with the newly proposed format;

This is a good delineation :)

 > However thinking from a practical point-of-view, especially given
 > how "rudimentary" some XML/HTML parsing libraries are for various
 > languages, I think it's easier to search for just
 > `^(.*[ ]+)?def-proc([ ]+.*)$` than trying to imagine in how many ways
 > one can combine the classes `def` and `proc`.
 > A good exercise for this "annotation" experiment I think would be:
 > * take the whole HTML file, and replace newlines with spaces;
 >   `tr '\n' ' '`;
 > * write a `grep`-based line to extract, say all procedure names;

I would think it best not to worry about grep and regexps. In
principle it's nice to be able to grep anything, but those
line-oriented tools are so hopelessly ill-suited for dealing with
XML's nesting, quoting, escaping and variations in whitespace that
it's almost impossible to write anything reliable using them. I've
done that every once in a while for over a decade out of laziness, and
they always fail even with a simple corpus of text. XML/HTML tree
walking libraries are so widespread and easy to use now that I would
just start from the assumption that tools can access them.

There are also some tools that convert XML to a line-oriented syntax
for grepping, e.g.

One point in favor of using single classes, as you suggest, is that
XPath expressions don't support multiple classes easily. I don't know
how big a deal that is but it's good to keep in mind and explore. On
the other hand, I would concede that XHTML documents support multiple
classes per element, and people commonly use them, so forbidding that
convention is likely to confuse people. Then again, if we have a
strict lint tool to warn people about it, problems will be easier to
avoid. Let's experiment until we find something palatable :p
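
To make the trade-off concrete, here is a small Python sketch using
the limited XPath subset in the standard xml.etree.ElementTree
module. The sample markup and the function names are invented for
the example. A single class can be matched with a plain attribute
test, whereas separate `def` and `proc` classes need token-wise
matching, because order and spacing inside the attribute can vary.
(In full XPath 1.0 the usual workaround is
contains(concat(' ', normalize-space(@class), ' '), ' def '), which
ElementTree does not support.)

```python
import xml.etree.ElementTree as ET

# Three ways an author might mark up a procedure definition.
DOC = ET.fromstring(
    "<div>"
    '<code class="def-proc">car</code>'
    '<code class="def proc">cdr</code>'
    '<code class="proc  def">cons</code>'
    "</div>"
)


def find_exact(root, class_value):
    """Match elements whose class attribute equals class_value exactly."""
    return [e.text for e in root.findall(".//*[@class='%s']" % class_value)]


def find_tokens(root, *wanted):
    """Match elements whose class attribute contains all wanted tokens."""
    wanted = set(wanted)
    return [
        e.text
        for e in root.iter()
        if wanted <= set(e.get("class", "").split())
    ]


if __name__ == "__main__":
    print(find_exact(DOC, "def-proc"))      # single class: one simple query
    print(find_exact(DOC, "def proc"))      # exact match misses "proc  def"
    print(find_tokens(DOC, "def", "proc"))  # token match finds both spellings
```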

 > in my view the editor takes the text and more "actively" formats it,
 > just like the editor of a book or magazine would.

Arthur would be in the best position to estimate whether the workload
is reasonable. Having a very involved SRFI editor would obviously
produce the best documents, but volunteer work can also be pretty
draining and thankless at times, and most people need breaks and
lighter periods now and then to keep motivated.

 > > Would you and Lassi be willing to prepare a proposal together and
 > > submit it to <srfi-discuss> for public discussion?
 > At the moment I think some more "brainstorming" would be better.

I agree with Ciprian -- it would probably be best to brainstorm and
iterate on some prototype scripts and (X)HTML structures until we find
something that works fairly well. There are too many unknowns to plan
ahead in detail. But once we have something functional then a formal
proposal would definitely be in order :)

Should we have more beta testers though? At least Per Bothner
expressed some measure of interest in one of last year's threads.

 > BTW, the IETF has just gone through a similar approach by moving
 > RFC's from plain text to an XML-based format:
 > Perhaps we could just reuse their tools? (They are / will be better
 > maintained than anything we can come up with.)

That's great news :)

Here's one example of their XML format:

Personally, I'd rather start with XHTML and add/change only what we
need. XHTML is familiar and has short tag names for everything. These
custom XML documentation formats tend to be quite verbose and
tailor-made for a specific type of document (I've long avoided DocBook
for the same reason -- just seems over-engineered to my taste).

In my experience, quality of tools is strongly correlated with the
simplicity of the format. People love to write lots of tools for
simple formats because it's immediately rewarding. On the other hand,
complex things tend to have poor tooling even with lots of industry
backing. So IMHO the priority would be to make the format simple and
familiar (hence similar to HTML). Not saying the IETF RFC XML is too
complex but it seems quite verbose and divergent from HTML.

I would suggest roughly the following design approach:

* Start with some version of XHTML
* Use only the XHTML tags we need in a rigid structure
* Add class attributes to signify everything else
* Specify a standard CSS stylesheet using those tags and classes
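
The "rigid structure" part of this approach could be enforced by a
small lint script. The following Python sketch (standard library
only) is one way it might look. The allowed tag set, the allowed
class names and the function name lint are all placeholders, since
none of them have been decided in this thread.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Placeholder whitelists; the real ones would come out of the discussion.
ALLOWED_TAGS = {"html", "head", "title", "body", "h1", "h2", "p",
                "ul", "li", "code", "pre", "a"}
ALLOWED_CLASSES = {"def-proc", "def-syntax"}


def lint(xhtml_text):
    """Return a list of human-readable problems; an empty list means clean."""
    try:
        root = ET.fromstring(xhtml_text)
    except ET.ParseError as error:
        return ["not well-formed XML: %s" % error]

    problems = []
    for element in root.iter():
        tag = element.tag
        if tag.startswith(XHTML_NS):
            tag = tag[len(XHTML_NS):]  # strip the namespace prefix
        if tag not in ALLOWED_TAGS:
            problems.append("unexpected tag <%s>" % tag)
        classes = element.get("class", "").split()
        if len(classes) > 1:
            problems.append(
                "<%s> has more than one class: %s" % (tag, " ".join(classes))
            )
        for cls in classes:
            if cls not in ALLOWED_CLASSES:
                problems.append("<%s> uses unknown class %r" % (tag, cls))
    return problems


if __name__ == "__main__":
    for problem in lint('<html><body><table/><code class="def proc">x</code>'
                        "</body></html>"):
        print(problem)
```

Running such a check before a SRFI is accepted in the final format
would catch stray tags and multi-class attributes early.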