Re: Proposal to add HTML class attributes to SRFIs to aid machine-parsing Ciprian Dorin Craciun 05 Mar 2019 19:37 UTC

[I'm merging the replies to both Lassi and Arthur.]

> @Lassi
> I like everything you suggest here. The problem is the large body of
> existing SRFIs with hand-crafted HTML. I propose to start by adding the
> classes because that requires only a few changes so it's easier to get
> people on board and get it done sooner.
>
> I empathize with your use case because while classes are sufficient for
> indexing, transforming the full text of an SRFI into another format is
> still tricky without standardized use of tags.

> @Arthur
> I'm interested in the idea of adding HTML classes as annotations, too.  It has been proposed before, and the only big drawback is that it will be a lot of work to add the annotations to our 166 prior SRFIs, or even the finalized subset of them.  But the work will yield a way to produce a comprehensive index automatically, and that's worth a lot.

In my view, if the goal is just to "index" the existing body of SRFI
documents, then adding HTML classes to all the existing documents is
more work than taking one of the already "crawled" lists (by you, by
me, or by others) and simply double-checking it for completeness and
correctness.

Afterwards, starting from this index, one could write a small program
that goes back through each SRFI and, based on some simple heuristics,
back-references the place where each definition actually appears.
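As a minimal sketch of such a back-referencing program (the file names, the toy SRFI fragment, and the "first line that mentions the name" heuristic are all my assumptions, not an actual tool):

```shell
# Hypothetical input: names.txt holds one indexed name per line,
# srfi-1.html is a toy stand-in for an SRFI document.
printf 'reduce\nfold\n' > names.txt
printf '<p>Text.</p>\n<code>(fold kons knil lst)</code>\n' > srfi-1.html

# Crude heuristic: report the first line on which each indexed name
# occurs; a real tool would look only inside <code> elements.
while IFS= read -r name; do
  hit=$(grep -n -F -- "$name" srfi-1.html | head -n 1)
  [ -n "$hit" ] && echo "$name -> line ${hit%%:*}"
done < names.txt
# → fold -> line 2
```

Names with no match (here `reduce`) are simply skipped; a real program would flag them for manual review.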

Thus I think that there are actually three different problems:
* "new" SRFI documents, which have to be "structured", "annotated" and "indexed";
* "old" SRFI documents, which can just be "back-referenced" starting
from an existing "index";
* "transforming" the old SRFI documents so that they are in line with
the newly proposed format.

> @Lassi
> > I would try to keep these classes as mutually exclusive as possible...
> > Especially if we want to be able to extract anything useful out of
> > these documents.
>
> > As highlighted above I would use `def-proc` (i.e. one CSS class)
> > instead of `def proc` (i.e. two CSS classes).  (Because what means
> > just `proc` by itself?  Or just `def`?)
>
> I'm not sure I understand what you mean. I thought about this and came
> to the opposite conclusion, that having separate classes makes
> extracting information *easier*, not more difficult :) Because it's
> easier to specify precisely what you want by composing from a small set
> of classes as needed.

From a theoretical point of view, I agree that having `def`,
`proc` and `syntax` provides more expressivity than just `def-proc` and
`def-syntax`.

However, thinking from a practical point of view, especially given how
"rudimentary" the XML/HTML parsing libraries are in various
languages, I think it's easier to search for just `^(.*[
]+)?def-proc([ ]+.*)$` than to anticipate all the ways the classes
`def` and `proc` could be combined.
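To illustrate the point (the markup snippets are invented, not actual SRFI annotations): with a single combined class a fixed-string search works, while with two independent classes a naive search silently misses the other ordering.

```shell
# Two hypothetical markup variants of the same annotation:
echo '<span class="def-proc">fold</span>' > single.html
echo '<span class="proc def">fold</span>' > multi.html

# One combined class: a fixed-string search is enough.
grep -c 'def-proc' single.html                  # prints 1

# Two classes: a naive fixed search misses the other ordering
# ("proc def" vs. "def proc").
grep -c 'class="def proc"' multi.html || true   # prints 0
```

A robust multi-class query would need an order-independent pattern for each class, which is exactly the extra complexity I would rather avoid.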

A good exercise for this "annotation" experiment, I think, would be:
* take the whole HTML file and replace newlines with spaces (`tr '\n' ' '`);
* write a `grep`-based one-liner to extract, say, all procedure names.
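The two steps above can be sketched as follows (the `def-proc` class and the toy HTML fragment are assumptions for the sake of the exercise):

```shell
# A toy SRFI fragment, with one annotation split across lines
# on purpose:
cat > srfi.html <<'EOF'
<span
  class="def-proc">fold</span>
<span class="def-proc">reduce</span>
EOF

# Step 1: join all lines, so tags split across lines don't hide matches.
# Step 2: grep out every annotated element, then strip the tag prefix.
tr '\n' ' ' < srfi.html \
  | grep -o '<span[^>]*class="def-proc"[^>]*>[^<]*' \
  | sed 's/.*>//'
# → fold
# → reduce
```

Note that without step 1, the first annotation would be missed, since the `<span>` and its `class` attribute sit on different lines.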

> @Lassi
> > Regarding the `display: none` I think it is a very bad idea...  If it
> > is not visible, it will not be reviewed, and thus it will bit-rot, and
> > errors will creep into that element.
>
> The invisible tags would get reviewed because tools would use them to
> index the SRFIs. So you would review an SRFI by looking at its index
> entries, then fix any incorrect tags until the index is accurate.

Given how few Scheme developers are available, I doubt more than a few
people would actually "review" the output of such tools...  Most will
just "use" the output as is.

On the other hand, the actual text of the HTML is reviewed at least by
the author and the SRFI editor.

> @Arthur
> One requirement I have for any proposal is that the effort required by an author to comply is minimal.  Part of the reason that the SRFI process has been successful for twenty years is that the editors have kept "friction" low.  Adding a few classes and <span>s shouldn't be much effort.
>
> Note that some authors write their documents in another markup language, then convert them to HTML.  This proposal would require that they update their software to produce the new classes, or edit the generated HTML afterwards.  That shouldn't stop us from trying this proposal, but it's something to keep in mind.

> @Lassi
> Indeed, there are 58 authors so tracking them down would be a task in
> itself! I wonder how many different generator programs have been used?
>
> I think the bottom line is that no matter what metadata format we come
> up with, it will be impossible to generate it from some of the programs
> the authors used. So if we demand a very high rate of compatibility then
> our project is dead in the water.

Indeed, handling these non-HTML-native SRFIs can take the same
approach as the "old" ones: when they are ready for publication, we
just export them to HTML and annotate them ourselves.

And this is also why the XHTML approach won't be too cumbersome: we
can just "translate" even the HTML ones into XHTML, making sure they
are conformant.
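Conformance could then be checked mechanically; for example with `xmllint` from libxml2 (the file below is an illustrative minimal XHTML document, not an actual SRFI):

```shell
# A minimal well-formed XHTML document (illustrative):
cat > srfi.xhtml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>SRFI example</title></head>
  <body><p>Hello.</p></body>
</html>
EOF

# xmllint exits non-zero and points at the offending line if the
# document is not well-formed XML; guard in case it is not installed:
if command -v xmllint >/dev/null 2>&1; then
  xmllint --noout srfi.xhtml && echo 'well-formed'
fi
```

With well-formed XML, any off-the-shelf XML parser (rather than a lenient HTML one) suffices for the extraction tools discussed above.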

For example, I took this approach when I converted the R7RS LaTeX
document into CommonMark: it took me about two days of
find-and-replace with regular expressions, and it was done.
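A single illustrative substitution of that kind (the input snippet is invented; the real conversion involved many more macros and passes than this one rule):

```shell
# One of many find-and-replace passes: \texttt{...} to CommonMark
# code spans.
printf '%s\n' 'Use \texttt{car} and \texttt{cdr}.' \
  | sed -E 's/\\texttt\{([^}]*)\}/`\1`/g'
# → Use `car` and `cdr`.
```

Each LaTeX macro with a regular shape gets one such rule; only the irregular cases need manual editing afterwards.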

Thus my proposal would be the following:
* we come up with some "document format" (regardless, for the moment,
of what exactly it is);
* those authors who want to use this format can do so;
* for those authors who prefer HTML, LaTeX, or something else, we
just take the final version of their SRFI and manually convert it.

Ciprian.