
Proposal to add HTML class attributes to SRFIs to aid machine-parsing Lassi Kortela (05 Mar 2019 14:01 UTC)
Re: Proposal to add HTML class attributes to SRFIs to aid machine-parsing Marc Nieper-Wißkirchen (06 Mar 2019 10:12 UTC)

Proposal to add HTML class attributes to SRFIs to aid machine-parsing Lassi Kortela 05 Mar 2019 14:00 UTC

Hello,

This is a proposal to gradually add a standardized set of HTML class
attributes to the SRFI source documents. The classes would encode
metadata that can be used to index all symbols defined in SRFIs.
General information about the SRFI (date, author, abstract, status,
license, etc.) could also be encoded.

I made a simple SRFI symbol index (<https://schemedoc.herokuapp.com/>)
by screen-scraping the source HTML of all SRFI documents. I wasn't
aware that Ciprian Craciun had already done similar work earlier
(<https://github.com/cipriancraciun/scheme-srfi-index>). These indexes
are useful but contain errors and omissions because scrapers have to
guess where the definitions are in the SRFI.

Scraping could be made perfectly reliable by adding HTML class
attributes to the SRFI source HTML. I emailed Arthur Gleckler about
this and his initial response was enthusiastic but he suggested that
we have a wider discussion on this mailing list. He also pointed me to
earlier threads started by Ciprian Craciun on this list in 2018:

* "Is there an index of symbols defined by the various SRFI's?"
   <https://srfi-email.schemers.org/srfi-discuss/msg/8163553>

* "Describing Scheme libraries (and thus SRFI's and R7RS) in a
   "machine readable" format (and rendering in various formats)"
   <https://srfi-email.schemers.org/srfi-discuss/msg/8932119>

The approach described here would be complementary. Ciprian has been
working on an S-expression-based layout for the metadata: the
S-expressions could be generated automatically from the HTML markup
proposed here. In fact, Arthur Gleckler and Per Bothner already hinted
at an HTML-based approach in the earlier thread.
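
As a rough illustration of that last step, here is a minimal Python sketch
that renders already-scraped metadata as an S-expression. The layout
(procedure / name / args / returns) and the dict keys are hypothetical,
made up for this example; they are not Ciprian's actual format. Symbols
are written out as-is with no quoting or escaping.

```python
def to_sexp(value):
    """Render nested lists of strings as one S-expression string."""
    if isinstance(value, (list, tuple)):
        return "(" + " ".join(to_sexp(v) for v in value) + ")"
    return str(value)


def procedure_sexp(proc):
    """Build a (hypothetical) S-expression layout for one procedure record."""
    form = ["procedure", ["name", proc["name"]]]
    if proc.get("args"):
        form.append(["args", *proc["args"]])
    if proc.get("rets"):
        form.append(["returns", *proc["rets"]])
    return to_sexp(form)


# Example record, as a scraper might produce it (keys are assumptions).
print(procedure_sexp({"name": "cons", "args": ["a", "d"], "rets": ["pair"]}))
# prints: (procedure (name cons) (args a d) (returns pair))
```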

Why use HTML class attributes instead of another approach?

Because they are versatile:

* The overall HTML structure of the existing SRFI documents is
   somewhat variable. HTML tags are used differently in different
   SRFIs. If we mandated particular HTML tags for particular metadata
   (e.g. always using the <code> tag to mark procedure names) that
   would require somewhat disruptive changes to existing SRFIs. In
   contrast, HTML allows any set of classes on any tag. This means each
   SRFI author can keep using the tags they are accustomed to, simply
   adding the right classes to them so parsers can extract metadata.

* Some SRFI authors define their own classes. If we pick class names
   that nobody is using yet, the new class names can co-exist
   peacefully with the existing custom classes. Custom and standard
   classes can even co-exist on the same element.

* HTML has the <div> block-level tag and the <span> inline tag that
   can be used to add classes to text without disrupting its visual
   appearance. Furthermore, <span style="display: none"> can be used to
   add completely invisible text only for machines to read.

* Since the classes are invisible, they are easy to add gradually.
   There is no harm done by missing classes: parsers will simply do
   without that metadata. For example, if a SRFI doesn't have the
   classes in its procedure definitions, the parser will simply not
   index the procedures in that SRFI. Once the classes are added, the
   procedures will be added to the index. This means we could add
   metadata on a very flexible schedule without disrupting users.

* The classes are easy to design such that one can start by adding
   only some of them. E.g. if we have classes to specify the name and
   arguments of each defined procedure, a busy SRFI author could simply
   add the classes for the procedure names and leave out the classes
   for the arguments. A volunteer could then go in later and add the
   markup for the arguments. Even while the argument markup is missing,
   users would still get the benefit of indexing the procedure names,
   which is a lot better than nothing. So every part of the process is
   done on an opt-in basis rather than opt-out.

* HTML classes double as CSS classes so they can be used for styling
   if an author finds that convenient. But styling classes can also be
   completely separate from these metadata classes.

And reliable:

* Putting the metadata right in the SRFI source HTML instead of
   maintaining separate metadata files would confer the advantage of
   having a "single point of truth" for things related to the SRFI.
   Separate metadata can more easily get out of sync, particularly if
   several versions of it are floating around the net. It's more likely
   to be in sync if it's auto-generated from the source document.

* HTML elements nest naturally into a tree. A tree structure is
   excellent for metadata. We can e.g. have a "procedure definition"
   element with child elements for the name and arguments of the
   procedure.

* Many popular programming languages have HTML/XML parsing libraries
   that can be used to query HTML nodes based on class relationships.
   They have operations like "find all subnodes of node N with class
   C". By leveraging these popular tools we can easily write reliable
   parsers. I wrote a parser as a proof of concept using Python and
   BeautifulSoup. It was remarkably easy to write clean code - none of
   the usual regexp fare and heuristics were necessary. A JavaScript
   implementation in a web browser might even be able to parse out
   metadata classes directly from the HTML DOM with no extra library.

* Since classes are completely custom with no necessary relation to
   HTML structure, they are immune to long-term changes in the HTML
   spec, or conversions to other formats. (The classes can be preserved
   when converting to any format that has a similar mechanism to attach
   arbitrary metadata to parse-tree nodes.)

* Alternatively we could embed HTML comments in the SRFIs and parse
   them, but I would argue that parsing comments is a classic pitfall
   best avoided where possible. We could also use a separate XML
   namespace for metadata but since we're currently using HTML instead
   of XHTML it will be easier to go with attributes on the HTML tags
   themselves. Even if we somehow managed to mandate XHTML, the
   multi-namespace solution would still end up more complex than using
   HTML classes.
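
To make the "find all subnodes of node N with class C" operation concrete,
here is a minimal sketch. It is not the BeautifulSoup proof of concept
mentioned above; it uses only Python's standard-library html.parser, and
the names Node, TreeBuilder, parse and find_all are invented for this
example. It assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not stay "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, classes, parent=None):
        self.tag = tag
        self.classes = classes    # set of class tokens on this element
        self.parent = parent
        self.children = []        # child Nodes and text strings, in order

    def text(self):
        """Concatenate all text below this node."""
        return "".join(c if isinstance(c, str) else c.text()
                       for c in self.children)

    def find_all(self, *wanted):
        """Yield every descendant that carries all of the wanted classes."""
        for child in self.children:
            if isinstance(child, Node):
                if set(wanted) <= child.classes:
                    yield child
                yield from child.find_all(*wanted)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root", set())
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        node = Node(tag, classes, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


doc = parse('<dl><dt class="def proc"><code class="name">cons</code> '
            '<var class="arg">a</var> <var class="arg">d</var></dt></dl>')
for d in doc.find_all("def", "proc"):
    print(next(d.find_all("name")).text(), [a.text() for a in d.find_all("arg")])
# prints: cons ['a', 'd']
```

The query only looks at class tokens, never at tag names, so it does not
matter which tags a given SRFI happens to use.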

I played around with a class-based representation for about an hour.
The results are here: <https://github.com/lassik/srfi-markup-proposal>
In particular, this Git commit shows some differences between SRFIs in
their published form and the ones with the new metadata classes:
<https://github.com/lassik/srfi-markup-proposal/commit/b005195b47fca1c7e459e01e76a4f2b3f9e04b34>

The file at the top of the commit (scrape-output.text) shows what
metadata the scraper script was able to find about the SRFIs. I only
marked up some of the procedures in each SRFI, not all of them, since
I wanted to get feedback before putting in more work.

It wasn't trivial to find an intuitive set of classes, but I currently
have these:

* def -- this tag contains metadata about a definition
* def proc -- it's a procedure definition
* name -- (inside def) its text is the symbol being defined
* blurb -- (inside def) its text is a one-line description
* arg -- (inside def proc) its text is the name of a procedure argument
* arg opt -- (inside def proc) this is an optional argument
* arg rest -- (inside def proc) this is a rest parameter (...)
* ret -- (inside def proc) its text is the name of a return value
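
To show how a scraper might consume these classes, here is a minimal
sketch using Python's standard-library html.parser. It is not the actual
scraper script from the repository. The sample markup, the DefScraper
class, the scrape function and the shape of the result dicts are all
invented for this example, and it assumes well-nested markup with no
nested definitions.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
FIELDS = {"name", "blurb", "arg", "ret"}


class DefScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.defs = []        # finished definitions
        self.depth = 0        # number of currently open elements
        self.current = None   # definition being filled in, if any
        self.def_depth = 0    # depth at which the current "def" was opened
        self.field = None     # (depth, classes, text pieces) being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        self.depth += 1
        if self.current is None:
            if "def" in classes:
                self.current = {"proc": "proc" in classes, "name": None,
                                "blurb": None, "args": [], "rets": []}
                self.def_depth = self.depth
        elif self.field is None and classes & FIELDS:
            self.field = (self.depth, classes, [])

    def handle_data(self, data):
        if self.field:
            self.field[2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        if self.field and self.field[0] == self.depth:
            self._store(self.field[1], "".join(self.field[2]))
            self.field = None
        if self.current is not None and self.depth == self.def_depth:
            self.defs.append(self.current)
            self.current = None
        self.depth -= 1

    def _store(self, classes, text):
        text = " ".join(text.split())   # collapse whitespace
        if "name" in classes:
            self.current["name"] = text
        elif "blurb" in classes:
            self.current["blurb"] = text
        elif "arg" in classes:
            mode = ("opt" if "opt" in classes
                    else "rest" if "rest" in classes else "required")
            self.current["args"].append((text, mode))
        elif "ret" in classes:
            self.current["rets"].append(text)


def scrape(html):
    scraper = DefScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.defs


# Hypothetical markup, written for this example only.
SAMPLE = """
<dl>
  <dt class="def proc">
    (<code class="name">iota</code> <var class="arg">count</var>
     [<var class="arg opt">start</var> <var class="arg opt">step</var>])
    -&gt; <var class="ret">list</var>
    <span class="hidden blurb">Build a list of numbers</span>
  </dt>
  <dd>Prose without classes is ignored.</dd>
</dl>
"""
print(scrape(SAMPLE))
```

A definition that only has the "name" class marked up still gets indexed;
its args and rets lists simply stay empty until someone adds that markup.
The text inside a "hidden" element is picked up like any other text, since
the scraper reads the markup and never evaluates CSS.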

What's elegant about the HTML class system is that it's based on
"composition, not inheritance". So you can freely combine classes in
ways that feel natural. For example, I made a "hidden" class as a
shorthand for the CSS "display: none" that makes things invisible. So
the visible name of a definition is:

     <span class="name">cons</span>

But if the name needs to be hidden from view due to markup
technicalities, we can do:

     <span class="hidden name">cons</span>

Similarly we can use the "def" class to find all definitions in the
SRFI, but "def proc" to find only procedure definitions, or "def var"
to find only variable definitions, etc.
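
The matching rule behind this is plain subset testing on the
whitespace-separated class tokens. A minimal Python sketch follows; the
select function and the sample class strings are made up for this example,
with "my-box" standing in for an author's own custom class.

```python
def select(class_attrs, *wanted):
    """Return the class attribute strings that contain every wanted token.

    Token order and any extra (custom) classes do not matter.
    """
    wanted = set(wanted)
    return [attr for attr in class_attrs if wanted <= set(attr.split())]


# Class attribute values as they might appear on elements in one document.
attrs = ["def proc", "proc def my-box", "def var", "hidden name", "name"]

print(select(attrs, "def"))          # all definitions
print(select(attrs, "def", "proc"))  # only procedure definitions
print(select(attrs, "def", "var"))   # only variable definitions
print(select(attrs, "name"))         # visible and hidden names alike
```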

I personally think HTML classes are quite elegant for a hack like this
:-p The HTML is necessarily more verbose than the published SRFIs, but
in my very biased opinion the tradeoff is worth it. What do you think?

Kind regards
Lassi Kortela