Hello,

This is a proposal to gradually add a standardized set of HTML class attributes to the SRFI source documents. The classes would encode metadata that can be used to index all symbols defined in SRFIs. General information about the SRFI (date, author, abstract, status, license, etc.) could also be encoded.

I made a simple SRFI symbol index (<https://schemedoc.herokuapp.com/>) by screen-scraping the source HTML of all SRFI documents. I wasn't aware that Ciprian Craciun had already done similar work earlier (<https://github.com/cipriancraciun/scheme-srfi-index>).

These indexes are useful but contain errors and omissions because scrapers have to guess where the definitions are in the SRFI. Scraping could be made perfectly reliable by adding HTML class attributes to the SRFI source HTML.

I emailed Arthur Gleckler about this and his initial response was enthusiastic but he suggested that we have a wider discussion on this mailing list. He also pointed me to earlier threads started by Ciprian Craciun on this list in 2018:

* "Is there an index of symbols defined by the various SRFI's?"
  <https://srfi-email.schemers.org/srfi-discuss/msg/8163553>

* "Describing Scheme libraries (and thus SRFI's and R7RS) in a "machine readable" format (and rendering in various formats)"
  <https://srfi-email.schemers.org/srfi-discuss/msg/8932119>

The approach described here would be complementary. Ciprian has been working on an S-expression-based layout for the metadata: the S-expressions could be generated automatically from the HTML markup proposed here. In fact, Arthur Gleckler and Per Bothner already hinted at an HTML-based approach in the earlier thread.

Why use HTML class attributes instead of another approach?

Because they are versatile:

* The overall HTML structure of the existing SRFI documents is somewhat variable. HTML tags are used differently in different SRFIs. If we mandated particular HTML tags for particular metadata (e.g.
always using the <code> tag to mark procedure names) that would require somewhat disruptive changes to existing SRFIs. In contrast, HTML allows any set of classes on any tag. This means each SRFI author can keep using the tags they are accustomed to, simply adding the right classes to them so parsers can extract metadata.

* Some SRFI authors define their own classes. If we pick class names that nobody is using yet, the new class names can co-exist peacefully with the existing custom classes. Custom and standard classes can even co-exist on the same element.

* HTML has the <div> block-level tag and the <span> inline tag that can be used to add classes to text without disrupting its visual appearance. Furthermore, <span style="display: none"> can be used to add completely invisible text only for machines to read.

* Since the classes are invisible, they are easy to add gradually. There is no harm done by missing classes: parsers will simply do without that metadata. For example, if a SRFI doesn't have the classes in its procedure definitions, the parser will simply not index the procedures in that SRFI. Once the classes are added, the procedures will be added to the index. This means we could add metadata on a very flexible schedule without disrupting users.

* The classes are easy to design such that one can start by adding only some of them. E.g. if we have classes to specify the name and arguments of each defined procedure, a busy SRFI author could simply add the classes for the procedure names and leave out the classes for the arguments. A volunteer could then go in later and add the markup for the arguments. Even while the argument markup is missing, users would still get the benefit of indexing the procedure names, which is a lot better than nothing. So every part of the process is done on an opt-in basis rather than opt-out.

* HTML classes double as CSS classes so they can be used for styling if an author finds that convenient.
But styling classes can also be completely separate from these metadata classes.

And reliable:

* Putting the metadata right in the SRFI source HTML instead of maintaining separate metadata files would confer the advantage of having a "single point of truth" for things related to the SRFI. Separate metadata can more easily get out of sync, particularly if several versions of it are floating around the net. It's more likely to be in sync if it's auto-generated from the source document.

* HTML tags are readily arranged into a tree. A tree structure is excellent for metadata. We can e.g. have a "procedure definition" tag with sub-tags for the name and arguments of the procedure.

* Many popular programming languages have HTML/XML parsing libraries that can be used to query HTML nodes based on class relationships. They have operations like "find all subnodes of node N with class C". By leveraging these popular tools we can easily write reliable parsers. I wrote a parser as a proof of concept using Python and BeautifulSoup. It was remarkably easy to write clean code - none of the usual regexp fare and heuristics were necessary. A JavaScript implementation in a web browser might even be able to parse out metadata classes directly from the HTML DOM with no extra library.

* Since classes are completely custom with no necessary relation to HTML structure, they are immune to long-term changes in the HTML spec, or conversions to other formats. (The classes can be preserved when converting to any format that has a similar mechanism to attach arbitrary metadata to parse-tree nodes.)

* Alternatively we could embed HTML comments in the SRFIs and parse them, but I would argue that parsing comments is a classic pitfall best avoided where possible. We could also use a separate XML namespace for metadata but since we're currently using HTML instead of XHTML it will be easier to go with attributes on the HTML tags themselves.
Even if we somehow managed to mandate XHTML, the multi-namespace solution would still end up more complex than using HTML classes.

I played around with a class-based representation for about an hour. The results are here:

<https://github.com/lassik/srfi-markup-proposal>

In particular, this Git commit shows some differences between SRFIs in their published form and the ones with the new metadata classes:

<https://github.com/lassik/srfi-markup-proposal/commit/b005195b47fca1c7e459e01e76a4f2b3f9e04b34>

The file at the top of the commit (scrape-output.text) shows what metadata the scraper script was able to find about the SRFIs. I only marked up some of the procedures in each SRFI, not all of them, since I wanted to get feedback before putting in more work.

It wasn't trivial to find an intuitive set of classes, but I currently have these:

* def -- this tag contains metadata about a definition
* def proc -- it's a procedure definition
* name -- (inside def) its text is the symbol being defined
* blurb -- (inside def) its text is a one-line description
* arg -- (inside def proc) its text is the name of a procedure argument
* arg opt -- (inside def proc) this is an optional argument
* arg rest -- (inside def proc) this is a rest parameter (...)
* ret -- (inside def proc) its text is the name of a return value

What's elegant about the HTML class system is that it's based on "composition, not inheritance". So you can freely combine classes in ways that feel natural. For example, I made a "hidden" class as a shorthand for the CSS "display: none" that makes things invisible. So the visible name of a definition is:

    <span class="name">cons</span>

But if the name needs to be hidden from view due to markup technicalities, we can do:

    <span class="hidden name">cons</span>

Similarly we can use the "def" class to find all definitions in the SRFI, but "def proc" to find only procedure definitions, or "def var" to find only variable definitions, etc.

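To make the class list above concrete, here is a minimal sketch of how a scraper might read these classes. It is not the proof-of-concept parser mentioned earlier (that one uses BeautifulSoup); this version uses only Python's standard html.parser so it runs without extra libraries. The sample markup, the blurb text, the "required" label and the names DefinitionScraper and scrape are all made up for illustration, and nested field elements are not handled.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Hypothetical markup using the proposed classes (not copied from any SRFI).
SAMPLE = """
<div class="def proc">
  <p><code>(<span class="name">iota</span>
     <span class="arg">count</span>
     [<span class="arg opt">start</span> <span class="arg opt">step</span>])
     -&gt; <span class="ret">list</span></code></p>
  <p class="blurb">Return a list of <var>count</var> numbers.</p>
</div>
<p class="def proc"><span class="hidden name">cons</span></p>
"""


class DefinitionScraper(HTMLParser):
    """Collect definitions marked up with the proposed classes."""

    def __init__(self):
        super().__init__()
        self.definitions = []  # finished definitions, in document order
        self._stack = []       # (tag, role) for every open element
        self._current = None   # the "def" being collected, if any
        self._text = None      # text parts of the field being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        role = None
        if self._current is None:
            if "def" in classes:
                kind = next((k for k in ("proc", "var") if k in classes), "other")
                self._current = {"kind": kind, "name": None, "blurb": None,
                                 "args": [], "rets": []}
                role = "def"
        elif self._text is None:
            # Class order and extra classes do not matter: "hidden name" works.
            for field in ("name", "blurb", "arg", "ret"):
                if field in classes:
                    self._text = []
                    role = (field, classes)
                    break
        self._stack.append((tag, role))

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or all(t != tag for t, _ in self._stack):
            return
        # Pop until the matching open tag, which tolerates an unclosed <p>.
        while self._stack:
            open_tag, role = self._stack.pop()
            self._close(role)
            if open_tag == tag:
                break

    def _close(self, role):
        if role == "def":
            self.definitions.append(self._current)
            self._current = None
        elif role is not None:
            field, classes = role
            text = " ".join("".join(self._text).split())  # collapse whitespace
            self._text = None
            if field == "arg":
                mode = ("opt" if "opt" in classes
                        else "rest" if "rest" in classes else "required")
                self._current["args"].append((text, mode))
            elif field == "ret":
                self._current["rets"].append(text)
            else:
                self._current[field] = text


def scrape(html):
    """Return the list of definitions found in an HTML string."""
    scraper = DefinitionScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.definitions


if __name__ == "__main__":
    for d in scrape(SAMPLE):
        print(d["kind"], d["name"], d["args"], d["rets"], d["blurb"])
```

Note that a definition with no arg or blurb markup still yields its name, which is the opt-in behavior described above: missing classes just mean less metadata, not a parse failure.
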
I personally think HTML classes are quite elegant for a hack like this :-p The HTML is necessarily more verbose than the published SRFIs, but in my very biased opinion the tradeoff is worth it.

What do you think?

Kind regards
Lassi Kortela