Re: Proposal to add HTML class attributes to SRFIs to aid machine-parsing Lassi Kortela (06 Mar 2019 14:22 UTC)

 > My proposal is to keep things simple:
 > * for indexing just using `<a class="proc-def">make-array</a>` is enough;

 > * for actual signatures I think an S-expression based description is
 > better (however see the other paragraph where I note that perhaps this
 > is too much for SRFI's);  for example:
 >      (make-vector
 >          (type constructor)
 >          (export scheme:base)
 >          (signature
 >              ((range-length-zero) -> vector-empty)
 >              ((range-length-zero any) -> vector-empty)
 >              ((range-length-not-zero) -> vector-not-empty)
 >              ((range-length-not-zero any) -> vector-not-empty))
 >          ...

This is obviously far superior to any HTML-based approach, but would
have to be maintained in a separate file from the SRFI HTML.
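
To make the separate-file option concrete, here is a minimal sketch of how a tool might read such a metadata entry. It is written in Python only for illustration. The sample entry is a hypothetical, shortened version of the quoted one (the original is elided with "..."), and the names `parse_sexpr` and `SAMPLE_ENTRY` are assumptions, not part of any proposal.

```python
import re

# Parentheses are tokens on their own; anything else up to whitespace
# or a parenthesis is an atom. Strings and comments are not handled.
TOKEN = re.compile(r"[()]|[^\s()]+")

# Hypothetical, shortened metadata entry modelled on the quoted example.
SAMPLE_ENTRY = """
(make-vector
    (type constructor)
    (export scheme:base)
    (signature
        ((range-length-zero) -> vector-empty)
        ((range-length-not-zero any) -> vector-not-empty)))
"""


def parse_sexpr(text):
    """Parse one S-expression into nested Python lists of strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while True:
                if pos >= len(tokens):
                    raise ValueError("missing closing parenthesis")
                if tokens[pos] == ")":
                    pos += 1
                    return items
                items.append(read())
        if tok == ")":
            raise ValueError("unexpected closing parenthesis")
        return tok

    result = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return result


entry = parse_sexpr(SAMPLE_ENTRY)
# The first element is the procedure name; the rest are (key value...) fields.
fields = {field[0]: field[1:] for field in entry[1:]}
```

Reading the data is short, which is the appeal of this route; the cost is the second file that has to be kept in sync with the HTML.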

I think our main point of disagreement is how much worse it is to have
a separate file. I'm willing to live with clumsy HTML markup if we can
have only one file per SRFI. You're willing to live with two files per
SRFI if we can have great metadata syntax (S-expr or JSON). Correct?

Could we get everyone's opinion on this issue, as it may be the
biggest obstacle to forming a consensus? Who would rather have
somewhat clumsier markup but require only a single HTML file, and who
would rather have separate HTML and metadata files if it means the
HTML markup can be somewhat cleaner?

For the purposes of this vote, metadata would include at least
argument lists and one-line descriptions of all the procedures defined
in the SRFI. If we are only interested in procedure names, those can
easily be marked up in any number of unintrusive ways, so there
wouldn't be much controversy. It's the nested data that brings the

 > It is so complex:
 > * in order to identify that `make-array` is actually a procedure
 > definition, we have to look for an element that has the class
 > `h-proc-def`, which should contain (somewhere not necessarily in
 > direct children) an element with the class `p-name` whose attribute is
 > the actual "name" of the procedure;  just trying to think about
 > expressing this in code, especially with XML libraries or XSLT scares
 > me...  (for a second try just try to imagine how the code to extract
 > arguments looks like;)
 > * it provides too much overhead:  it has too much duplication, the
 > `make-array` token appears twice;
 > * it fails to capture all signature elements:  what is the output of
 > the procedure?  what are the types of various arguments?
 > When designing this format think about how one could use `pup` / `jq`
 > to extract the data.

I think we all agree HTML classes are not an ideal way to represent
information. But from my point of view the other options are even
worse :) That's why I've been advocating classes. If I've understood
correctly, Arthur and Per have a similar viewpoint.

For example, I imagine most schemers would find S-expressions superior
to XML in general. But we'd have to keep them in a separate file or
parse them from the bodies of HTML tags. If the SRFIs were in Scribe
format then clean S-expressions might be a no-brainer, but since HTML
has already been established, they would add complexity.

Likewise, having a separate sub-element for the name of a definition
is not ideal, but I think all other approaches are even less ideal.

 > I've tried hard to think about this problem (when I did my R7RS
 > documentation conversion) and came to the conclusion that one can't
 > expect to extract accurate information from "text" documents without
 > making a mess out of them.

That's probably true. Even the most compact HTML tags are verbose
compared to ordinary uses of S-expressions or JSON. It's hard to do
anything at all without resorting to <span class="x">y</span> or
similar. Since it's almost impossible to represent any kind of nested
data in an HTML attribute value without creating more problems than it
solves, any layer of nesting requires new sub-tags. This is probably
just something that has to be accepted if the HTML route is chosen.

But basically, you already have to write the procedure names and
argument names in the HTML, and they are often written in a special
font. Hence there is already a tag around them. If the markup is e.g.:

     [ <var>setter</var> ]

Then it's not a big step to add classes:

   <div class="proc def">
     <b class="name">make-array</b>
     <var class="arg">interval</var>
     <var class="arg">getter</var>
     [ <var class="opt arg">setter</var> ]
   </div>

Or the equivalent microformat classes.
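
As a rough indication of what extraction from such markup could look like, here is a minimal sketch using Python's standard `html.parser`. The class names follow the example above; the parser class `ProcDefParser`, the helper `extract_procs` and the output layout are assumptions made for illustration, and the sample adds a closing `</div>` so the fragment is complete.

```python
from html.parser import HTMLParser

# Sample copied from the class-based markup above, with a closing </div> added.
SAMPLE = """
<div class="proc def">
  <b class="name">make-array</b>
  <var class="arg">interval</var>
  <var class="arg">getter</var>
  [ <var class="opt arg">setter</var> ]
</div>
"""


class ProcDefParser(HTMLParser):
    """Collect procedure names and argument lists from class-annotated HTML."""

    def __init__(self):
        super().__init__()
        self.procs = []      # finished definitions
        self.current = None  # definition being built, if inside a "proc def" div
        self.field = None    # (kind, optional) while inside a name/arg tag
        self.depth = 0       # div nesting depth inside the current definition

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            if self.current is not None:
                self.depth += 1
            elif "proc" in classes and "def" in classes:
                self.current = {"name": None, "args": []}
                self.depth = 1
            return
        if self.current is None:
            return
        if "name" in classes:
            self.field = ("name", False)
        elif "arg" in classes:
            self.field = ("arg", "opt" in classes)

    def handle_data(self, data):
        text = data.strip()
        if self.current is None or self.field is None or not text:
            return
        kind, optional = self.field
        if kind == "name":
            self.current["name"] = text
        else:
            self.current["args"].append({"name": text, "optional": optional})

    def handle_endtag(self, tag):
        if self.current is None:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.procs.append(self.current)
                self.current = None
        else:
            self.field = None


def extract_procs(html):
    parser = ProcDefParser()
    parser.feed(html)
    parser.close()
    return parser.procs


procs = extract_procs(SAMPLE)
```

The sketch relies on the name and arguments being nested somewhere inside the definition element, not necessarily as direct children, and it ignores the literal brackets because they sit outside any classed tag.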

I don't think there is any way to avoid verbosity because no matter
what HTML-based approach we choose, we have to write <foo class="bar">
all over the place. Personally, I would argue that HTML is already
quite verbose even without any metadata. Hence elegant HTML is
something of a lost cause to begin with.

We could also use 'id' and HTML5 'data-*' attributes, e.g.

   <div class="proc def" id="make-array">
   <div class="proc def" data-name="make-array">

but that's not necessarily simpler or more compatible. The procedure
name has to be visible in the SRFI anyway, often with special styling.
So we might as well re-use the visible text for the metadata.

Also, the HTML spec says that id attributes have to be unique in the
entire document. You can't have the same id on different elements even
if those elements have different tags and classes. So I wouldn't use
'id' for any of our metadata.
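
To illustrate the uniqueness problem, here is a small sketch that reports id values used more than once in a page. The sample fragment is hypothetical (a section heading and a definition that both claim the id `make-array`), and the names `IdCollector` and `duplicate_ids` are made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical fragment: a heading anchor and a definition share one id.
SAMPLE = """
<h2 id="make-array">make-array</h2>
<div class="proc def" id="make-array">
  <b class="name">make-array</b>
</div>
<div class="proc def" id="array-ref">
  <b class="name">array-ref</b>
</div>
"""


class IdCollector(HTMLParser):
    """Count how often each id attribute value occurs."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key == "id" and value:
                self.ids[value] += 1


def duplicate_ids(html):
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return sorted(name for name, count in collector.ids.items() if count > 1)


duplicates = duplicate_ids(SAMPLE)
```

An SRFI that already uses ids for section anchors would run into this as soon as a procedure shares its name with an anchor.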

I also think invisible tags are unavoidable if we add significant
metadata into the old SRFIs. That holds even with microformats (there
was a mistake in the microformat example I posted: the
<span>optional</span> should have been hidden). For new SRFIs
conforming to a rigid HTML structure they can hopefully be avoided.

Invisible tags can maybe be avoided with creative uses of
data/id/rel/rev attributes, but that probably creates even more
problems. Many tags can be made visible, but attributes are always
invisible.

 > However this is perhaps too-much for the SRFI use-case.  Instead I
 > think just having a few "markers" to allow indexing /
 > back-referencing, then a simplified / standard structure (sections,
 > paragraphs, lists, code snippets, etc.) is enough.  Based on this one
 > can take the (X)HTML and "render" it as CommonMark / other formats to
 > be included in his own documentation.

We may also disagree on what constitutes "few" or "many" markers :)

 > (Without upsetting anyone) I really think that this is the best
 > example on how to fail at this endeavor.

No problem :) I'm not sensitive to criticism.