Re: Proposal to add HTML class attributes to SRFIs to aid machine-parsing Marc Nieper-Wißkirchen (06 Mar 2019 10:12 UTC)
Re: Proposal to add HTML class attributes to SRFIs to aid machine-parsing Lassi Kortela (07 Mar 2019 09:10 UTC)

Re: Proposal to add HTML class attributes to SRFIs to aid machine-parsing Lassi Kortela 07 Mar 2019 09:10 UTC

There seems to be a fair bit of opposition to argument list markup.
That's fine.

What about this approach:

- Only surround each definition in the HTML with class="proc def",
   class="syntax def", etc. (or "proc-def", "syntax-def"). No required
   tags or classes for the arguments.

- Write a tool to extract only the text inside those tags (including
   subtags). I.e. remove all markup from the text inside. Then parse
   that text as an S-expression (allowing for typography such as
   ellipsis for rest arguments, square brackets for optional arguments
   and angle brackets for placeholders).

- If an S-expression metadata file doesn't exist for this SRFI, generate
   one based on the things parsed from the HTML. If the metadata file
   exists, verify that it matches what's in the HTML.

I got encouraging results with this little Python script:

     import sys
     from bs4 import BeautifulSoup

     soup = BeautifulSoup(sys.stdin.read(), "html.parser")
     for def_ in soup.select(".def"):
         print(def_.text)

Given SRFI-81 with only added markup like this:

     <span class="proc def">...</span>
     <span class="syntax def">...</span>

It prints output like this:

     (buffer-mode name)
     (buffer-mode? obj)
     (transcoder (codec codec) (eol-style eol-style))
     (update-transcoder old (codec codec) (eol-style eol-style))
     (eol-style lf)
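
A rough sketch of how each printed line might then be read as an
S-expression. This is only an illustration, not a finished tool: the
name parse_signature and the "optional" marker for square-bracket
groups are made up here, and ellipses and angle-bracket placeholders
are simply kept as ordinary symbols.

```python
import re

# A token is a parenthesis, a square bracket, or a run of other
# non-space characters (so "...", "<port>" and "obj" are all symbols).
TOKEN = re.compile(r"[()\[\]]|[^\s()\[\]]+")


def parse_signature(text):
    """Parse one definition such as "(buffer-mode? obj)" into nested lists."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok in ("(", "["):
            close = ")" if tok == "(" else "]"
            items = []
            while True:
                if pos >= len(tokens):
                    raise ValueError("missing " + close)
                if tokens[pos] == close:
                    pos += 1
                    break
                if tokens[pos] in (")", "]"):
                    raise ValueError("mismatched " + tokens[pos])
                items.append(read())
            # Hypothetical convention: tag [ ... ] groups as optional.
            return ["optional", *items] if tok == "[" else items
        if tok in (")", "]"):
            raise ValueError("unexpected " + tok)
        return tok

    form = read()
    if pos != len(tokens):
        raise ValueError("trailing text after definition")
    return form
```

With this, "(transcoder (codec codec) (eol-style eol-style))" comes
back as a list whose first element is the name and whose remaining
elements are the two sub-lists.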

There will probably be some SRFIs where the typography inside the
S-expressions is different, but if we can change that to be consistent
enough across all SRFIs, that ought to be one of the simplest ways to
go about generating an index.
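
The generate-or-verify step from the list above could look roughly
like this. It is a sketch under a big assumption: the metadata format
is not settled, so this pretends the file holds one definition per
line as plain text. The function name check_or_generate and its
return values are made up for illustration.

```python
import os


def check_or_generate(metadata_path, signatures):
    """Write signatures to metadata_path if it is absent, else compare.

    Returns "generated", "match" or "mismatch".
    Assumes a hypothetical one-definition-per-line file format.
    """
    lines = [s.strip() for s in signatures if s.strip()]
    if not os.path.exists(metadata_path):
        # No metadata yet: create it from what was parsed out of the HTML.
        with open(metadata_path, "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")
        return "generated"
    # Metadata exists: check that it agrees with the HTML.
    with open(metadata_path, encoding="utf-8") as f:
        existing = [line.strip() for line in f if line.strip()]
    return "match" if existing == lines else "mismatch"
```

Comparing the text line by line is fragile with respect to whitespace
inside a definition. Comparing parsed S-expressions instead would be
more robust.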

 > Since classes are a major hook for CSS and CSS is the preferred
 > style of formatting of all but the simplest kind today, I think
 > that's unreasonable. *Some* classes may have no formatting
 > implications.

I meant that the output would be human-readable even without styling the
classes, for bare-bones applications. It wouldn't necessarily be
pretty. But by using e.g. <b> for procedure names and <var> for
arguments (including optional and rest arguments as well as return
values) with no CSS classes at all and light typography (brackets,
arrows and ellipses) it already looks serviceable. This is close to
what's currently done in many SRFIs.