Re: Proposal to add HTML class attributes to SRFIs to aid machine-parsing Marc Nieper-Wißkirchen (06 Mar 2019 10:12 UTC)
Re: Proposal to add HTML class attributes to SRFIs to aid machine-parsing Ciprian Dorin Craciun (10 Mar 2019 20:49 UTC)

Re: Proposal to add HTML class attributes to SRFIs to aid machine-parsing Ciprian Dorin Craciun 10 Mar 2019 20:48 UTC

On Thu, Mar 7, 2019 at 10:17 PM Ciprian Dorin Craciun
<xxxxxx@gmail.com> wrote:
> As it stands I have tried to eliminate as much HTML as possible (which
> was used mainly for formatting purposes).  In the next days I'll take
> a look at how I can change the annotations to make it more malleable to
> exports and indexing.
>
> Just a note on my approach:  my focus is mainly on how we can use
> XHTML to obtain the following:
> * easy indexing and back-referencing of the elements;
> * easy export into other formats (like Markdown), thus the used
> (X)HTML elements should be kept to a minimum;
> * easy splitting of the text into sections and definitions so that one
> can programmatically extract only that section;

I've continued my "restructuring" experiment and the following is the
current outcome (still work in progress):

    https://scratchpad.volution.ro/ciprian/1d4efeeb7db3cc7bb208d449c86f895f/structured/srfi-1.html

Basically, what I have done so far is clean the HTML up into XHTML
Basic 1.1 and structure it heavily (just by adding markup elements,
without rewording or changing the actual text).  (I.e., instead of HTML
classes I've just used `<var>`, `<dfn>`, and correctly nested list
items, definition terms, etc.)

Based on this experience (which took quite a while), I've made the
following observations:

(1)  HTML is a very unpleasant language to work with.  (Alex is right!)

The only way I was able to work with it was to open the HTML in a
browser side-by-side with the text editor.  Also my "structured" CSS
helped a lot in detecting incorrect nesting, `<var>` elements
containing more than one "variable", etc.

However it can be done.

(2)  As SRFI-1 stands, the HTML markup was mainly used for formatting,
not for "semantics"...  For example, for procedure definition
arguments, a single `<var>` was used to contain all arguments, instead
of one `<var>` per argument.  (The same goes for many other elements.)

Therefore the "quality" of the HTML code is (as I expected) far from
the quality we expect, for example, from our Scheme code...  (I'm not
criticizing the authors and publishers, as the HTML was written mainly
for WYSIWYG purposes, thus nobody focused on the actual HTML code
itself.)

(3)  There are two approaches to "augmenting" the current SRFI HTML:

* either we take the HTML code as is, and use HTML classes to mark up
various elements, hoping that afterwards we just run a tool that
"massages" the whole mess and outputs something more than an index of
elements;  (this is the approach Lassi took;)

* or we take the HTML code and restructure it into a more "strict"
hierarchy, with some clear "patterns", and afterwards run a tool to
extract and augment that HTML into the final product;  (this is the
approach I'm proposing;)

(4)  As hinted above, I think that just "massaging" the original HTML
(in either variant) is not enough.  I think there should be an extra
step (with an automated tool), that takes the "massaged" HTML and
outputs another one used for "publishing" purposes.

What do I mean by this?  Well, in my approach I wouldn't ask the author
/ editor to set the `id="cons"` attribute where the `cons` function is
defined; instead the automated tool, on finding
`<dfn><code>cons</code></dfn>`, would create that id on the `<dd>`
element containing the definition.  At the same time, whenever it
finds a `<code>cons</code>` it would automatically wrap it in an `<a
href="#cons"><code>cons</code></a>` for back-referencing.  (The same
would happen for bibliographical items, sections, etc.)
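
A minimal sketch of what such a generator could look like, in Python
with only the standard library.  This is my illustration, not an
existing tool: the function name `add_ids_and_links` is hypothetical,
and it assumes a well-formed XHTML fragment without an XML namespace,
where each `<dt>` holding a `<dfn><code>name</code></dfn>` is followed
by the `<dd>` that contains the definition.

```python
import xml.etree.ElementTree as ET


def add_ids_and_links(xhtml: str) -> str:
    """Add id attributes to definitions and link plain references to them."""
    root = ET.fromstring(xhtml)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Pass 1: collect names from <dt><dfn><code>name</code></dfn></dt>
    # and put the id on the <dd> that follows.
    targets = {}
    for dl in root.iter("dl"):
        pending = []
        for child in dl:
            if child.tag == "dt":
                for dfn in child.iter("dfn"):
                    code = dfn.find("code")
                    if code is not None and code.text:
                        pending.append(code.text.strip())
            elif child.tag == "dd" and pending:
                # Assumption: the first name becomes the id; any other
                # names defined in the same entry link to that id too.
                child.set("id", pending[0])
                for name in pending:
                    targets[name] = pending[0]
                pending = []

    # Pass 2: wrap every other plain <code>name</code> in <a href="#id">.
    for code in list(root.iter("code")):
        parent = parent_of.get(code)
        name = (code.text or "").strip()
        if parent is None or parent.tag in ("dfn", "a"):
            continue  # the definition itself, or already a link
        if len(code) or name not in targets:
            continue  # signatures with <var> children, unknown names
        link = ET.Element("a", {"href": "#" + targets[name]})
        link.tail, code.tail = code.tail, None
        index = list(parent).index(code)
        parent.remove(code)
        link.append(code)
        parent.insert(index, link)

    return ET.tostring(root, encoding="unicode")
```

Names such as `cons*` are not valid XHTML ids as they stand, so a real
tool would also need some (not yet decided) encoding scheme for them.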

I propose this based on the observation that requiring the editor to
always set `id` and `<a href>` markup by hand would drive anyone mad...

Therefore my workflow proposal is as follows:
* once the author / editor moves an SRFI to final status, the
workflow begins;
* the original HTML is taken and a few automatic "changes" are applied
to clean it up (mainly `tidy`, but perhaps we can automate
something more);
* the "volunteer" takes that HTML and restructures the elements into a
proper hierarchy (as I've done);
* the "volunteer" runs the automatic "generator", which augments
the XHTML with `id` and `href` attributes and so on;

My next step (perhaps next weekend) is to continue the
structuring and introduce the concept of "sections" (basically based
on `<div>`'s), and see how I can re-format the definitions so that
automatic "splitting" of the document is easy.

Also I'll try to come up with a second CSS that can be applied to the
same structured XHTML, but for the purpose of display.

> I've made a fork of SRFI-1 and started chopping and transforming the
> HTML into XHTML, at the same time eliminating some "boilerplate"
> elements (to focus only on the actual text), and changing some HTML.
> The changes can be seen at:
>
>   https://github.com/scheme-requests-for-implementation/srfi-1/compare/master...cipriancraciun:master

I've applied the following (not necessarily in order, see the diff
above for the actual order):

* converted everything to XHTML Basic 1.1 (it's a very lightweight and
constrained XHTML variant, that should be implemented even by the most
basic hand-helds...  and surprisingly the conversion required just a
few minor edits to be compliant...)
* used `tidy` to re-format everything;

* some minor non-semantic changes;
* replaced tables with lists;  (they were in fact used only for
appearance purposes;)
* removed all `<div>` elements as they were used only for display purposes;

* used `dfn` elements to mark where the procedure is defined;  (see
below for the outcome;)
* split `<var>x y</var>` arguments into `<code><var>x</var>
<var>y</var></code>` so that the signature is structured;

    <dt><dfn><code>cons</code></dfn>
    <code><var>a</var> <var>d</var> -&gt; <var>pair</var></code></dt>
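
The `<var>` splitting could in principle be automated.  The following
is only a sketch under my own assumptions (I did these edits by hand):
the function name `split_vars` is hypothetical, the input is a
well-formed fragment without an XML namespace, the multi-argument
`<var>` contains nothing but whitespace-separated names, and it is not
already nested inside a `<code>`.

```python
import xml.etree.ElementTree as ET


def split_vars(xhtml: str) -> str:
    """Turn <var>x y</var> into <code><var>x</var> <var>y</var></code>."""
    root = ET.fromstring(xhtml)
    # Snapshot the elements first, because the tree is edited in the loop.
    for parent in list(root.iter()):
        for var in list(parent):
            if var.tag != "var" or len(var):
                continue  # not a <var>, or it has nested markup
            names = (var.text or "").split()
            if len(names) < 2:
                continue  # already one variable per <var>
            index = list(parent).index(var)
            code = ET.Element("code")
            code.tail = var.tail  # keep the text that followed the <var>
            for name in names:
                item = ET.SubElement(code, "var")
                item.text = name
                item.tail = " "
            code[-1].tail = None  # no trailing space inside <code>
            parent.remove(var)
            parent.insert(index, code)
    return ET.tostring(root, encoding="unicode")
```

Single-name `<var>` elements (like the return value above) are left
untouched.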

* replaced all `<a href="...">zzz</a>` with plain `<a>zzz</a>`, based
on the observation that these `<a>` elements are used for bibliography
purposes, thus they can be "transformed" afterwards;  (see the
proposed workflow above);
* removed all `id="..."` attributes, based on the observation that
given a "good" structured document, these can be added afterwards;
* removed all "proc-defn" (and similar), based on the observation that
they aren't used anywhere in the document, and if they were used
for signatures they would be superfluous;

* removed all `<var>` (and other markup) from within `<pre>` based on
the observation that given the context where the `<pre>` appears (i.e.
which `<var>` are present in neighboring elements) we could re-add
this markup;
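
To illustrate why I think the `<pre>` markup is recoverable, here is a
rough sketch (again hypothetical, not something I have built).
`strip_pre_markup` flattens a `<pre>` to plain text, and `mark_vars`
re-adds `<var>` around tokens that match a given set of names.  How
that set is collected from the neighboring elements is left open, and
the delimiter list is my simplification of Scheme's token rules.

```python
import html
import re
import xml.etree.ElementTree as ET

# Assumption: identifiers are separated by whitespace, parentheses,
# brackets, quote characters, semicolons and commas.
DELIMS = re.compile(r"([\s()\[\]'\"`;,]+)")


def strip_pre_markup(pre: ET.Element) -> str:
    """Return the text of a <pre> element with all inner markup removed."""
    return "".join(pre.itertext())


def mark_vars(text: str, names: set) -> str:
    """Rebuild a <pre> from plain text, wrapping known names in <var>."""
    out = []
    for part in DELIMS.split(text):
        escaped = html.escape(part, quote=False)
        out.append(f"<var>{escaped}</var>" if part in names else escaped)
    return "<pre>" + "".join(out) + "</pre>"
```

For example, flattening `<pre>(cons <var>a</var> <var>d</var>)</pre>`
gives `(cons a d)`, and calling `mark_vars` on that text with the names
`a` and `d` restores the original markup.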

Ciprian.