On Thu, Mar 7, 2019 at 10:17 PM Ciprian Dorin Craciun <xxxxxx@gmail.com> wrote:
> As it stands I have tried to eliminate as much HTML as possible (which
> was used mainly for formatting purposes).  In the next days I'll take
> a look at how I can change the annotations to make it more malleable to
> exports and indexing.
>
> Just a note on my approach:  my focus is mainly on how we can use
> XHTML to obtain the following:
> * easy indexing and back-referencing of the elements;
> * easy export into other formats (like Markdown), thus the used
> (X)HTML elements should be kept to a minimum;
> * easy splitting of the text into sections and definitions so that one
> can programmatically extract only that section;

I've continued my "restructuring" experiment and the following is the
current outcome (still work in progress):

  https://scratchpad.volution.ro/ciprian/1d4efeeb7db3cc7bb208d449c86f895f/structured/srfi-1.html

Basically what I've done so far is clean the HTML into XHTML Basic 1.1
and structure it heavily (by just adding markup elements, without
rewording or changing the actual text).  (I.e. instead of HTML classes
I've actually just used `<var>`, `<dfn>` and correctly nested list
items, definition terms, etc.)

Based on this experience (which took quite a while), I've made the
following observations:

(1) HTML is a very unpleasant language to work with.  (Alex is right!)
The only way I was able to work with it was to open the HTML in a
browser side-by-side with the text editor.  Also my "structured" CSS
helped a lot in detecting incorrect nesting, `<var>` elements
containing more than one "variable", etc.  However it can be done.

(2) As SRFI-1 stands, the HTML markup was mainly used for formatting
and not for "semantics"...  For example, in the case of procedure
definition arguments, a single `<var>` was used to contain all
arguments, instead of one `<var>` per argument.  (The same goes for
many other elements.)
Therefore the "quality" of the HTML code is (as I expected) far from
the quality we expect, for example, from our Scheme code...  (I'm not
criticizing the authors and publishers: the HTML was written mainly
for WYSIWYG purposes, thus nobody focused on the actual HTML code
itself.)

(3) There are two approaches to "augmenting" the current SRFI HTML:
* either we take the HTML code as is, and use HTML classes to mark up
various elements, hoping that afterwards we just run a tool that
"massages" the whole mess and outputs something more than an index of
elements;  (this is the approach Lassi took;)
* or we take the HTML code and restructure it into a more "strict"
hierarchy, with some clear "patterns", and afterwards run a tool to
extract and augment that HTML into the final product;  (this is the
approach I'm proposing;)

(4) As hinted above, I think that just "massaging" the original HTML
(in either variant) is not enough.  I think there should be an extra
step (with an automated tool) that takes the "massaged" HTML and
outputs another one used for "publishing" purposes.

What do I mean by this?  Well, in my approach I wouldn't ask the author
/ editor to set the `id="cons"` attribute where the `cons` function is
defined; instead the automated tool, upon finding
`<dfn><code>cons</code></dfn>`, would create that id on the `<dd>`
element containing the definition.  At the same time, whenever it
finds a `<code>cons</code>` it would automatically wrap it in
`<a href="#cons"><code>cons</code></a>` for back-referencing.  (The
same would happen for bibliographical items, sections, etc.)

I propose this based on the observation that mandating the editor to
always set `id` and `<a href>` markup would drive one mad...
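
The generator step described in (4) could be sketched roughly as
follows.  This is only an illustration, not the actual tool: the
function name `add_cross_references` is hypothetical, it assumes a
well-formed, namespace-free XHTML fragment in which each definition is
a `<dt>` followed by a `<dd>`, and it assumes the procedure name can be
used as an `id` unchanged (names such as `null?` would need some
escaping scheme that is not shown here).

```python
import xml.etree.ElementTree as ET


def add_cross_references(xhtml: str) -> str:
    """Add ids to definitions and link plain references to them."""
    root = ET.fromstring(xhtml)

    # Pass 1: for each <dt> holding <dfn><code>NAME</code></dfn>, give the
    # next <dd> sibling an id and remember NAME -> id.
    targets = {}
    for dl in root.iter("dl"):
        items = list(dl)
        for pos, item in enumerate(items):
            name_el = item.find(".//dfn/code") if item.tag == "dt" else None
            if name_el is None or not (name_el.text or "").strip():
                continue
            name = name_el.text.strip()
            dd = next((s for s in items[pos + 1:] if s.tag == "dd"), None)
            if dd is not None:
                # Several <dt> may share one <dd>; reuse its first id.
                # Assumption: NAME is usable as an id as-is (no escaping).
                targets[name] = dd.attrib.setdefault("id", name)

    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Pass 2: wrap every plain <code>NAME</code> in <a href="#id">, except
    # the defining occurrence (inside <dfn>) and anything already in <a>.
    for code in list(root.iter("code")):
        name = (code.text or "").strip()
        parent = parent_of.get(code)
        if name not in targets or len(code) > 0 or parent is None:
            continue
        if parent.tag in ("dfn", "a"):
            continue
        link = ET.Element("a", {"href": "#" + targets[name]})
        index = list(parent).index(code)
        parent.remove(code)
        link.tail, code.tail = code.tail, None  # keep trailing text in place
        link.append(code)
        parent.insert(index, link)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<div><dl><dt><dfn><code>cons</code></dfn> "
        "<code><var>a</var> <var>d</var></code></dt>"
        "<dd><p>Allocates a pair.</p></dd></dl>"
        "<p>See <code>cons</code> above.</p></div>"
    )
    print(add_cross_references(sample))
```

With such a pass the editor only maintains `<dfn>` and `<code>`
elements, and the `id` / `href` attributes are regenerated on every
run.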

Therefore my workflow proposal is as follows:
* once the author / editor moves a SRFI into the final status, the
workflow begins;
* the original HTML is taken and a few automatic "changes" are applied
to clean up the HTML (mainly `tidy`, but perhaps we can automate
something more);
* the "volunteer" takes that HTML and restructures the elements into a
proper hierarchy (as I've done);
* the "volunteer" executes the automatic "generator" which augments
the XHTML with `id` and `href` attributes and so on;

My next step (perhaps next weekend) is to continue the structuring
and introduce the concept of "sections" (basically based on `<div>`s),
and see how I can re-format the definitions so that automatic
"splitting" of the document is easy.  Also I'll try to come up with a
second CSS that can be applied to the same structured XHTML, but for
the purpose of display.

> I've made a fork of SRFI-1 and started chopping and transforming the
> HTML into XHTML, at the same time eliminating some "boilerplate"
> elements (to focus only on the actual text), and changing some HTML.
> The changes can be seen at:
>
> https://github.com/scheme-requests-for-implementation/srfi-1/compare/master...cipriancraciun:master

I've applied the following (not necessarily in order, see the diff
above for the actual order):
* converted everything to XHTML Basic 1.1 (it's a very lightweight and
constrained XHTML variant, that should be implemented even by the most
basic hand-helds... and surprisingly the conversion required just a
few minor edits to be compliant...)
* used `tidy` to re-format everything;
* some minor non-semantic changes;
* replaced tables with lists;  (they were in fact used only for
appearance purposes;)
* removed all `<div>` elements as they were used only for display
purposes;
* used `dfn` elements to mark where the procedure is defined;  (see
below for the outcome;)
* split `<var>x y</var>` arguments into
`<code><var>x</var> <var>y</var></code>` so that the signature is
structured;

  <dt><dfn><code>cons</code></dfn> <code><var>a</var> <var>d</var> -> <var>pair</var></code></dt>

* replaced all `<a href="...">zzz</a>` with plain `<a>zzz</a>`, based
on the observation that these `<a>` elements are used for bibliography
purposes, thus they can be "transformed" afterwards;  (see the proposed
workflow;)
* removed all `id="..."` attributes, based on the observation that
given a "good" structured document, these can be added afterwards;
* removed all "proc-defn" (and similar), based on the observation that
they aren't used anywhere in the document, and if they were used for
signatures they would be superfluous;
* removed all `<var>` (and other markup) from within `<pre>`, based on
the observation that given the context where the `<pre>` appears (i.e.
which `<var>` are present in neighboring elements) we could re-add
this markup;

Ciprian.