Re: We need a Pandoc implementation in Scheme
From: Lassi Kortela
Date: 28 Jun 2021 19:29 UTC

Amirouche, sorry about the long delay in responding to this!

>> A number of goals are converging on the general requirement that we need
>> a modular Pandoc (https://en.wikipedia.org/wiki/Pandoc) clone written in
>> portable Scheme.
>
> Why is it required or necessary?

Many of the web pages under Scheme.org are written in Markdown or
AsciiDoc. And if we want a documentation index/browser under
Scheme.org, we need to parse the manuals of Scheme implementations
and libraries; those are written in Texinfo, Scribble, and other
formats.

It is prohibitively hard to do even simple transformations on that
content (e.g. styling it via CSS or adding navigation elements to
the HTML page) if we call out to converter programs written in other
languages, since those converters do not output an S-expression
representation that is easy for us to process. It soon gets to the
point where the easiest solution is to have the necessary parsers as
Scheme libraries instead.
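
To make that concrete: once a document is SXML, a transformation
like adding navigation is a small tree rewrite. A minimal sketch
(the page shape and nav-links are made up for illustration):

    ;; Insert a navigation list at the top of an SXML page body.
    (define nav-links
      '((a (@ (href "/")) "Home")
        (a (@ (href "/docs/")) "Docs")))

    (define (add-nav node)
      (cond
       ;; At the body element, prepend a nav, then rewrite children.
       ((and (pair? node) (eq? (car node) 'body))
        `(body (nav (ul ,@(map (lambda (link) `(li ,link)) nav-links)))
               ,@(map add-nav (cdr node))))
       ;; Other elements: rebuild with rewritten children.
       ((pair? node)
        (cons (car node) (map add-nav (cdr node))))
       ;; Text nodes and symbols pass through unchanged.
       (else node)))

    (add-nav '(html (body (h1 "Hello") (p "world"))))
    ;; => (html (body (nav (ul (li (a (@ (href "/")) "Home"))
    ;;                         (li (a (@ (href "/docs/")) "Docs"))))
    ;;                (h1 "Hello") (p "world")))

Doing the equivalent on the text output of an external converter
means parsing its HTML all over again.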

> About SXML tools: it seems clear to me that nowadays XPath, and
> hence sxpath, is dead or dying. Most new blood relies on CSS
> selectors to query HTML (and XML is, afaik, nearly non-existent
> nowadays).

Good insight! I don't know much about CSS selectors (unless they are
similar to jQuery selectors); I should look into it.
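
For what it's worth, sxpath (from the SXML tools, e.g. Guile's
(sxml xpath) module) already expresses such queries as S-expression
paths. From memory, roughly the CSS selector "body a":

    ;; An sxpath query is a procedure from a node to a nodeset.
    (define body-links (sxpath '(// body // a)))

    (body-links
     '(*TOP*
       (html (body (p "See " (a (@ (href "/docs/")) "the docs"))))))
    ;; => ((a (@ (href "/docs/")) "the docs"))

I would guess CSS selectors are mostly sugar (classes, attribute
tests) over this kind of axis navigation, so a selector front end
for SXML may not be a big job.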

> (I came to this subject from a very distant topic: making
> match-lambda fast. My idea is/was to specialize a set of match
> patterns, using some kind of prefix compression, to avoid testing
> the same things multiple times. It is also related to generic
> methods: generic methods are like a match-lambda except that each
> match clause is built in its own module. Similarly, it is possible
> to apply prefix compression at compile time, because most generics
> vary at the beginning, which makes me think a general decision
> tree would be overkill; also, arguments are passed as a list, and
> a general decision tree would require converting the list to a
> vector to be able to index into the middle efficiently. And my
> plan is/was to construct a parser combinator from the match spec,
> specialize/optimize it, and then produce a lambda with nested ifs
> that dispatches to the correct underlying procedure. I think it is
> possible with Chez, using define-property (which attaches an
> expression to an identifier) and good use of implicit phasing.)
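
If I follow the nested-if idea, the aim is to turn the first form
below into something like the second. A hand-waved sketch; the
patterns and the handle-* procedures are made up:

    ;; What one writes: match clauses sharing a prefix.
    (define dispatch
      (match-lambda
        (('get path)       (handle-get path))
        (('get path query) (handle-get* path query))
        (('put path value) (handle-put path value))))

    ;; What prefix compression could produce: the head symbol is
    ;; tested once; only the clause remainders differ. (The cond
    ;; expands into the nested ifs.)
    (define (dispatch* msg)
      (let ((head (car msg)) (rest (cdr msg)))
        (cond
         ((eq? head 'get)
          (if (null? (cdr rest))
              (handle-get (car rest))
              (handle-get* (car rest) (cadr rest))))
         ((eq? head 'put)
          (handle-put (car rest) (cadr rest)))
         (else (error "no matching clause" msg)))))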
>
> That is why I started working on a parser combinator library
> called paco. I have attached to this mail a proof-of-concept that
> will stream a parse result, and possibly stream alternative parse
> results in the case where the grammar is ambiguous (not coded
> yet). It is a broad feature set with many pitfalls. I already
> removed the ability to parse left-recursive grammars. And maybe
> handling ambiguous grammars is not useful as part of a Pandoc
> clone.

Great. Is the set of combinators similar to those used in other parser
combinator libraries, and would it be hard to write Markdown, AsciiDoc,
etc. parsers using them? My first hunch was to port existing parsers
from Chicken and Guile, but that's not necessarily easy.
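
For comparison, the common shape in other combinator libraries is
roughly the following (this is not paco's API, just the usual
vocabulary; a parser is a procedure from a list of characters to
either (value . remaining-input) on success or #f on failure):

    (define (p-char pred)            ; one character satisfying pred
      (lambda (in)
        (and (pair? in) (pred (car in))
             (cons (car in) (cdr in)))))

    (define (p-or a b)               ; first alternative to succeed
      (lambda (in) (or (a in) (b in))))

    (define (p-seq a b)              ; a then b; pair up their values
      (lambda (in)
        (let ((ra (a in)))
          (and ra
               (let ((rb (b (cdr ra))))
                 (and rb
                      (cons (cons (car ra) (car rb)) (cdr rb))))))))

    (define (p-many p)               ; zero or more p's, as a list
      (lambda (in)
        (let loop ((in in) (acc '()))
          (let ((r (p in)))
            (if r
                (loop (cdr r) (cons (car r) acc))
                (cons (reverse acc) in))))))

    ((p-many (p-char char-alphabetic?)) (string->list "abc1"))
    ;; => ((#\a #\b #\c) #\1)

If paco's combinators are in that family, porting the Chicken and
Guile parsers might mostly be a renaming job.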

In the long run, I believe in hand-rolled recursive-descent parsers
(all the most advanced parsers in big compilers like GCC seem to end
up hand-written as recursive descent). Parser generators seem hard
to use for making really good parsers, i.e. ones with good error
messages that avoid weird performance edge cases. A small Scheme
subset with only stack allocation would be a natural fit for writing
these parsers.
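
As a tiny illustration of the hand-rolled style, here is a made-up
toy fragment (not a real Markdown grammar; string-trim is from
SRFI 13):

    ;; ATX-ish headings; everything else becomes a paragraph.
    (define (parse-heading line)
      ;; Count leading #'s (at most six); the rest is the text.
      (let loop ((i 0))
        (if (and (< i (string-length line))
                 (< i 6)
                 (char=? (string-ref line i) #\#))
            (loop (+ i 1))
            (and (> i 0)
                 (list (string->symbol
                        (string-append "h" (number->string i)))
                       (string-trim
                        (substring line i (string-length line))))))))

    (define (parse-block line)
      (or (parse-heading line)
          (list 'p line)))

    (parse-block "## Hello")  ;; => (h2 "Hello")

Error reporting falls out naturally: at any point the parser knows
exactly where it is and what it expected.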

> There are many test files in Pandoc's GitHub repository:
> https://github.com/jgm/pandoc/blob/master/test/
> But I think they miss the point of *unit* testing. I would rather
> approach the problem in a way that is similar to SRFI-180 (JSON).
>
> Another topic of interest while working on parsers is fuzzing. The de
> facto standard seems to be
> https://en.wikipedia.org/wiki/American_fuzzy_lop_(fuzzer)
> Another source of possibly interesting test cases is
> https://github.com/google/oss-fuzz
>
> I think my next goal is to rewrite SRFI-180 and the HTTP 1.1
> parser; then I will try to parse some markup language for a task
> at work that needs a parser. Getting together a test suite as
> described above (piecewise tests) would be of great help for
> things like HTML, MicroXML, and CommonMark, and would also improve
> the test suite of the HTTP 1.1 parser.
>
> Help and feedback are welcome!

All of those sound like good projects!
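
On the SRFI-180-style approach: for markup, the analogue would
presumably be a generator of events rather than a whole document
tree. A rough sketch of the shape, with all names made up (a toy
generator over a fixed event list stands in for a real streaming
parser):

    (define (events->generator events)
      (lambda ()
        (if (null? events)
            (eof-object)
            (let ((ev (car events)))
              (set! events (cdr events))
              ev))))

    (define gen
      (events->generator
       '((start heading 1) "Hello" (end heading)
         (start paragraph) "world" (end paragraph))))

    ;; A consumer folds over events without holding the whole
    ;; document in memory:
    (define (count-headings gen)
      (let loop ((n 0))
        (let ((ev (gen)))
          (cond ((eof-object? ev) n)
                ((and (pair? ev)
                      (eq? (car ev) 'start)
                      (eq? (cadr ev) 'heading))
                 (loop (+ n 1)))
                (else (loop n))))))

    (count-headings gen)  ;; => 1

That would make piecewise unit tests easy too: each test is just an
input string and the expected event list.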