

Scraping full text documentation into the API Lassi Kortela 12 Jul 2019 11:22 UTC

In my opinion, the API should eventually serve the full text of Scheme
documentation for RnRS, SRFI, implementations, and libraries. This would
make it easy to build various documentation readers (web-based,
devdocs.io / Dash.app style, Emacs, REPL, etc.).

This is a very ambitious goal, but I think it is useful, and I don't see any
harm in it. (Documentation is under an open source / free software
license, so we are allowed to copy it and serve it.) It doesn't matter
if it takes 2-3 years to get this to a useful state. We have time.

I'd just like to take this into account now so that the infrastructure
is ready. As far as I can tell, documentation can be scraped by the exact
same framework as introduced in my last email. I can't think of any
modifications that are needed. To start, we just need a Texinfo parser
(most Scheme documentation seems to be written in Texinfo) and an
S-expression representation for its output (Guile already has one that
we can maybe use). We can support other doc formats, but starting with
Texinfo is the fastest way to parse a lot of real docs.
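
To make the Texinfo-to-S-expression idea concrete, here is a rough
sketch in Python (only for illustration; the real scraper need not be
written in it). It handles just @deffn blocks, and the output shape
(deffn (category ...) (name ...) (arguments ...) (body ...)) is made up
for this example. It is not Guile's representation, and the function
names parse_deffn_blocks and to_sexp are hypothetical.

```python
import re


def parse_deffn_blocks(texinfo_source):
    """Collect @deffn ... @end deffn blocks as nested lists.

    Only a tiny subset of Texinfo is handled. Inline markup such as
    @var{...} is left untouched inside the body strings.
    """
    entries = []
    current = None
    for line in texinfo_source.splitlines():
        stripped = line.strip()
        if stripped.startswith("@deffn "):
            # Braces group multi-word arguments such as {Scheme Procedure}.
            tokens = re.findall(r"\{[^}]*\}|\S+", stripped[len("@deffn "):])
            tokens = [token.strip("{}") for token in tokens]
            if len(tokens) < 2:
                continue  # malformed header: skip it rather than guess
            category, name, args = tokens[0], tokens[1], tokens[2:]
            # Hypothetical shape, chosen for this sketch only.
            current = ["deffn", ["category", category], ["name", name],
                       ["arguments"] + args, ["body"]]
        elif stripped == "@end deffn" and current is not None:
            entries.append(current)
            current = None
        elif current is not None and stripped:
            current[-1].append(stripped)  # non-empty line of the body
    return entries


def to_sexp(node):
    """Render a nested list as S-expression text.

    The first element of every list is written as a bare symbol;
    every other string is written as a quoted string literal.
    """
    if isinstance(node, list):
        head, rest = node[0], node[1:]
        return "(" + " ".join([head] + [to_sexp(child) for child in rest]) + ")"
    return '"' + node.replace("\\", "\\\\").replace('"', '\\"') + '"'


sample = """@deffn {Scheme Procedure} car pair
Return the first element of @var{pair}.
@end deffn
"""

for entry in parse_deffn_blocks(sample):
    print(to_sexp(entry))
```

A real parser would also have to deal with nesting, @node / @section
structure, and inline commands, which is why reusing an existing
representation would save work.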

The giant S-expression may eventually grow to something like 100 MiB if
most Scheme documentation is scraped, but I don't think it matters.
There are countless approaches to storage (split into small files, use
binary files, use a database, etc.). It's an implementation detail.
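
As one example of the "split into small files" option, the sketch below
(again Python, purely illustrative) writes one file per documented
entry. The directory layout, the .scm extension, and the names
entry_path, write_entries and read_entry are all assumptions of this
sketch, not a proposal for the actual layout. Names are percent-encoded
because Scheme identifiers such as call/cc or string->list contain
characters that are awkward in file names.

```python
import tempfile
from pathlib import Path
from urllib.parse import quote


def entry_path(root, collection, name):
    """Map (collection, name) to a file path, percent-encoding both parts."""
    return Path(root) / quote(collection, safe="") / (quote(name, safe="") + ".scm")


def write_entries(root, entries):
    """Write each (collection, name, sexp_text) triple to its own file."""
    paths = []
    for collection, name, sexp_text in entries:
        path = entry_path(root, collection, name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(sexp_text + "\n", encoding="utf-8")
        paths.append(path)
    return paths


def read_entry(root, collection, name):
    """Read one entry back without loading the rest of the collection."""
    text = entry_path(root, collection, name).read_text(encoding="utf-8")
    return text.rstrip("\n")


with tempfile.TemporaryDirectory() as root:
    write_entries(root, [
        ("r7rs", "call/cc", '(deffn (name "call/cc"))'),
        ("r7rs", "string->list", '(deffn (name "string->list"))'),
    ])
    print(read_entry(root, "r7rs", "call/cc"))
```

The point is only that a reader can fetch one entry without touching
the rest; a database or a binary format would give the same property.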

It would also be awesome to have full-text search of all the
documentation. Amirouche's search engine can probably be used by the
time we have a large amount of actual docs included.