Re: Cookbook is now scraped and ready to browse

Re: Cookbook is now scraped and ready to browse Lassi Kortela 08 May 2019 16:44 UTC

>     The git repo <https://github.com/schemedoc/cookbook>
>
> Very nice.  Thanks for finding a copy of that so that it can be
> preserved and perhaps revived.

That repo is just a scraped version of the archive from the Wayback
Machine. The full archive is at <http://lassi.io/temp/schemecookbook.org.tgz>.
(Please save a copy to your hard drive in case I ever lose it; the
Wayback Machine downloader took a long while to gather all the files.)

The scraper is <scrape.rkt>; it simply works from the HTML files
extracted from that tar file.

> It's interesting that none of the pages is a complete HTML page. They
> are all missing <html>, <head>, and <body> elements, for example.

The HTML pages in that tar file are complete. If you look at
<scrape.rkt>, it uses an XPath expression to extract only the
<div id="contentbox"> element.
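
The extraction step described above can be sketched as follows. This is
not the code in <scrape.rkt> (which is Racket and uses XPath); it is a
minimal Python illustration using only the standard library. The names
ContentBoxExtractor and extract_contentbox are made up for this example,
and the sketch assumes the <div> elements in the page are properly
nested.

```python
from html.parser import HTMLParser


class ContentBoxExtractor(HTMLParser):
    """Collect the markup inside the first <div id="contentbox">."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.depth = 0     # <div> nesting depth inside the target (0 = outside)
        self.done = False  # True once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            if (not self.done and tag == "div"
                    and dict(attrs).get("id") == "contentbox"):
                self.depth = 1
            return
        self.parts.append(self.get_starttag_text())
        if tag == "div":
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def extract_contentbox(page: str) -> str:
    """Return the inner markup of the contentbox div, or "" if absent."""
    parser = ContentBoxExtractor()
    parser.feed(page)
    parser.close()
    return "".join(parser.parts)
```

Only <div> tags are counted for nesting, so unclosed <p> or <li> tags
inside the content do not throw the extraction off.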

> the markup doesn't otherwise look like Pandoc markup.

It's almost certainly translated from TWiki markup. There are references
to TWiki all over the tar file. ("TWiki" sounds generic enough that
there could be several wiki engines bearing that name but it's probably
this one: <https://twiki.org/>.)

> In any case, Pandoc can be used to wrap them automatically, or we could use a simple
> script.

That could be done. HTML Tidy ('tidy' on the command line) is another
tool that can auto-wrap partial HTML. But maybe the easiest way to add
precisely the HTML we want is to parse those files into an SXML tree
using Scheme or Racket and then process it with Scheme code. I've done
that in many scripts and it works very well. In fact, the current
<scrape.rkt> parses the original HTML from the tar file into SXML and
then dumps part of the SXML into the new HTML files in the 'wiki' directory.
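
As a rough illustration of the "simple script" option for wrapping the
partial pages, here is a minimal Python sketch. It is an assumption
about what such a script could look like, not something that exists in
the repo; the function name wrap_fragment, the default language "en",
and the UTF-8 charset are all choices made for this example.

```python
from html import escape


def wrap_fragment(fragment: str, title: str, lang: str = "en") -> str:
    """Wrap a partial HTML fragment in a minimal complete HTML document.

    The fragment is inserted as-is; only the title and language
    attribute are escaped.
    """
    return (
        "<!DOCTYPE html>\n"
        f'<html lang="{escape(lang)}">\n'
        "<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        f"{fragment}\n"
        "</body>\n"
        "</html>\n"
    )
```

The title would have to come from somewhere, for example the file name
or the first heading in the fragment; the sketch leaves that decision
to the caller.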

(The original HTML had a lot of metadata, navigation, and other stuff,
so I took only the content. The wiki also had tons of admin pages,
advertisement pages created by spambots, and, for some reason, some
Erlang pages; I left those out of the scrape.)

Before we try these out, we should decide whether to convert to another
markup language or stick with HTML. We'd eventually want to add a lot of
material if we can get permission to do so, and HTML is somewhat
unwieldy to write by hand.

>     The Wiki was licensed under LGPL 2.1 by its original authors so I put
>     that license in our repo too.
>
> There's a refinement to that license in
> "Cookbook_CompilationCopyright.html".

Wow, good catch! I had missed that altogether. So each individual page
is LGPL but the compilation is "all rights reserved". Is that legally
possible? In any case, it'd be clearest to get permission from the
editors as you suggest.

> The first thing we should do is contact Anton Van Straaten, who is
> listed in "Cookbook_CookbookFAQ.html" as the creator.  There are also
> many references to the SchematicsEditorsGroup, but I haven't found a
> definition for that.  (There's no page of that name in the wiki.)

In the tar file there's a "SchematicsEditorsGroup" page. The members
from the latest revision of that page are: Brent Fulgham, Bruce
Butterfield, Francisco Solsona, Gordon Weakliem, Jens Axel Soegaard,
Neil Van Dyke, Noel Welsh, Anton Van Straaten.