Re: Cookbook is now scraped and ready to browse Lassi Kortela (08 May 2019 16:44 UTC)

> The git repo <https://github.com/schemedoc/cookbook>
>
> Very nice.  Thanks for finding a copy of that so that it can be
> preserved and perhaps revived.

That repo is just a scraped version of the archive from Wayback Machine.
The full archive is at <http://lassi.io/temp/schemecookbook.org.tgz>.
(Please grab a copy on your hard drive in case I ever lose it - Wayback
Machine downloader took a long while to gather all the files in it).
The scraper is in <scrape.rkt> and is simply based on HTML files
extracted from that tar file.

> It's interesting that none of the pages is a complete HTML page.  They
> are all missing <html>, <head>, and <body> elements, for example.

The HTML pages in that tar file are complete. If you look at
<scrape.rkt> it uses an XPath expression to extract
<div id="contentbox"> only.

> the markup doesn't otherwise look like Pandoc markup.

It's almost certainly translated from TWiki markup. There are references
to TWiki all over the tar file. ("TWiki" sounds generic enough that
there could be several wiki engines bearing that name but it's probably
this one: <https://twiki.org/>.)

> In any case, Pandoc can be used to wrap them automatically, or we
> could use a simple script.

That could be done. HTML Tidy ('tidy' from the command line) is another
tool that can auto-wrap partial HTML. But maybe the easiest way to add
precisely the HTML we want is to parse those files into a SXML tree
using Scheme or Racket and then use Scheme code to process it. I've done
that in many scripts and it works very well. In fact, the current
<scrape.rkt> parses the original HTML from the tar file into SXML and
then dumps part of the SXML into the new HTML files in the 'wiki'
directory. (The original HTML had a lot of metadata and navigation and
other stuff so I took only the content. There were also tons of admin
pages, advertisement pages created by spambots, and some Erlang pages
for some reason, in the wiki; I left those out of the scrape.)
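
The extract-and-wrap idea above can be sketched roughly as follows. This
is only an illustration in Python using the standard library's
html.parser; it is not the code in <scrape.rkt>, which works on SXML with
an XPath expression. The names ContentBoxExtractor, extract_contentbox
and wrap_fragment are hypothetical, and the sketch assumes that each
original page has one <div id="contentbox"> whose inner markup should be
kept as-is.

```python
from html import escape
from html.parser import HTMLParser


class ContentBoxExtractor(HTMLParser):
    """Collect the inner markup of <div id="contentbox"> (hypothetical helper)."""

    def __init__(self, box_id="contentbox"):
        # Keep entities such as &amp; unconverted so they can be re-emitted verbatim.
        super().__init__(convert_charrefs=False)
        self.box_id = box_id
        self.depth = 0   # <div> nesting depth inside the target div; 0 means outside
        self.parts = []  # pieces of markup found inside the target div

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            if tag == "div" and dict(attrs).get("id") == self.box_id:
                self.depth = 1  # entered the target div; its own tag is not kept
            return
        if tag == "div":
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are copied through unchanged.
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                return  # closing tag of the target div itself
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth > 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth > 0:
            self.parts.append(f"&#{name};")


def extract_contentbox(page_html):
    """Return the markup inside <div id="contentbox">, or "" if there is none."""
    parser = ContentBoxExtractor()
    parser.feed(page_html)
    parser.close()
    return "".join(parser.parts).strip()


def wrap_fragment(fragment, title):
    """Wrap an HTML fragment in a minimal complete HTML page."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"{fragment}\n"
        "</body>\n</html>\n"
    )
```

Calling wrap_fragment(extract_contentbox(page), title) on one of the
original pages would give a complete page containing only the content.
Comments inside the content div are dropped by this sketch, and HTML
Tidy or Pandoc would still be options for the wrapping step alone.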

Before we try these out, we should decide whether to convert to another
markup language or stick with HTML. We'd eventually want to add a lot of
stuff if we can get the permission to do so, and HTML is somewhat
unwieldy to write by hand.

> The Wiki was licensed under LGPL 2.1 by its original authors so I put
> that license in our repo too.
>
> There's a refinement to that license in
> "Cookbook_CompilationCopyright.html".

Wow, good catch! I had missed that altogether. So each individual page
is LGPL but the compilation is "all rights reserved". Is that legally
possible? In any case, it'd be clearest to get permission from the
editors as you suggest.

> The first thing we should do is contact Anton Van Straaten, who is
> listed in "Cookbook_CookbookFAQ.html" as the creator.  There are also
> many references to the SchematicsEditorsGroup, but I haven't found a
> definition for that.  (There's no page of that name in the wiki.)

In the tar file there's a "SchematicsEditorsGroup" page. The members
from the latest revision of that page are: Brent Fulgham, Bruce
Butterfield, Francisco Solsona, Gordon Weakliem, Jens Axel Soegaard,
Neil Van Dyke, Noel Welsh, Anton Van Straaten.