Re: SRFI-metadata-syncing SRFI?

Re: SRFI-metadata-syncing SRFI? Lassi Kortela 09 Nov 2020 16:36 UTC

> While workable, this seems to me to be less than ideal because any time
> one scrapes something the process is fragile, needing manual intervention
> to fix the scraper whenever some unforeseen change happens to the
> structure of what's being scraped.

The formats don't change all that much, and the big benefit of scrapers
is that they are executable documentation.

Before we had the current scrapers, there were various hand-written
listings of SRFI support around the net. It was impossible to tell where
they came from, how they had been assembled, and which parts were up to
date. There was no way for a newcomer to replicate the results.

Any large-scale data aggregation effort should absolutely use scrapers,
if only for documentation purposes. But it's also a good way to avoid
human error.

> There's also unnecessary bandwidth being wasted repeatedly downloading
> tar files, and time spent uncompressing and searching through them for
> what amounts to a relatively tiny bit of data.

These are non-issues. GitHub and GitLab have tons of bandwidth. Gambit
is one of the biggest Schemes and running listings/gambit-head.sh takes
only 4 seconds, including the time GitHub takes to generate us a
tailor-made tar archive of Gambit's git master.
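
As a rough illustration of what such a scraper does, here is a minimal
Python sketch. It is not the actual listings/gambit-head.sh. It assumes a
hypothetical layout in which SRFI libraries sit under a "srfi/" directory;
the regex, the function names, and the sample paths are invented for the
example, and it builds its own tiny archive instead of downloading one.

```python
import io
import re
import tarfile

# Hypothetical layout: SRFI libraries live at paths such as "lib/srfi/1/1.sld".
# Real implementations differ, which is why each one needs its own rule.
SRFI_PATH = re.compile(r"(?:^|/)srfi/(\d+)(?:/|\.|$)")


def srfi_numbers_in_tar(data: bytes) -> list[int]:
    """Return the sorted SRFI numbers implied by member names in a tar archive."""
    found = set()
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as archive:
        for name in archive.getnames():
            match = SRFI_PATH.search(name)
            if match:
                found.add(int(match.group(1)))
    return sorted(found)


def make_sample_tar(names: list[str]) -> bytes:
    """Build a small in-memory .tar.gz so the sketch runs without network access."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in names:
            info = tarfile.TarInfo(name)
            info.size = 0  # empty files are enough; only the names are scanned
            archive.addfile(info, io.BytesIO(b""))
    return buffer.getvalue()
```

The point of keeping the rule in code is the one made above: anyone can
rerun it and see exactly how the listing was produced.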

We should scrape all this from different implementations and package
indexes, and aggregate it into one place where it's available as one
JSON and/or S-expression file.

But it pays to distinguish source data from aggregated data. If a single
aggregator takes 5 seconds to scrape each source, that's not a problem,
since everyone else can read the aggregated result.

> Because there is no standard, the data you get from an arbitrary
> Scheme's tar file is going to be unstructured, requiring more custom
> rules to extract it.
>
> Wouldn't it be so much simpler if every Scheme published the desired
> data in the desired format that could be directly, reliably consumed
> without having to write any custom code to deal with unstructured data
> in random locations?

It would, and this is most easily accomplished by adding an S-expression
file to each Scheme's git repo.

GitHub and GitLab can make you a link to each raw file stored in any Git
repo. E.g.
<https://raw.githubusercontent.com/schemedoc/implementation-metadata/master/schemes/chicken.scm>.
You can also change "master" in the URL to a different branch or tag. If
the aggregator could look for a standard file in each repo, it wouldn't
have to download the whole repo, and each fetch would take less than
1 second.
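
A minimal sketch of that single-file fetch, in Python. The function names
are hypothetical, and the standard file is assumed to live at a known path
in each repo; only the URL shape comes from the example above.

```python
from urllib.parse import quote
from urllib.request import urlopen

RAW_BASE = "https://raw.githubusercontent.com/"


def raw_github_url(owner: str, repo: str, ref: str, path: str) -> str:
    """Build the raw-file URL for one file at a given branch or tag."""
    # quote() leaves "/" alone by default, so path separators survive.
    return RAW_BASE + quote("/".join([owner, repo, ref, path]))


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Download one small file; no clone and no tar archive is involved."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling raw_github_url("schemedoc", "implementation-metadata", "master",
"schemes/chicken.scm") reproduces the URL given above; passing a tag name
instead of "master" pins the aggregator to a release.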

> Scanning through each package's metadata might be the most reliable way
> to do this, but there is still the question of how that metadata is made
> available.

The source metadata would be in each package. Each package manager would
scan all of its own packages and compile an index file. The aggregator
that compiles the full SRFI support table for all implementations would
then aggregate _that_ data :) We should serve the full table as HTML,
JSON, and S-expressions so people can save time and easily
machine-extract things directly from the full aggregated data.
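
The aggregation step could look roughly like this in Python. The input
shape (implementation name mapped to a list of SRFI numbers), the
"srfi-support" S-expression layout, and the function names are assumptions
for illustration, not an agreed format. HTML output is left out.

```python
import json


def aggregate(indexes: dict[str, list[int]]) -> dict[int, list[str]]:
    """Invert per-implementation SRFI lists into SRFI number -> implementations."""
    table: dict[int, set[str]] = {}
    for implementation, srfis in indexes.items():
        for number in srfis:
            table.setdefault(number, set()).add(implementation)
    # Sort by SRFI number, and sort implementation names within each row.
    return {number: sorted(names) for number, names in sorted(table.items())}


def to_json(table: dict[int, list[str]]) -> str:
    """Serialize the table as JSON; object keys must be strings in JSON."""
    return json.dumps({str(number): names for number, names in table.items()}, indent=2)


def to_sexp(table: dict[int, list[str]]) -> str:
    """Serialize the table as one S-expression with a row per SRFI (made-up layout)."""
    rows = ["  (%d %s)" % (number, " ".join(names)) for number, names in table.items()]
    return "(srfi-support\n" + "\n".join(rows) + ")"
```

Both serializers read from the same in-memory table, so the JSON and
S-expression files cannot drift apart.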

> There is currently no standard, to my knowledge, of how
> packages are distributed or how their metadata is published.  Every
> Scheme does it in their own way.  This is an opportunity for
> standardization as well, with benefits to a metadata collection project.

All of that is correct.

> I would be happy to help in the immediate future, though I'm afraid prior
> commitments might tear me away in the long run.

Don't worry about commitments; we can make a repo under
<https://github.com/pre-srfi> and gradually work on it.