SRFI server infrastructure (Was: Format of S-expression metadata for SRFI documents)

SRFI server infrastructure (Was: Format of S-expression metadata for SRFI documents) Lassi Kortela 26 Mar 2019 14:27 UTC

 > In this age of giant services like those of Amazon, Google, and
 > Facebook, it's easy to forget that our machines and networks are
 > incredibly powerful and fast, and that many of our data sets are
 > microscopic in comparison. Brute force solutions are not only easy
 > to implement and practical, but they are often the most useful ones,
 > too.

Fully agreed, but I would argue that those platform providers are the
ultimate triumph of brute force. An unfathomably powerful machinery is
invoked so we can do our little job in the easiest way possible: push
some code or data. From the consumer's point of view, this is the
least sophisticated way to run servers -- in a sense the McDonald's of
servers -- and that's what's liberating about it. The platforms are
not so much about speed as about convenience. The allure is skipping
system administration.

 > I'm happy to keep using Linode to serve the generated files. I
 > already have automatic backups, TLS/SSL certificates (through Let's
 > Encrypt), Nginx, and a web-based control panel. Running the SRFI
 > site costs me nothing in addition to running the several other sites
 > I host as well.

That's fine. I'm sure we can come up with a solution that will be easy
enough to host on Linode too. The main issue is security. (I wouldn't
be comfortable running a Scheme/Racket server with access to the full
Linux file system, so in that respect I sympathize with your approach
of preferring static files. And setting up Docker on your own server
to containerize Racket is only fun for people who love sysadmining.)

 > Continuing with the theme of simplicity, the metadata for all SRFI
 > combined should only require a few hundred kilobytes, especially
 > when compressed. Given that, I argue that clients should fetch the
 > whole thing once and search it locally, perhaps fetching new copies
 > occasionally by checking HTTP headers. [...] This has the benefit of
 > eliminating the latency of fetching results from a server. It also
 > makes clients less dependent on the network to get the data, and it
 > eliminates our need to run a server at all beyond serving static
 > files. As far as I can tell, this would eliminate our need to
 > run a SQL server, too.

That's a great idea! The "official SRFI API" could just be a single
tar file that contains all the HTML and S-expression files. You're
already generating one with all the HTML, right? Compressed, it takes:

* 1.3 MiB - gzip --best
* 1.0 MiB - bzip2 --best
* 0.9 MiB - xz --best

This is a very manageable size for a download. It looks like fancier
compression than bzip2 doesn't bring any real savings to the table.
I'd vote for gzip since everything under the sun can decompress it.
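
For anyone who wants to repeat the size comparison, here is a rough
Python sketch using only the standard library. It assumes the
uncompressed archive fits in memory, and it uses level 9 as the
equivalent of each tool's --best flag. The function name is made up
for illustration.

```python
import bz2
import gzip
import lzma


def compressed_sizes(data: bytes) -> dict[str, int]:
    """Return the compressed size in bytes for each codec at its highest level."""
    return {
        "gzip": len(gzip.compress(data, compresslevel=9)),
        "bzip2": len(bz2.compress(data, compresslevel=9)),
        # lzma.compress writes the .xz container format by default.
        "xz": len(lzma.compress(data, preset=9)),
    }
```

Feeding it the bytes of an uncompressed tar file of all the SRFIs
should give numbers comparable to the ones listed above, though the
Python bindings and the command-line tools may differ slightly.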

 > Even adding metadata for all of R7RS and Snow
 > <http://snow-fort.org/> would not make such an approach impractical.

I'd love to have metadata for those as well, but I think things are
simplest to understand and manage if the SRFI process is
self-contained and the metadata for RnRS and libraries are curated
separately (even if some of the same people work on all of those
collections). It would probably be simplest for historical continuity
of the SRFI process over the years to have a very small footprint of
responsibilities and infrastructure. The publishing rhythm and
requirements of SRFI are also quite different to RnRS and libraries.

FWIW and off topic, I began extracting the RnRS argument lists here:
<https://github.com/lassik/scheme-rnrs-metadata>. It's incomplete but
I think they can be auto-extracted to the same standard as the SRFIs.
All RnRS documents used TeX with roughly the same semantic markup, so
you RnRS editors have laid a good foundation.

Eventually it'd be cool to have that API aggregating all of these
collections, but that's yet another project :) If we can establish
social/technical processes in all the relevant communities to ensure
good source data, then aggregation should be easy.

 > The more I think about the webhook, the more I think it is too
 > complex. The SRFI repos change rarely, and I always have to run
 > manual commands to change them, anyway. Running one more command
 > doesn't add meaningfully to my workload, and then we don't have to
 > maintain Github webhooks, etc. Eliminating the webhook eliminates
 > any new dependency on Github, too. I'd prefer to drop that part of
 > the system and just have a simple command that will extract data
 > from the SRFI documents, combine that with some data that is
 > manually edited, and produce the combined files. Then we can
 > concentrate on the core value that we're providing to our users.

In their final form the HTML and metadata absolutely can be hosted
from anywhere. The tricky thing is the editing phase, especially if
volunteers send pull requests to amend the HTML of finalized SRFIs.
This is a difficult problem for sure.

Essentially: we have now almost solved the problem of metadata
extraction, and the key question has shifted to data consistency. We
know how to retrieve data from file systems, databases, HTTP, GitHub
origin repos, Git clones on our computers -- basically anywhere -- and
post-process it to make releases and serve them to the public in
various ways. Those things will work out one way or another.

The question is, how do we decide which data to retrieve?

Every time we generate a release, say "srfi-all.tar.gz", we'd like to
be sure that that release contains the latest valid versions of all
SRFIs and metadata. Tools can check that it's valid, but how do we
know that it's the latest stuff?

First we have to decide which place is blessed as the "point of truth"
where the latest sources are collected. Is it the GitHub origin repos
or the Git clones on the SRFI editor's personal computer? The
release-making tool will poll this place only.

If it's GitHub, then

1) We need to install a webhook or a CI job (Travis CI, etc.)
2) PRs can be checked by the same webhook/CI as the master branch.
    This is great for volunteers who send PRs.
3) The webhook/CI might as well push a new release after every merge
    to the master branch. So the release is always up to date with no
    human effort.
4) Installing CI jobs into 160+ repos is difficult. So with this
    approach we'd have to use 'git subtree' to make a mega-repo
    containing all SRFIs as subdirectories. Then the CI job would run
    in the mega-repo. Volunteers would probably also send their PRs to
    this repo.
5) A CI job could generate the "srfi-all.tar.gz" file and the
    dashboard page as static files, then push them to a static file
    server via SFTP, Amazon S3 API, etc. Deployment would be simple. A
    webhook server could also do this, but it could also serve that
    content itself since it's already a web server. To me, it doesn't
    matter all that much which approach is chosen here. Both are fine.

If it's the editor's personal computer, then

1) The editor should check that they have a clean working tree (have
    committed everything and pulled all the latest changes from GitHub)
    before making a release.
2) The editor has to make releases manually by running a script. To
    me, this raises the question of why not run that same script
    automatically via webhook/CI.
3) The editor has local Git clones of every SRFI so it's easy to get
    at all their files via the file system. This is a big plus for this
    approach.
4) On the other hand, it's still not much easier to check we have all
    the latest stuff before release (the script would have to poll all
    the GitHub origin repos).

Of course, there's the alternative of not having an automatic tool to
ensure we have the latest commits from all the SRFIs before release.
But since the automated approach doesn't seem substantially difficult
to me, I would favor it.
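
As a rough illustration of what such a pre-release check could look
like on the editor's computer, here is a Python sketch. It assumes
that each clone has a remote named "origin" and that the branch
checked out locally is the origin's default branch. All function names
are hypothetical. The comparison logic is kept separate from the Git
calls so it can be tested without a network.

```python
import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside repo and return its stdout, stripped."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def remote_head(ls_remote_output: str) -> str:
    """Extract the commit hash from `git ls-remote origin HEAD` output."""
    # Each line looks like "<hash>\tHEAD"; an empty repo prints nothing.
    first_line = ls_remote_output.splitlines()[0] if ls_remote_output else ""
    return first_line.split("\t")[0]


def problems(status: str, local: str, remote: str) -> list[str]:
    """Describe why a clone is not ready for release (empty list means ready)."""
    found = []
    if status:
        found.append("working tree has uncommitted changes")
    if local != remote:
        found.append("local HEAD differs from origin HEAD")
    return found


def check_clone(repo: Path) -> list[str]:
    """Check one clone against its origin; this step needs network access."""
    return problems(
        git(repo, "status", "--porcelain"),
        git(repo, "rev-parse", "HEAD"),
        remote_head(git(repo, "ls-remote", "origin", "HEAD")),
    )
```

A release script would call check_clone on each of the 160+ clones and
refuse to continue if any of them reports a problem. Note that
"differs" covers both directions: the clone may be behind the origin
or may have unpushed commits.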

No matter which of the above approaches we choose, a major hurdle is
that the SRFIs are split into 160+ repos. All of the above would be
quite simple if they were all in one mega-repo because it's simple to
check for consistency and having the latest stuff (in GitHub, just set
up an ordinary Travis CI job -- on a personal computer, just do one "git
pull" instead of a hundred).

That being said, I see the benefits of having a separate repo for each
SRFI. Particularly in the draft phase, so the author can clone only
their own SRFI and not be bothered by updates to the other ones.

It would seem that draft SRFIs and finalized SRFIs have strikingly
different requirements for effective workflow: draft SRFIs are worked
on individually, whereas finalized SRFIs are worked on in batches. I
didn't realize this at all until now! I think this is the root cause
of all the complexity in this hosting/release problem.

I personally think the GitHub organization webhook is the only
effective approach for ensuring consistency across a massive number of
repos (160+). It's still not foolproof because the server may fail to
respond to the webhook, which bugs me a little, so it's not ideal.

Would it be impossible this far into the process to change the Git
conventions so that only draft SRFIs have their own repos under
<https://github.com/scheme-requests-for-implementation/> and
finalized/withdrawn SRFIs would be collected into one big repo? The
metadata and markup work could then happen in the big repo.

The key enabler here would be 'git subtree'. It allows each SRFI to be
a subdirectory in the big repo. Each subdirectory then mirrors 1:1 the
contents of that SRFI's individual repo (from which the draft came). If
there's ever a need to update the individual repo with changes from
the big repo, or vice versa, 'git subtree' allows copying commits in
both directions surprisingly easily. (If you copy commits from the big
repo to the small one, it will simply leave out all mention of files
outside the subtree, and discard commits that didn't touch the subtree
at all.)

So 'git subtree' means that we don't ever have to make an inescapable
commitment about how we lay out the repos. If we change our minds we
can copy commits between repos.
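
To make the 'git subtree' commands concrete, here is a small Python
sketch that builds the command line for each direction. It assumes
that the individual repos are named "srfi-N" under the organization
URL above, that the subdirectory in the big repo has the same name,
and that the branch is "master". The helper name is hypothetical, and
the commands are meant to be run from the root of the big repo.

```python
ORG_URL = "https://github.com/scheme-requests-for-implementation"


def subtree_command(action: str, srfi_number: int, branch: str = "master") -> list[str]:
    """Build a `git subtree` command line for one SRFI.

    "add"  imports the individual repo as a new subdirectory,
    "pull" merges new commits from the individual repo into the subdirectory,
    "push" sends subdirectory commits back to the individual repo.
    """
    if action not in ("add", "pull", "push"):
        raise ValueError(f"unsupported subtree action: {action}")
    name = f"srfi-{srfi_number}"  # assumed naming of repo and subdirectory
    return ["git", "subtree", action, f"--prefix={name}", f"{ORG_URL}/{name}", branch]
```

The resulting list can be handed to subprocess.run as is. For example,
subtree_command("add", 1) corresponds to running "git subtree add
--prefix=srfi-1 <repo URL> master" by hand.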

This is a lot to think about, but we could run experiments...

The really nice thing about the big repo is, we could run the release
tool in a bog standard free Travis CI job and Travis would give us
pull request checks with no effort (if we use a web server with a
webhook, implementing those PR checks takes the most effort). And tons
of developers are familiar with that Travis/Jenkins-style CI workflow
so there's nothing exotic about it. We also wouldn't have to deal with
Git or any GitHub API stuff -- Travis makes a Git shallow-clone of the
commit it needs, and then our tool just reads the local file system
without worrying about Git or databases. This also means that the tool
can run completely unchanged on a personal computer if we ever stop
using a CI system. No need to bundle web server and API/db stuff with
our tool. I'm beginning to warm up to the idea as well.
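
A minimal sketch of such a file-system-only release step, in Python,
might look like the following. It assumes a checkout of the big repo
at some root directory and simply packs every non-hidden file into a
gzip-compressed tar. The function names are hypothetical, and a real
tool would also validate the files and generate the metadata first.

```python
import tarfile
from pathlib import Path


def release_members(root: Path) -> list[Path]:
    """List the files to ship, sorted, skipping hidden entries such as .git."""
    return sorted(
        path for path in root.rglob("*")
        if path.is_file()
        and not any(part.startswith(".") for part in path.relative_to(root).parts)
    )


def build_release(root: Path, archive: Path) -> int:
    """Write a gzip-compressed tar of root and return the number of files added."""
    members = release_members(root)
    with tarfile.open(archive, "w:gz") as tar:
        for path in members:
            # Store paths relative to root so the archive unpacks cleanly.
            tar.add(path, arcname=path.relative_to(root).as_posix())
    return len(members)
```

Because it touches nothing but the local file system, the same
function would work unchanged in a CI job and on a personal computer.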

The only question with the Travis CI approach would be how to upload
the "srfi-all.tar.gz" and the dashboard web page to some static web
server. Apparently Travis can upload files to any server via SFTP
(<https://docs.travis-ci.com/user/deployment/custom/>) and to Amazon
S3 (<https://docs.travis-ci.com/user/deployment/s3/>).

Thoughts?