Re: Database connections as subprocesses

Show/hide message thread

Database connections as subprocesses Lassi Kortela (14 Sep 2019 07:30 UTC)
Re: Database connections as subprocesses John Cowan (15 Sep 2019 01:06 UTC)
Re: Database connections as subprocesses Lassi Kortela (15 Sep 2019 06:28 UTC)
Re: Database connections as subprocesses John Cowan (15 Sep 2019 23:02 UTC)
Re: Database connections as subprocesses Lassi Kortela (16 Sep 2019 08:22 UTC)
Binary S-expressions Lassi Kortela (16 Sep 2019 17:49 UTC)
(missing)
Re: Binary S-expressions Lassi Kortela (17 Sep 2019 09:46 UTC)
Re: Binary S-expressions Alaric Snell-Pym (17 Sep 2019 11:33 UTC)
Re: Binary S-expressions Lassi Kortela (17 Sep 2019 12:05 UTC)
Re: Binary S-expressions Alaric Snell-Pym (17 Sep 2019 12:23 UTC)
Re: Binary S-expressions Lassi Kortela (17 Sep 2019 13:20 UTC)
Re: Binary S-expressions Lassi Kortela (17 Sep 2019 13:48 UTC)
Re: Binary S-expressions Alaric Snell-Pym (17 Sep 2019 15:52 UTC)
Re: Binary S-expressions hga@xxxxxx (17 Sep 2019 16:25 UTC)
Re: Binary S-expressions rain1@xxxxxx (17 Sep 2019 09:28 UTC)
Re: Binary S-expressions Lassi Kortela (17 Sep 2019 10:05 UTC)
Python library for binary S-expressions Lassi Kortela (17 Sep 2019 21:51 UTC)
R7RS library for binary S-expressions Lassi Kortela (17 Sep 2019 23:56 UTC)
Re: Database connections as subprocesses Alaric Snell-Pym (16 Sep 2019 08:40 UTC)
Re: Database connections as subprocesses Lassi Kortela (16 Sep 2019 09:22 UTC)
Re: Database connections as subprocesses Alaric Snell-Pym (16 Sep 2019 11:28 UTC)
Re: Database connections as subprocesses hga@xxxxxx (16 Sep 2019 13:28 UTC)
Re: Database connections as subprocesses Lassi Kortela (16 Sep 2019 13:50 UTC)
Re: Database connections as subprocesses hga@xxxxxx (17 Sep 2019 13:59 UTC)
Re: Database connections as subprocesses John Cowan (16 Sep 2019 22:41 UTC)
Re: Database connections as subprocesses Lassi Kortela (17 Sep 2019 09:57 UTC)
Re: Database connections as subprocesses Lassi Kortela (17 Sep 2019 10:22 UTC)

Re: Database connections as subprocesses Lassi Kortela 16 Sep 2019 09:22 UTC

> Well, if it's a protocol, then drivers can be written in anything - I
> suspect the point was that lots of databases have C client libraries, so
> writing a protocol driver around them in C would be the easiest option.

Exactly :)

> But: I suspect this is a distraction. Linking to C libraries in-process
> versus writing a C wrapper that we talk to in a separate process is
> perhaps more about how an FFI to C is implemented, than inherent to the
> design of a database interface...

Indeed, this approach is more for situations where an implementation
doesn't have a FFI or you don't want to use it. I propose it as a
complementary approach to FFI wrappers, not as a replacement.

To the Scheme application programmer, it would be best if any subprocess
approach offers an identical (or nearly identical) interface to an
FFI-based approach. It would just be an implementation detail. Ideally
the Scheme interface would be specified so that it supports both
implementation styles (or mixing and matching them for different
databases, depending on what happens to be available or expedient on the
host system). We could start implementations using both approaches early
to make sure the spec takes all scenarios into account.

> I'd suggest we standardise a Schemely
> interface between database drivers and any intermediate infrastructure,
> then individual drivers could be:
>
> 1. SQLite-style in-process database engines in native Scheme (I can dream!)

There are some SQLite-style DBs in pure Go, but last I checked they
don't support SQLite files as backing stores. Not sure if they even
support SQL. I assume other languages are further behind (possibly Rust
is now Go's equal given all the excitement around it). SQLite is very
hard to replicate, and its test suite is legendary.

> 2. Scheme implementations of wire protocols for databases such as MySQL,
> PSQL, etc
> 3. Scheme adapters to in-process FFIs for libpq et al
> 4. Scheme code to talk a standardised subprocess protocol to external
> driver binaries, written in whatever language

Agreed. I like all of these alternatives. They suit different scenarios.

> ...(4) is a useful option, but I think one that needs to be defined
> separately from the Scheme-level API first, so that (1)-(3) can also be
> done. As a portable library, there will be value in having a
> separate-process driver for, eg, PostgreSQL that can be used in any
> scheme where subprocesses are a thing and the driver can be compiled
> (eg, on Linux), while schemes with a suitable FFI can choose to use a
> less-portable but more efficient binding to libpq, etc. Let a community
> of different drivers bloom!

This is exactly what I intended with the proposal :)

>> While I'm at it, how about this extremely simple protocol:
> [snip]
>> Comments?
>
> I'm not sure how the wire protocols work for MySQL, PostgreSQL, etc. but
> I seem to recall that the library interfaces at least in principle allow
> you to do something like:
>
> rs1 = query("SELECT * FROM very_big_table");
> rs2 = query("SELECT ... FROM t1, t2, t3, t4, t5 WHERE ...join
> conditions...");
>
> repeat {
>     r1 = rs1.read_row();
>     r2 = rs2.read_row();
>     ...some process involving r1 and r2...
> } until rs1.eof() || rs2.eof();
>
> ...without consuming unbounded local storage or deadlocking. Eg, have
> two queries streaming results back at the same time over the same
> "connection".

That's very cool, but probably also highly non-trivial to implement
reliably. Given that the subprocess thing is probably more suited to
situations that are not performance-sensitive, it might be best to go
for simplicity and permit only one query at a time.

As for the in-process implementations, if they are FFI-based I imagine
the underlying C library has to do some non-trivial poll and/or thread
stuff to juggle the concurrent queries. In this case, it might be most
reliable to use a fast Scheme that handles polling and threads in its
own runtime, and write a pure-Scheme driver for the Postgres/MySQL
socket protocol. That driver could then juggle the concurrent stuff,
relying on Scheme's safety guarantees.

> A few years back, I worked for a SQL database company; we had a wire
> protocol that our ODBC driver used to talk to the database servers, and
> the API structure was roughly like this:
>
> - StartQuery(query text) -> query ID
> - FetchResults(query ID) -> maybe(list(record))
> - GetMetadata(query ID) -> list(column name/type declaration)
> - CancelQuery(query ID)
>
> ...plus a bunch of boring server/connection-level metadata get/set
> operations, and some other stuff for setting up prepared statements that
> I never looked at.

Nice! So you've basically already done this before :) That command
language seems quite reasonable.

> But the key part was, you could issue a whole bunch of StartQuery
> operations then call FetchResults on the resulting query IDs with an
> arbitrary interleaving; each would return an arbitrarily-sized chunk of
> records, or Nothing if the query was finished and no more records would
> ever be returned. The ODBC client buffered those chunks as the
> application requested a record at a time, to amortize the network round
> trip overheads.

In principle, this would likely be relatively easy to do with a
subprocess (compared to threads and FFI). But since parallel queries are
for high-performance situations, people might want to go with the
pure-Scheme clients anyway. This is just speculation; since you've
actually worked with this stuff in a production setting, feel free to
offer counterpoints.

> (What that database server did at the backend is another, fascinating,
> story I summarised in a blog post:
> http://www.snell-pym.org.uk/archives/2016/12/11/cool-things-i-have-worked-on-clustered-analytic-database/
> )

That is some seriously impressive database wizardry.

> But: in general, I think a general database access protocol needs to
> support this, as some backends will be capable of doing that sort of
> query interleaving.

This should be addressed in the Scheme-side interface that is visible to
application programmers. Figuring out how to do it in a manner that is
portable across implementations will be interesting.. It may be best to
have two SRFIs: a simple serial one and a complex one that supports
interleaving.