Re: John Cowan's preliminary notes on a SQLite oriented Simple SQL (ssql) API Peter Bex 14 Sep 2019 09:41 UTC
On Sat, Sep 14, 2019 at 12:49:48AM -0400, John Cowan wrote:
> On Thu, Sep 12, 2019 at 8:11 AM Peter Bex <xxxxxx@more-magic.net> wrote:
> > - It seems unnecessarily complicated to me to have a special
> >    "transaction" object.  All SQL databases I know of have a
> >    transaction be a state of the connection.  So you can't
> >    create multiple transactions on the same connection.
>
> That was my original idea.  But I thought I'd future-proof it a little, as
> databases like Cockroach and Foundation start to appear that layer a SQL
> engine over ordered key-value stores that are transaction-first.  In most
> cases the transaction object would be the same as the database object.  I'm
> open to persuasion here, though.

It seems to me that the SQL layer only allows the same sort of transaction
management as other SQL engines support, simply because the primitives
are identical.  So while the underlying store may be able to do funky
things with transactions, the SQL layer will be limited to the same kind
of access as the other databases we've discussed.

I would consider it overengineering to add this "just in case it might
be needed", and it would make the API more error-prone.  I found this one
of the more confusing things about WebSQL, actually; that API was also
modeled to give you a transaction object, but you could never let the
object escape or use an outer transaction inside an inner transaction,
so there was really no point to it.

See https://www.w3.org/TR/webdatabase/

Let's not repeat their mistakes.

> > - I'm not sure giving blobs such prominence is needed.  I still
> >    haven't gotten around to implementing blob support in the
> >    postgresql egg, and I think I've never used blobs in my professional
> >    career, ever.  It's a bit of an antipattern to store file data in
> >    databases.
>
> Blobs are going to be important for representing more or less arbitrary
> Scheme data that doesn't have an external S-expression representation.  I
> have an idea, not really written down properly yet,  for using a variant of
> ASN.1 called LER (Lisp Encoding Rules) that provide binary representations
> of proper and improper lists, multi-dimensional arrays,  and other Lispy
> things.  A table of encodable types is at <http://tinyurl.com/asn1-ler>.

You don't need the blob *API* just to store binary data.  Small binary data
can be stored in a BLOB or BYTEA column (which does not need a separate
API to access).

The blob API is intended to provide stream-like access to *large* binary
data.  You can more easily do partial updates, for example, and you can
fetch chunks of the data separately.

See https://www.postgresql.org/docs/current/largeobjects.html
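
To make the distinction concrete, here's a rough sketch of the server-side
SQL equivalents of that streaming access in Postgres.  The OID (16405) and
descriptor (0) are made up for illustration:

BEGIN;
SELECT lo_create(0);              -- allocate a new large object; returns its OID, e.g. 16405
SELECT lo_open(16405, 131072);    -- open it for writing (131072 = INV_WRITE); returns a descriptor
SELECT lowrite(0, '\xdeadbeef');  -- write one chunk through the descriptor
SELECT lo_lseek(0, 0, 0);         -- seek back to the beginning
SELECT loread(0, 2);              -- read the first two bytes back
SELECT lo_close(0);
COMMIT;

A BYTEA column needs none of this; you just read and write the whole value.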

> > - I would extend transaction support to include a "level".  Like I
> >   mentioned in my other mail, you can nest transactions via savepoints
> >   which can be individually rolled back or "committed" (which is a no-op).
>
> That's only partially nested transactions: you can have two levels but not
> three.

You can provide a nested API that goes up to any level (the Postgres egg
does this).  The important thing is that you always roll back savepoints
one by one.  Here's a nesting of three levels:

BEGIN TRANSACTION;
  ...
  SAVEPOINT s1;
    ...
    SAVEPOINT s2;
      ...
    RELEASE SAVEPOINT s2;
    ...
  RELEASE SAVEPOINT s1;
  ...
COMMIT TRANSACTION;

You can also roll back the innermost savepoint s2 without also rolling
back s1 or the surrounding "proper" transaction.  So as far as I'm
concerned, these semantics are a perfect fit for a nested
"with-transaction" operator.

Like I said, the only thing you can't do is change the isolation level
inside a savepoint.  That's reserved for the outer transaction.
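
Just to make that concrete, here is a minimal sketch of a nested
with-transaction built on savepoints.  Note that exec! (execute a
statement, discard its result) and the naming are placeholders, not the
postgresql egg's actual implementation:

(define current-depth (make-parameter 0))

(define (with-transaction conn thunk)
  (let* ((depth (current-depth))
         (sp    (string-append "sp" (number->string depth)))
         ;; level 0 is a real transaction, deeper levels are savepoints
         (begin-sql    (if (zero? depth) "BEGIN" (string-append "SAVEPOINT " sp)))
         (commit-sql   (if (zero? depth) "COMMIT" (string-append "RELEASE SAVEPOINT " sp)))
         (rollback-sql (if (zero? depth) "ROLLBACK"
                           (string-append "ROLLBACK TO SAVEPOINT " sp))))
    (exec! conn begin-sql)
    (guard (e (#t (exec! conn rollback-sql) (raise e)))   ; roll back this level only
      (let ((result (parameterize ((current-depth (+ depth 1)))
                      (thunk))))
        (exec! conn commit-sql)
        result))))

Nesting it then does the obvious thing:

(with-transaction conn
  (lambda ()
    ...
    (with-transaction conn      ; this level becomes SAVEPOINT sp1
      (lambda () ...))
    ...))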

> MySQL also has the problem that a DDL statement commits the current
> transaction (not true in SQLite).

We've already covered how hard MySQL sucks ;)  I think this is a
limitation that the API should not have to deal with.  And who knows,
they might add transactional DDL in the future.

> And it's implementation-dependent whether a uncommitted inner
> transaction is visible to the outer transaction.  It's messy.

Can you clarify that?  AFAIK changes made in an inner transaction (via
savepoints) are always visible in the outer transaction after the
savepoint has been released.  And with the nested API I propose (where
transactions aren't first-class) you can't even access the outer
transaction before first committing or rolling back the inner
transaction (as it should be, IMO).

> > - The "meta" interface does not look very well-thought out.  This stuff
> >   is super complicated and the API seems too limited for proper
> >   introspection.
>
> I agree: it was a last-minute idea.  What matters for SQLite purposes (or
> at least for me) is to know what tables exist, and for an existing table,
> what its column names are, their types, and their nullable status.  The
> rest is dispensable.

I think it's best to add this as an extension in the sqlite driver.
The introspection capabilities of various databases are so wildly
different that this is hard to standardise properly.

You could offer a "minimum subset" like the above, but then how do you
access other properties?  Via another API?  That would just be ugly.

If this is absolutely necessary, I would propose having a more bulky API
that is more extensible.  For example:

(column-names TABLE) => list of strings
(column-types TABLE) => list of symbols (or strings?)
(column-nullable TABLE) => list of booleans

However, even this is tricky because with column-types, in Postgres I
would prefer being able to detect that array[int] is an array of type
int, without having to parse the type name manually.  And then there are
standard types like VARCHAR(n) where you would like to split the type
(VARCHAR) from the length property, or DECIMAL(N, M) where the type has
two properties.
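
One possible direction (purely hypothetical, not an existing API) would be
to return structured descriptors rather than bare type names, e.g.:

(column-types 'orders)   ; "orders" is just a made-up table
  => (integer (varchar 255) (decimal 10 2) (array integer))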

This really requires much more thought.  Perhaps it could be a separate
SRFI.

> > - Drop the difference between statements and queries; in Postgres at
> >    least you can get a result set with INSERT...RETURNING, so the
> >    difference is moot.  Everything can return a result set.
>
> I'm okay with saying that INSERT...RETURNING should be treated as a query.
> The difference is between operations that produce a stream of results and
> those that do not.  Obviously a DDL statement, a delete, or whatever, could
> be treated as producing a stream of results that is always empty, but what
> for?

For simplicity and elegance.  I would want to see a reason for having
two rather than one.  I mean, in Scheme we don't really distinguish between
expressions and statements either, and it buys us a lot of generality.

> > - Perhaps represent result sets as opaque random-access objects?
>
> I don't think random access makes sense, especially for large result sets.
> The idea is that a statement can be executed if you don't care about its
> results, but if you do, you get a query object that acts as a generator.

In practice, result sets are returned in their entirety to the client.
The reason for this is that you don't want a round trip to the server
and back every time you access a row or column.  Also, the server could
go away (be restarted or whatever) and then you're stuck without being
able to access the result set anymore.

If you don't want large result sets to be instantiated completely, you
will need to create a server-side cursor and use that to do random access
on the result set.  Then the partial result set you fetch from the
cursor is returned as a result set object to the client.  And that result
set object is random-access because it's simply stored as a large vector,
basically.
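
For reference, the kind of server-side cursor meant here looks roughly
like this in Postgres (the table name and batch size are placeholders):

BEGIN;
DECLARE big_cur CURSOR FOR SELECT * FROM big_table;
FETCH FORWARD 100 FROM big_cur;   -- first 100 rows, delivered as an ordinary result set
FETCH FORWARD 100 FROM big_cur;   -- next 100 rows
CLOSE big_cur;
COMMIT;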

So no matter the size of the result set, it will always be allocated
in memory.  In the Postgres egg, I debated what to do with this because
I didn't really want this ugly opaque object to sit there in memory.

But converting it to Scheme objects directly meant two things:

- I'd need some way to store the metadata associated with the result set.
- While converting to Scheme objects, you end up doubling your memory
  usage.  This is because you can only release the result set object
  all at once, not incrementally.

Now presumably, the latter issue wouldn't be a problem for a native
Scheme implementation of the wire protocol.  But you still have to
store the metadata somewhere.

> > Those
> >    can be converted to Scheme lists as needed.  This could provide some
> >    efficiency and makes it easier to access meta data like data types of
> >    the columns in the result set etc.
> >
>
> I've been thinking about whether it's better to return a row as a list, as
> currently, or as an a-list mapping names to values.  The latter is
> definitely easier to use, but also more work.

It's more work to generate, and requires quite a lot more memory to
store.  And finally, traversing it would be more work as well because
you need to use one of the assoc procedures, which traverse the list,
while if you know the index you can just grab the field at that position.
Result sets *are* conceptually lists of tuples, where the column index
is known and relevant.

Besides, something like SELECT 1, 2 has no column names.  And you can
even use the same alias multiple times: SELECT 1 AS foo, 2 AS foo; does
not produce an error.  So you'd still need to know the position of the
column to access a specific one (but this is an edge case that might
not be worth thinking about).

I've debated this a lot in the postgresql egg, because accessing columns
by index frankly sucks.  But in high-performance situations, looking up
the index once and then accessing it directly is quite useful.  And I
also came to the conclusion that if you are fetching rows but not using
some of their columns, you're probably selecting too many columns, which
is an antipattern when writing SQL by hand.

So the API in the postgresql egg is:

(row-fold* (lambda (x y result) ...) 0 (query "SELECT x, y"))

The unstarred version works the same but passes a list to the
procedure:

(row-fold (lambda (row result) (let ((x (car row)) (y (cadr row))) ...))
          0 (query "SELECT x, y"))

I'm not super happy with that yet.  Maybe the unstarred row-fold would have
been better off passing a vector to the lambda.  Providing a version
which passes in an alist or hash table might also be an option.
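
For example, a hypothetical alist-passing variant (not something the egg
provides today) might be used along these lines:

(row-fold/alist
 (lambda (row result)
   (let ((x (cdr (assq 'x row)))
         (y (cdr (assq 'y row))))
     ...))
 0 (query "SELECT x, y"))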

Cheers,
Peter