Re: RDBMS data types and NULL representation Peter Bex 11 Sep 2019 09:30 UTC
On Wed, Sep 11, 2019 at 11:53:02AM +0300, Lassi Kortela wrote:
> Thanks for the detailed remarks Peter!

You're welcome, and thanks for the quick response!

> > In the CHICKEN ecosystem, we have a few DB eggs that people enjoy using.
> > The one that comes up most is sql-de-lite[1], which is indeed a delight
> > to work with, very elegant and high-level.  It offers a rather
> > "functional" approach with fold and map and such over rows.  I've modeled
> > the postgresql[2] egg on it.  I think these APIs are very Schemely and
> > not so "object-oriented" as you see in some other libraries.
>
> Some lesser-known Schemes have GDBM and ODBC support. Some SRFIs might make
> sense in this area. We could try to come up with a good partitioning of the
> problem domain to SRFIs before starting work on any of them. I only know
> basic SQL myself, so I'll leave this to others.

I think SQL itself isn't too relevant when considering the DB API; it's
mostly about how to set up a connection, how to send commands and
prepare statements, and how to navigate result sets.

> I'm drafting a JSON SRFI. The most promising convention is the one you also
> propose: represent JSON arrays as Scheme vectors and JSON objects as alists.
> It would make sense to use the same mapping for all other hierarchical
> formats too.

Right, maybe the Postgres egg should use some other representation for
arrays.  Not sure yet (see below; maybe it can use SRFI-25 arrays).

> What are Postgres tuples like - are they named tuples? Is it conventional or
> easy to get the names of the fields and use them, or is the field mapping
> implicit based on the order the values are written?

They're order-based.  In principle they are also named (a tuple is
typically a row from a table, which has column names), but there is
syntax for constructing an anonymous row: SELECT (1, 2); returns an
anonymous row with no column names.  I think it maps to the generic
"record" type.

I don't think you can access the names directly, but you could look
them up via the types mapping (each returned row has column type OIDs
which can be looked up in the system catalogue).
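
For example, something like this (an illustrative query; OID 23 is the
fixed OID of the built-in int4 type):

SELECT typname FROM pg_type WHERE oid = 23;
 typname
---------
 int4
(1 row)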

> A HStore object looks just like a JSON object:
> <https://www.postgresql.org/docs/9.0/hstore.html>. Are there subtle
> differences, and can you nest JSON within a HStore and/or vice versa?

I think HStore has been mostly superseded by JSON.  HStore contains
key/value pairs which are both strings, as far as I know, so you can't
nest JSON inside it.  Vice versa wouldn't be possible anyway, because
JSON has no way to represent this type of object.

> > And in MySQL/Mariadb I think you have only JSON but no
> > true arrays, which might make choosing vectors to represent arrays in
> > general a bit awkward
>
> In what way do you mean it is awkward?

In that SQL arrays map most naturally to Scheme vectors, but for
databases that don't have arrays but *do* have JSON types, it's annoying
that you can't use Scheme vectors for JSON arrays (assuming all databases
use the same mappings in this SRFI).

> > For now, mapping arrays to vectors seems to be a decent choice, but
> > there are some issues with Postgres array bounds.  For one, arrays are
> > one-based by default.  But they can also have custom bounds that aren't
> > one-based.  For example, '[-1:2]={1,2,3,4}'::int[] constructs an array
> > that has a lower bound of -1 and an upper bound of 2.
>
> Like Pascal arrays. Can there be holes in the indexes?

No, values are always present.  If you extend an array, the missing slots
will be NULL.  However, arrays can be multidimensional.  So on second
thought perhaps arrays don't map so nicely to vectors, but rather to SRFI-25
arrays.  I used vectors because those are native in CHICKEN, and I didn't
want to introduce unnecessary dependencies.  Same for the datetime object
types; I don't want to drag in SRFI-19, which is a beast.
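
For reference, Postgres can report these bounds itself, along the lines
of (array_lower, array_upper and array_ndims are built-in functions):

SELECT array_lower(a, 1), array_upper(a, 1), array_ndims(a)
  FROM (SELECT '[-1:2]={1,2,3,4}'::int[] AS a) AS t;
 array_lower | array_upper | array_ndims
-------------+-------------+-------------
          -1 |           2 |           1
(1 row)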

> Are Postgres arrays typed, untyped, or either? What's the difference between
> arrays, byte arrays and (different kinds of) strings?

Postgres arrays are always typed as far as I know.

Arrays are multi-dimensional objects with custom bounds, as explained
above.  Byte arrays are basically SRFI-4/R7RS u8 bytevectors.  There is
really only one kind of string in Postgres, but it's covered by
different types:

- text is the standard type
- varchar(n) is a text with a length constraint; can't be longer than n
- char(n) is a text with a fixed length of n; shorter values are padded
   with spaces
- citext (from the citext extension) is the same as text, but it uses
   case-insensitive comparisons, so a UNIQUE index will enforce
   case-insensitive uniqueness.

In MySQL, the story is a lot more confusing, because text columns have
limitations (you can't index them without specifying a prefix length),
and there are still limits on the length even though there shouldn't
be: there's mediumtext, longtext and so on.  It's a total shit show:
https://mariadb.com/kb/en/library/text/

In my Postgres egg, I ignore the different text types and simply map
everything to a string.  If a string is inserted into a column, the DB
will give an error if it is too long, and will automatically pad it
with spaces for char(n).  This does mean read-write invariance is
broken for char(n): if you read back the string, it will have extra
spaces added to the end.
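
For example, in psql (the padding is invisible here, but the returned
value is "ab" followed by three spaces):

SELECT 'ab'::char(5) AS c;
   c
-------
 ab
(1 row)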

> > Another important thing we should nail down early is what to do with the
> > SQL NULL type.  In CHICKEN, we have a separate sql-null[3] egg that deals
> > with these values, which is used by some (but, annoyingly, not all) eggs
> > as a common denominator.  Some interfaces use #f to indicate NULL, but
> > that causes trouble if the database in question also supports a boolean
> > type.
>
> For JSON we're going with the symbol `null` as the null indicator. Many
> Scheme implementations have a void or undef type, which might be better, but
> it's not as portable.

Using a symbol here is simple, but it does mean the symbol type can't be
used for other things, which is a bit of a shame.

Besides, a JSON null is not the same as an SQL NULL:

SELECT 'null'::json is null;
 ?column?
----------
 f
(1 row)

MySQL agrees:

mysql> SELECT CAST('null' AS JSON) is null;
+------------------------------+
| CAST('null' AS JSON) is null |
+------------------------------+
|                            0 |
+------------------------------+
1 row in set (0.00 sec)

This could cause subtle bugs if they were mapped to the same value:
SQL NULL indicates the absence of a value, whereas a JSON null is
itself a value.

I also think using void/undefined for SQL NULL is not a great move,
because that type isn't really standardised: you can't portably test
for it with a predicate, and you can't really construct it for use in
an insert query or what have you.
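
For what it's worth, a disjoint SQL NULL type is easy to define
portably.  A minimal sketch along the lines of the sql-null egg (the
names sql-null and sql-null? match that egg's API; the record type
itself is an implementation detail):

(define-record-type sql-null-type
  (make-sql-null)   ; internal constructor, zero fields
  sql-null?)        ; predicate: #t only for the SQL NULL sentinel

;; A single canonical instance, so the value can also be compared
;; with eq?.
(define sql-null
  (let ((instance (make-sql-null)))
    (lambda () instance)))

With this, (sql-null? (sql-null)) => #t and (sql-null? #f) => #f, so
#f stays available as a genuine boolean.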

Cheers,
Peter