
Re: Constructing master lists of data types Alaric Snell-Pym 01 Oct 2019 09:11 UTC
On 30/09/2019 16:09, Lassi Kortela wrote:

>> databases tend to be set to one encoding
>
> Yes, but the percentage of data in a large database that actually uses
> the claimed encoding is another matter :) I've dealt with huge databases
> where there were continual character encoding problems. Likewise, people
> who administer long-lived web forums can be heard complaining about it.

This (wrongly-encoded text) is a bigger problem than just in databases
(it happens in files on disk and things fetched via HTTP and friends,
too), so I wonder if we should just approach it by:

1) For normal use, assuming that the declared encoding is correct
(whether it's hardcoded, configured, obtained from the DB, overridden
per column, etc., depending on the driver) when getting strings
to/from the DB.

2) Also providing an override to read/write strings as bytevectors,
for advanced use (a rough sketch of both levels follows).
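
Here's that sketch; dbi-query and dbi-query-raw are made-up names for
illustration, not any existing API:

  ;; 1) Normal use: trust the declared encoding and hand back strings.
  (dbi-query conn "SELECT name FROM users")
  ;; => rows whose text fields are Scheme strings

  ;; 2) Advanced use: skip decoding entirely and hand back the bytes
  ;;    exactly as stored, for data whose encoding is suspect.
  (dbi-query-raw conn "SELECT name FROM users")
  ;; => rows whose text fields are bytevectors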

I did wonder about providing some kind of library to "fix wrongly
encoded strings" - e.g. "this string was read from a bytevector as
ISO-8859-1, but I now think it's actually UTF-8, because where I
expected a pound sign I see £". On reflection, though, that
shouldn't be encouraged: some incorrect decodings lose information,
so the fix isn't possible in general, and it would be far better to
encourage keeping the original bytevectors around for any strings of
unknown encoding and re-decoding them once the encoding is known.
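
To illustrate the "keep the bytes, re-decode later" approach, using
only utf8->string from R7RS (scheme base):

  (import (scheme base))

  ;; The raw bytes as they arrived; #xC2 #xA3 is the UTF-8 encoding
  ;; of the pound sign.
  (define raw (bytevector #xC2 #xA3))

  ;; Once we know (or decide) the data really is UTF-8, decoding the
  ;; kept bytevector is a plain decode, not a lossy repair of a
  ;; string that was already mis-decoded as ISO-8859-1.
  (utf8->string raw)   ; => "£"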

>> but SQLite3 is more flexible. Suggestions on precisely what to do here
>> are solicited; this is not a thing I know very well.
>
> If the DB engine encoding is known, and we're running in a Scheme
> implementation like Gauche that supports many character encodings, we
> can construct strings with the correct encoding right off the bat.

Yeah! That's probably the easiest thing for users. As far as possible,
create the illusion that strings are strings of characters, and their
representation in bytes is an implementation detail.

> In Schemes that use only one character encoding internally, and the
> query results use a different encoding, we should probably return the
> results as bytevectors instead of strings. Then the user can hook up
> some charset conversion library if they want strings (or reconfigure
> their database to emit the encoding used by the Scheme).

Can't we include the charset conversion in the DBI and/or ask the
database to do it by default?

> DB engines are big pieces of software so maybe many of them have charset
> conversion engines built in, and we can pick the encoding we want to
> receive in the connection string (https://www.connectionstrings.com/).

Yes.
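
For instance, PostgreSQL's libpq accepts client_encoding as an
ordinary connection parameter, so something along these lines should
be possible (dbi-connect is a hypothetical name for whatever the SRFI
ends up providing):

  ;; client_encoding is a real libpq connection parameter;
  ;; dbi-connect is hypothetical.
  (define conn
    (dbi-connect "host=localhost dbname=app client_encoding=UTF8"))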

> Auto-conversion is very nice when it works. I suspect the problem domain
> is complex enough that users should be able to configure what
> auto-converter (if any) to use for each data type.

Yeah, but with sensible defaults.
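
Perhaps a per-type converter table, something like this sketch (all
names are hypothetical; parse-iso-8601 stands in for whatever date
parser is available, stubbed out here so the snippet is
self-contained):

  (import (scheme base))

  (define (parse-iso-8601 s) s)       ; stand-in for a real date parser

  ;; A default table mapping database type tags to decoder procedures.
  (define default-decoders
    `((integer   . ,string->number)
      (timestamp . ,parse-iso-8601)
      (text      . ,values)))         ; leave text fields alone

  ;; The user can prune or override it, e.g. turning off the
  ;; timestamp conversion for speed, or because a legacy DB stores
  ;; timestamps in some house format.
  (define my-decoders
    (cons '(timestamp . #f)           ; #f = hand back the raw value
          default-decoders))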

> It can help in
> scenarios that require high performance on big data sets to turn off
> some unused conversions. Or if you're interfacing to a legacy DB that
> has particularly weird data, some standard conversions may not work right.

What IS more interesting is what the default behaviour for unknown types
should be.

In my day job I write code in Go, which talks to a PostgreSQL database
via an ORM called GORM; the sort of thing that looks up the record
types we declare in Go, creates SQL tables for them, and makes a
half-hearted attempt to map to/from them automatically, occasionally
stabbing you in the back (it's notoriously easy to delete all the
records from a table in GORM by accident, but I digress). For any
fields with complicated types in a structure (e.g. maps from strings
to arrays of dates), we tend to ask GORM to just serialise them to
JSON and store them in a string column, rather than trying to
painstakingly create a bunch of other tables with foreign key
relationships to store them in.

In Scheme, what happens if the user tries to store an arbitrary SRFI-9
record value into a database column?

My recommendation would be to either forbid this outright (make the
user map it into something storable themselves) or require a custom
type mapping to be registered with the DBI - see the sketch below.
Trying to be too clever seems to result in unexpected pain.
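
For the "register a custom type mapping" route, it might look
something like this (dbi-register-type-mapping! is a made-up name;
the record is serialised to an s-expression string here, but JSON
would work the same way):

  (import (scheme base) (scheme read) (scheme write))

  (define-record-type point
    (make-point x y)
    point?
    (x point-x)
    (y point-y))

  ;; Hypothetical registration: a predicate for the type, an encoder
  ;; to a storable string, and a decoder back to the record.
  (dbi-register-type-mapping!
   point?
   (lambda (p)                        ; record -> string
     (let ((out (open-output-string)))
       (write (list 'point (point-x p) (point-y p)) out)
       (get-output-string out)))
   (lambda (s)                        ; string -> record
     (let ((datum (read (open-input-string s))))
       (make-point (list-ref datum 1) (list-ref datum 2)))))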

--
Alaric Snell-Pym   (M7KIT)
http://www.snell-pym.org.uk/alaric/