On 30/09/2019 16:09, Lassi Kortela wrote: >> databases tend to be set to one encoding > > Yes, but the percentage of data in a large database that actually uses > the claimed encoding is another matter :) I've dealt with huge databases > where there were continual character encoding problems. Likewise, people > who administer long-lived web forums can be heard complaining about it. This (wrongly-encoded text) is a bigger problem than just in databases (it happens in files on disk and things fetched via HTTP and friends, too), so I wonder if we should just approach it by: 1) Assuming that the supposed encoding is correct (whether it's hardcoded, configured, obtained from the DB, overridden per-column, etc depending on the driver) for getting strings to/from the DB, for normal use. 2) Also providing an override to read/write strings as bytevectors, for advanced use. I did wonder about providing some kind of library to "fix wrongly encoded strings" - eg, "this string was read from a byte vector as ISO-8859-1 but I now think it's actually UTF-8 because when I looked at it it looks like £ where I expect a pound sign", but in hindsight, that shouldn't be encouraged - some incorrect decodings might lose information so it might not be possible in general, and it would be far better to encourage keeping the byte strings around for any strings of unknown encoding, and re-decode them. >> but SQLite3 is more flexible. Suggestions on precisely what to do here >> are solicited, this is not a thing I know very well. > > If the DB engine encoding is known, and we're running in a Scheme > implementations like Gauche that supports many character encodings, we > can construct strings with the correct encoding right off the bat. Yeah! That's probably the easiest thing for users. As far as possible, create the illusion that strings are strings of characters, and their representation in bytes is an implementation detail. > In Schemes that use only one character encoding internally, and the > query results use a different encoding, we should probably return the > results as bytevectors instead of strings. Then the user can hook up > some charset conversion library if they want strings (or reconfigure > their database to emit the encoding used by the Scheme). Can't we include the charset conversion in the DBI and/or ask the database to do it by default? > DB engines are big pieces of software so maybe many of them have charset > conversion engines built in, and we can pick the encoding we want to > receive in the connection string (https://www.connectionstrings.com/). Yes. > Auto-conversion is very nice when it works. I suspect the problem domain > is complex enough that users should be able to configure what > auto-converter (if any) to use for each data type. Yeah, but with sensible defaults. > It can help in > scenarios that require high performance on big data sets to turn off > some unused conversions. Or if you're interfacing to a legacy DB that > has particularly weird data, some standard conversions may not work right. What IS more interesting is what the default behaviour for unknown types should be. In my dayjob, I write code in Go, which talks to a postgresql database via an ORM called GORM; the sort of thing that looks up the record types we declare in Go and creates SQL tables for them and makes a half-hearted attempt to automatically mapping to/from, occasionally stabbing you in the back (it's notoriously easy to accidentally delete all the records from a table in GORM by accident, but I digress). For any complicated-typed fields in a structure (eg, maps from strings to arrays of dates), we tend to ask GORM to just serialise them to JSON and store them in a strong column rather than trying to painstakingly create a bunch of other tables with foreign key relationships to store them in. In Scheme, what happens if the user tries to store an arbitrary SRFI-9 record value into a database column? My recommendation would be to forbid this outright (make them map it into something storable themselves) or register a custom type mapping with the DBI. Trying to be too clever seems to result in unexpected pain. -- Alaric Snell-Pym (M7KIT) http://www.snell-pym.org.uk/alaric/