Re: SRFI 14 test suites manipulating more than Latin1 characters?
Per Bothner 09 Jul 2023 21:15 UTC
On 7/9/23 13:40, Bradley Lucier wrote:
> I've been working on a SRFI 14 implementation manipulating the full Unicode character set, based partially on the data structure described here:
>
> https://mailman.iro.umontreal.ca/pipermail/gambit-list/2023-May/009665.html
>
> I guess I should have asked here first whether there are any existing implementations manipulating the full Unicode set. If so, please enlighten me.
Kawa has had a full-Unicode implementation of SRFI 14 since 2011,
contributed by Jamison Hope.
It uses an "inversion list" data structure which seems to be the same as
what you propose.
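For readers unfamiliar with the structure: an inversion list stores a set as a sorted vector of code points at which membership toggles, so lookup is a single binary search. A minimal sketch in Python (illustrative only, not the Kawa code; the `letters` list is a made-up example):

```python
from bisect import bisect_right

# Hypothetical inversion list: sorted code-point boundaries at which
# membership toggles.  Ranges are half-open, so [65, 91) and [97, 123)
# cover the ASCII letters A-Z and a-z.
letters = [65, 91, 97, 123]

def inv_contains(inv, cp):
    """cp is in the set iff an odd number of boundaries are <= cp,
    i.e. its insertion index in the sorted vector is odd."""
    return bisect_right(inv, cp) % 2 == 1

print(inv_contains(letters, ord('Q')))   # True
print(inv_contains(letters, ord('[')))   # False
```

Lookup cost is O(log n) in the number of ranges, and the representation is as compact as the set is "clumpy", which Unicode property sets generally are.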
There are some tests in testsuite/lib-tests.scm, but they don't seem to go
much beyond ASCII, though they also test consistency.
A possibly better data structure for lookups might be a "Unicode trie":

    The International Components for Unicode (ICU) project came up with
    a data structure based on a Trie that provides fast access to
    Unicode metadata. The range data is precompiled to a serialized and
    flattened trie, which is then used at runtime to lookup the
    necessary data. According to my own tests, this is generally at
    least 50% faster than binary search, with not too much additional
    memory required.

Quoted from https://github.com/foliojs/unicode-trie by Devon Govett, who did a port to JavaScript.
I made my own changes (https://github.com/PerBothner/unicode-properties), focused on what I need
for DomTerm: detecting wide characters and grapheme-cluster boundaries. However, the basic
data structure and algorithms are easily adapted to other character properties:
Basically, at build time src/generate.js constructs an array (with 32 bits per character)
from whatever Unicode property tables you are interested in, and then compiles that array
into a compact trie, which is used at run time.
Note that a single trie can encode 32 bits' worth of properties, so it would probably
use less space than separate tables. Converting the JavaScript to Scheme is left as an exercise for the reader.
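The core trick is easy to sketch. The following Python is illustrative only (it is not ICU's or unicode-trie's actual format; the block size and the `is_digit` property are invented for the example): split the code point into a block index and an offset, and share identical blocks in one flattened data array:

```python
BLOCK_BITS = 6                 # 64 code points per block (made-up choice)
BLOCK_SIZE = 1 << BLOCK_BITS

def build_trie(prop):
    """prop: function mapping a code point to a 32-bit property word.
    Returns (index, data): index maps a block number to its offset in
    data; identical blocks are stored only once."""
    index, data, seen = [], [], {}
    for block_start in range(0, 0x110000, BLOCK_SIZE):
        block = tuple(prop(cp) for cp in
                      range(block_start, block_start + BLOCK_SIZE))
        if block not in seen:           # share duplicate blocks
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])
    return index, data

def trie_get(index, data, cp):
    """Two array indexings: high bits pick the block, low bits the slot."""
    return data[index[cp >> BLOCK_BITS] + (cp & (BLOCK_SIZE - 1))]

# Example property: "is an ASCII digit".
is_digit = lambda cp: 1 if 0x30 <= cp <= 0x39 else 0
index, data = build_trie(is_digit)
print(trie_get(index, data, ord('7')))   # 1
```

Because almost all blocks are identical (here, all-zero), the shared data array stays tiny while lookup is two array indexings with no search at all.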
However, while a Unicode trie is likely to be more efficient for looking up character properties,
it is likely to be awkward for SRFI 14 set operations (char-set-union, char-set-intersection,
and the like). So using the Kawa inversion-list code probably makes the most sense, unless someone
wants a data-structure challenge.
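To see why the set operations are natural on inversion lists: union (and likewise intersection) is a linear merge of the two boundary vectors, emitting a boundary wherever the combined membership toggles. An illustrative Python sketch (not the Kawa implementation):

```python
def inv_union(a, b):
    """Union of two inversion lists.  Sweep the merged boundaries in
    order; after consuming ia boundaries from a we are inside a iff
    ia is odd, and likewise for b.  Emit a boundary whenever the
    union's membership changes."""
    out, ia, ib = [], 0, 0
    while ia < len(a) or ib < len(b):
        x = min(a[ia] if ia < len(a) else float('inf'),
                b[ib] if ib < len(b) else float('inf'))
        if ia < len(a) and a[ia] == x: ia += 1
        if ib < len(b) and b[ib] == x: ib += 1
        inside = (ia % 2 == 1) or (ib % 2 == 1)
        if inside != (len(out) % 2 == 1):
            out.append(x)
    return out

# A-Z union a-z keeps two ranges; overlapping ranges coalesce.
print(inv_union([65, 91], [97, 123]))   # [65, 91, 97, 123]
print(inv_union([65, 100], [90, 120]))  # [65, 120]
```

The same sweep with `and` instead of `or` gives intersection; a trie, by contrast, would have to be rebuilt or walked over its whole domain for every such operation.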
--
--Per Bothner
xxxxxx@bothner.com http://per.bothner.com/