Re: SRFI 14 test suites manipulating more than Latin1 characters?
Per Bothner 09 Jul 2023 21:15 UTC
On 7/9/23 13:40, Bradley Lucier wrote:
> I've been working on a SRFI 14 implementation manipulating the full Unicode character set, based partially on the data structure described here:
>
> https://mailman.iro.umontreal.ca/pipermail/gambit-list/2023-May/009665.html
>
> I guess I should have asked here first whether there are any existing implementations manipulating the full Unicode set. If so, please enlighten me.
Kawa has had a full-Unicode implementation of SRFI 14 since 2011,
contributed by Jamison Hope.
It uses an "inversion list" data structure which seems to be the same as
what you propose.
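For readers unfamiliar with the structure: an inversion list stores a set as a sorted vector of code points at which membership toggles, so lookup is a single binary search. A minimal sketch in Python (illustrative only, not the Kawa code; the `letters` list is a made-up example):

```python
from bisect import bisect_right

# Hypothetical inversion list: sorted code-point boundaries at which
# membership toggles.  Ranges are half-open, so [65, 91) and [97, 123)
# cover the ASCII letters A-Z and a-z.
letters = [65, 91, 97, 123]

def inv_contains(inv, cp):
    """cp is in the set iff an odd number of boundaries are <= cp,
    i.e. its insertion index in the sorted vector is odd."""
    return bisect_right(inv, cp) % 2 == 1

print(inv_contains(letters, ord('Q')))   # True
print(inv_contains(letters, ord('[')))   # False
```

Lookup cost is O(log n) in the number of ranges, and the representation is as compact as the set is "clumpy", which Unicode property sets generally are.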
There are some tests in testsuite/lib-tests.scm, but they don't seem to go
much beyond ASCII, though they also test consistency.
A possibly better data structure for lookups might be a "Unicode trie":

    The International Components for Unicode (ICU) project came up with
    a data structure based on a Trie that provides fast access to
    Unicode metadata. The range data is precompiled to a serialized and
    flattened trie, which is then used at runtime to lookup the
    necessary data. According to my own tests, this is generally at
    least 50% faster than binary search, with not too much additional
    memory required.

Quoted from https://github.com/foliojs/unicode-trie by Devon Govett, who did a port to JavaScript.
I made my own changes (https://github.com/PerBothner/unicode-properties), focused on what I need
for DomTerm: detecting wide characters and grapheme-cluster boundaries. However, the basic
data structure and algorithms are easily adapted to other character properties:
Basically, at build time src/generate.js constructs an array (with 32 bits per character)
from whatever Unicode property tables you are interested in, and then compiles that array
into a compact trie, which is used at run time.
Note that a single trie can encode 32 bits' worth of properties, so it would probably
use less space than separate tables. Converting the JavaScript to Scheme is left as an exercise for the reader.
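The core trick is easy to sketch. The following Python is illustrative only (it is not ICU's or unicode-trie's actual format; the block size and the `is_digit` property are invented for the example): split the code point into a block index and an offset, and share identical blocks in one flattened data array:

```python
BLOCK_BITS = 6                 # 64 code points per block (made-up choice)
BLOCK_SIZE = 1 << BLOCK_BITS

def build_trie(prop):
    """prop: function mapping a code point to a 32-bit property word.
    Returns (index, data): index maps a block number to its offset in
    data; identical blocks are stored only once."""
    index, data, seen = [], [], {}
    for block_start in range(0, 0x110000, BLOCK_SIZE):
        block = tuple(prop(cp) for cp in
                      range(block_start, block_start + BLOCK_SIZE))
        if block not in seen:           # share duplicate blocks
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])
    return index, data

def trie_get(index, data, cp):
    """Two array indexings: high bits pick the block, low bits the slot."""
    return data[index[cp >> BLOCK_BITS] + (cp & (BLOCK_SIZE - 1))]

# Example property: "is an ASCII digit".
is_digit = lambda cp: 1 if 0x30 <= cp <= 0x39 else 0
index, data = build_trie(is_digit)
print(trie_get(index, data, ord('7')))   # 1
```

Because almost all blocks are identical (here, all-zero), the shared data array stays tiny while lookup is two array indexings with no search at all.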
However, while a Unicode trie is likely to be more efficient for looking up character properties,
it is likely to be awkward for SRFI 14 set operations (char-set-union, char-set-intersection,
and the like). So using the Kawa inversion-list code probably makes the most sense, unless someone
wants a data-structure challenge.
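To see why the set operations are natural on inversion lists: union (and likewise intersection) is a linear merge of the two boundary vectors, emitting a boundary wherever the combined membership toggles. An illustrative Python sketch (not the Kawa implementation):

```python
def inv_union(a, b):
    """Union of two inversion lists.  Sweep the merged boundaries in
    order; after consuming ia boundaries from a we are inside a iff
    ia is odd, and likewise for b.  Emit a boundary whenever the
    union's membership changes."""
    out, ia, ib = [], 0, 0
    while ia < len(a) or ib < len(b):
        x = min(a[ia] if ia < len(a) else float('inf'),
                b[ib] if ib < len(b) else float('inf'))
        if ia < len(a) and a[ia] == x: ia += 1
        if ib < len(b) and b[ib] == x: ib += 1
        inside = (ia % 2 == 1) or (ib % 2 == 1)
        if inside != (len(out) % 2 == 1):
            out.append(x)
    return out

# A-Z union a-z keeps two ranges; overlapping ranges coalesce.
print(inv_union([65, 91], [97, 123]))   # [65, 91, 97, 123]
print(inv_union([65, 100], [90, 120]))  # [65, 120]
```

The same sweep with `and` instead of `or` gives intersection; a trie, by contrast, would have to be rebuilt or walked over its whole domain for every such operation.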
--
--Per Bothner
xxxxxx@bothner.com http://per.bothner.com/