I define the indexation factor as the space disk overhead

that is added to the original data to allow to query "any" pattern

in the sense of nstore-from and nstore-where.

As of right now, the sample implementation rely on a result

from 1950 that reduce the indexation factor from the total

number of n permutations of (1, ..., n) which is in the case of

n = 6 [0] equal to 720. The result from [1] reduce 720 to 20,

but it is still the central binomial coefficient [2].

[0] This is the case of a versioned triple store

[1] https://math.stackexchange.com/a/3146674/23663

[2] https://en.wikipedia.org/wiki/Central_binomial_coefficient

In practice, when n = 6, it means that given a data set of size X,

it requires X * 20 to host it in the nstore. For n = 3, indexation factor

is only 3. I consider this is still an awful lot of disk space especially

when you consider wikidata that is around 1TB and that is considered

in the field to be a small dataset.

The question is: how to reduce that indexation factor further?

tl;dr: use storage backend compression.

a) one indirection: hash tuple items using the secure (as of right now) sha256

that is 32 bytes and store in nstore the hash.

b) two indirections: hash tuple items using sha256, index them using an UUID

of 16 bytes and store in nstore the UUID.

c) Use an unique identifier allocator like [3]

[3] https://ananthakumaran.in/2018/08/05/high-contention-allocator.html

Both a) and b) could possibly work if and only if most tuple items are much bigger

that 32 bytes (respectively 16 bytes), which I believe is not the case of wikidata.

And c) could work but okvs does not expose a "add-conflict-range-key" procedure,

that AFAIK can not reliably implemented with existing okvs libraries. Also as of

right now the allocator in foundationdb is meant to be used on "metadata" kind

of things, namely the directory layer. What I mean by that, is that even if you can

easily create "directories" on-the-fly, it is not meant to be used on every single value.

All the three "solutions" stated above, would break the following specified behaviour:

Note: In the case where there is single variable in PATTERN the generator must be ordered.

Because there is no hash or unique identifier scheme that can possibly keep

the natural order of scheme objects in the general case. And fstore does not

support that. So, I am willing to drop that note from the specification.

I added that note, to allow to version space filling curves and allow "spatial

time travelling queries" and in general version the indexes with the original

data.

To explain it otherwise:

- nstore adopts a index-almost-all-the-things approach,

- nstore is good for a small set of data,

- nstore is good when you have lots of disk space and

that you don't know in advance which query will be useful,

as it is the case of wikidata.

Another solutions would be to avoid "index-almost-all-the-things"

and therefore reduce generality and restrict the kind queries that

can be executed. That is what datomic [4] (and eva [5]) does.

That is more work and that is not in the scope of this SRFI.

[4] https://docs.datomic.com/cloud/query/raw-index-access.html

[5] https://github.com/Workiva/eva

Current version of fstore code can be found at:

https://git.sr.ht/~amz3/chez-scheme-arew/tree/master/src/arew/data/base/fstore.scm

PS: There are compression algorithms available in some existing OKVS libraries.

Amirouche ~ amz3 ~ https://hyper.dev