I define the indexation factor as the space disk overhead
that is added to the original data to allow to query "any" pattern
in the sense of nstore-from and nstore-where.
As of right now, the sample implementation rely on a result
from 1950 that reduce the indexation factor from the total
number of n permutations of (1, ..., n) which is in the case of
n = 6 [0] equal to 720. The result from [1] reduce 720 to 20,
but it is still the central binomial coefficient [2].
[0] This is the case of a versioned triple store
In practice, when n = 6, it means that given a data set of size X,
it requires X * 20 to host it in the nstore. For n = 3, indexation factor
is only 3. I consider this is still an awful lot of disk space especially
when you consider wikidata that is around 1TB and that is considered
in the field to be a small dataset.
The question is: how to reduce that indexation factor further?
tl;dr: use storage backend compression.
a) one indirection: hash tuple items using the secure (as of right now) sha256
that is 32 bytes and store in nstore the hash.
b) two indirections: hash tuple items using sha256, index them using an UUID
of 16 bytes and store in nstore the UUID.
c) Use an unique identifier allocator like [3]
Both a) and b) could possibly work if and only if most tuple items are much bigger
that 32 bytes (respectively 16 bytes), which I believe is not the case of wikidata.
And c) could work but okvs does not expose a "add-conflict-range-key" procedure,
that AFAIK can not reliably implemented with existing okvs libraries. Also as of
right now the allocator in foundationdb is meant to be used on "metadata" kind
of things, namely the directory layer. What I mean by that, is that even if you can
easily create "directories" on-the-fly, it is not meant to be used on every single value.
All the three "solutions" stated above, would break the following specified behaviour:
Note: In the case where there is single variable in PATTERN
the generator must be ordered.
Because there is no hash or unique identifier scheme that can possibly keep
the natural order of scheme objects in the general case. And fstore does not
support that. So, I am willing to drop that note from the specification.
I added that note, to allow to version space filling curves and allow "spatial
time travelling queries" and in general version the indexes with the original
data.
To explain it otherwise:
- nstore adopts a index-almost-all-the-things approach,
- nstore is good for a small set of data,
- nstore is good when you have lots of disk space and
that you don't know in advance which query will be useful,
as it is the case of wikidata.
Another solutions would be to avoid "index-almost-all-the-things"
and therefore reduce generality and restrict the kind queries that
can be executed. That is what datomic [4] (and eva [5]) does.
That is more work and that is not in the scope of this SRFI.
Current version of fstore code can be found at:
PS: There are compression algorithms available in some existing OKVS libraries.