Re: make json-stream-read a fold-like operation? Amirouche Boubekki 21 Jan 2020 15:38 UTC

On Tue, 21 Jan 2020 at 13:52, Duy Nguyen <xxxxxx@gmail.com> wrote:
>
> On Tue, Jan 21, 2020 at 5:40 PM Amirouche Boubekki
> <xxxxxx@gmail.com> wrote:
> >
> > On Mon, 20 Jan 2020 at 15:18, John Cowan <xxxxxx@ccil.org> wrote:
> > >
> > > LGTM
> > >
> > > On Mon, Jan 20, 2020 at 4:48 AM Duy Nguyen <xxxxxx@gmail.com> wrote:
> > >>
> > >> Is it possible to make 'proc' in json-stream-read take a third, opaque
> > >> object, and pass proc's result to the next 'proc' call?
> > >> json-stream-read returns the result of the last 'proc' call.
> > >>
> > >> I think by chaining these proc calls together, json-stream-read user
> > >> can pass parsing state along and can even avoid mutable states if they
> > >> want to. I haven't looked at the implementation though so I don't know
> > >> how hard to do it.
> >
> > That occurred to me but I am wondering whether it would not be better
> > to make json-stream-read (possibly with a new name) return a
> > generator.
>
> I haven't used generators a lot (at least not in Scheme) so I can't
> really contribute anything here. With John's suggestion to go with
> generators in other parts of the srfi, I guess we might as well do
> generators here :)

With json-stream-read returning a generator, what you are asking can
be written as:

  (generator-fold PROC SEED (json-stream-read PORT))

ref: https://srfi.schemers.org/srfi-158/srfi-158.html
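
As a minimal sketch of how state is threaded through such a fold: the
event names below and the list->generator stand-in for
(json-stream-read PORT) are assumptions made for illustration, since
the shape of the events is not settled here. Only generator-fold and
list->generator come from SRFI 158.

  (import (scheme base) (srfi 158))

  ;; Hypothetical stand-in for (json-stream-read PORT).
  (define events
    (list->generator '(array-start 1 2 3 array-end)))

  ;; PROC receives the next event and the state returned by the
  ;; previous call, like kons in SRFI 1 fold.
  (define (count-numbers event count)
    (if (number? event) (+ count 1) count))

  (generator-fold count-numbers 0 events) ; => 3

No mutable state is needed: the count only lives in the value passed
from one call of PROC to the next.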

I do not mean that a fold-like procedure does not have its place in
the specification.  I do not know how to create a good fold-like
procedure.  GNU Guile has something in this spirit for XML, but I
never figured out how it works [0].

[0] https://www.gnu.org/software/guile/manual/html_node/SSAX.html

Another thing we could look at is something like JSONSlicer [1]. It
would look like the following:

  (json-read-slice selector port) -> generator

Where SELECTOR is some kind of JSON selector (somewhat like CSS
selectors). The generator would yield full Scheme objects like
json-read does, but only for the subset described by SELECTOR.  It
would help in cases where you want only some parts of a big JSON text.

That is the case with the wikidata JSON dumps, which are valid JSON
with a top-level JSON array; every array item is written on a single
line ending with a comma and a newline.  So, the way I used to parse
it was to ignore the first line, a single open bracket (and ignore the
last line!!), then repeatedly call read-line, drop the trailing comma,
and parse what remains of the line as JSON text.  It is only a problem
because the file is a single JSON text instead of JSON lines, and
because the file is very big, several gigabytes.
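
The line-by-line workaround described above can be sketched roughly as
follows. The names for-each-dump-item and strip-trailing-comma are
made up for this example, and PARSE is whatever procedure turns a
string of JSON text into a Scheme object (an assumption, no particular
parser is implied). Instead of counting lines, the sketch skips any
line that is exactly "[" or "]".

  (import (scheme base))

  ;; Remove one trailing comma from LINE, if there is one.
  (define (strip-trailing-comma line)
    (let ((n (string-length line)))
      (if (and (> n 0) (char=? (string-ref line (- n 1)) #\,))
          (substring line 0 (- n 1))
          line)))

  ;; Call (PROC item) for every array item read from PORT.
  (define (for-each-dump-item proc parse port)
    (let loop ((line (read-line port)))
      (unless (eof-object? line)
        (unless (or (string=? line "[") (string=? line "]"))
          (proc (parse (strip-trailing-comma line))))
        (loop (read-line port)))))

Only one line is held in memory at a time, which is why this works
even though the whole file does not fit in memory.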

In the case of the wikidata JSON dump, the procedure call would look
something like:

  (json-read-slice '(*) port)

Where * means every item of an array.  If one wants only the English
labels of all the concepts, it would look something like:

  (json-read-slice '(* labels english) port)

Otherwise, if one wants the label of the item indexed 42 in all languages:

  (json-read-slice '(42 labels) port)
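
To make the intended meaning of SELECTOR concrete, here is a small
sketch that applies a selector to an object that is already in memory.
It is not the streaming procedure proposed above, only an illustration
of the matching rules. The name select is hypothetical, and the sketch
assumes JSON arrays are vectors and JSON objects are association lists
with symbol keys.

  (import (scheme base))

  ;; Return the list of sub-objects of OBJ matched by SELECTOR.
  (define (select selector obj)
    (if (null? selector)
        (list obj)
        (let ((step (car selector)) (rest (cdr selector)))
          (cond
            ;; * matches every item of an array
            ((and (eq? step '*) (vector? obj))
             (apply append
                    (map (lambda (item) (select rest item))
                         (vector->list obj))))
            ;; an integer picks one item of an array
            ((and (exact-integer? step) (vector? obj)
                  (< -1 step (vector-length obj)))
             (select rest (vector-ref obj step)))
            ;; a symbol picks one member of an object
            ((and (symbol? step) (pair? obj) (assq step obj))
             => (lambda (member) (select rest (cdr member))))
            ;; anything else matches nothing
            (else '())))))

  (select '(* labels english)
          (vector '((labels . ((english . "Earth") (french . "Terre"))))
                  '((labels . ((english . "Moon"))))))
  ;; => ("Earth" "Moon")

A streaming version would have to yield each match as soon as it is
complete instead of building the whole document first.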

What do you think about this slicer thing? I think it helps in the
case of bigger-than-memory JSON text, but the wikidata use case is the
only one I know. jsonslicer is not very popular on GitHub.  I have not
mentioned sxpath so far, but the above SELECTOR argument looks like
sxpath queries.

[1] https://github.com/AMDmi3/jsonslicer