Re: What libraries we need

Re: What libraries we need Peter Bex 07 Apr 2019 09:31 UTC

On Sun, Apr 07, 2019 at 11:55:47AM +0300, Lassi Kortela wrote:
> https://github.com/schemeweb/wiki/wiki/What-libraries-we-need
>
> - Do you find anything missing?

Just a note.  In CHICKEN, we have the uri-generic and uri-common eggs.
The latter builds on top of the former.  The reason for that is that
the URI spec (RFC 3986, not the poor excuse for a spec the W3C is working
on currently) differentiates between reserved characters and regular
ones.  The reserved ones (can) have a special meaning.  For example,
the query string decoding of ampersands, semicolons and equals characters
is something that's handled by the HTML spec, not the URI spec.

So, strictly speaking, according to the URI spec, foo?bar%3Dqux is
possibly different from foo?bar=qux, but it doesn't have to be.  Also,
the path /foo:bar/qux/ is possibly different from /foo%3Abar/qux/.
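
To make the difference concrete, here is a minimal Python sketch (not
CHICKEN code).  It uses the standard library's parse_qsl, which applies
the HTML form-encoding convention on top of the URI syntax: it splits on
"&" and "=" first and only then percent-decodes each piece, so the
encoded and the literal equals sign lead to different results.

```python
from urllib.parse import parse_qsl

# Split on "&" and "=" first, then percent-decode each piece.
encoded = parse_qsl("bar%3Dqux", keep_blank_values=True)
literal = parse_qsl("bar=qux", keep_blank_values=True)

print(encoded)  # [('bar=qux', '')]   one key named "bar=qux", empty value
print(literal)  # [('bar', 'qux')]    key "bar" with value "qux"
```

A parser that decoded the whole query string before splitting would no
longer be able to tell these two inputs apart.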

Of course, in most cases it's most convenient for the user to do full
decoding whenever possible.  The vast majority of users don't care about
the difference and want to treat both /foo:bar/qux/ and /foo%3Abar/qux/
as '(/ "foo:bar" "qux" "").  But if you are writing, say, a web proxy in
Scheme, it will be up to the upstream server how it handles these paths.

In CHICKEN we handle this by having the uri-generic egg parse as much as
it can without losing information.  So /foo%2Fbar/qux/ is decoded to
'(/ "foo/bar" "qux" ""), but in /foo%3Abar/qux/ the encoded chars are
left alone, so the path becomes '(/ "foo%3Abar" "qux" "").  This also means
that a literal percent sign needs to stay encoded as %25.
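
The partial decoding described above can be sketched in Python.  This is
an illustration only, not the uri-generic implementation: the function
names and the keep_encoded set (":" and "%") are assumptions, picked
purely to reproduce the two examples in this message.

```python
import re

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

def decode_segment(segment, keep_encoded=":%"):
    """Decode percent-escapes, except those for characters in keep_encoded."""
    def replace(match):
        code = int(match.group(1), 16)
        # ASCII-only sketch: escapes for non-ASCII bytes are left untouched.
        if code >= 0x80 or chr(code) in keep_encoded:
            return match.group(0)
        return chr(code)
    return _ESCAPE.sub(replace, segment)

def split_path(path, keep_encoded=":%"):
    """Split on literal "/" first, then decode each segment.

    Splitting before decoding means an encoded %2F inside a segment can
    never be mistaken for a path separator.  The leading "" in the result
    stands for the root of an absolute path.
    """
    return [decode_segment(s, keep_encoded) for s in path.split("/")]

print(split_path("/foo%2Fbar/qux/"))  # ['', 'foo/bar', 'qux', '']
print(split_path("/foo%3Abar/qux/"))  # ['', 'foo%3Abar', 'qux', '']
print(decode_segment("100%25"))       # 100%25
```

Because "%" itself is in keep_encoded, a literal percent sign stays as
%25, so the partially decoded segment can still be told apart from one
that contains a real escape sequence.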

Of course this is super-inconvenient, so in uri-common we decode fully
at the expense of losing information.  Most users will use uri-common in
their web code, because you rarely care about these encoded characters.

The situation is somewhat confusing and weird, but it turns out to be a
good compromise, because whenever you need the not-fully-decoded path,
you can access the underlying uri-generic object.  As long as you haven't
manipulated any component, you will get back the original input.
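
The layering can be sketched as a small wrapper that keeps the raw input
next to the fully decoded view.  PathView and its methods are
hypothetical names for this illustration, written in Python; they are
not the uri-common or uri-generic API.

```python
from urllib.parse import quote, unquote

class PathView:
    """Fully decoded view of an absolute path that remembers the raw input."""

    def __init__(self, raw):
        self.raw = raw            # original, not-fully-decoded text
        self._segments = None     # set only once the path is modified

    @property
    def segments(self):
        if self._segments is not None:
            return list(self._segments)
        # Convenient view: split on "/" first, then decode everything.
        return [unquote(s) for s in self.raw.split("/")[1:]]

    def set_segments(self, segments):
        self._segments = list(segments)

    def __str__(self):
        if self._segments is None:
            return self.raw       # untouched: give back the input verbatim
        # Modified: re-encode, which may differ from the original spelling.
        return "/" + "/".join(quote(s, safe="") for s in self._segments)

p = PathView("/foo%3Abar/qux/")
print(p.segments)  # ['foo:bar', 'qux', '']
print(str(p))      # /foo%3Abar/qux/
```

Once set_segments is called, the original spelling is gone and the
string form is rebuilt from the decoded segments, which mirrors the
"as long as you haven't manipulated any component" caveat.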

It would be nice if we could come up with a cleaner API for this.
Regardless, I would recommend using the uri-generic parser
implementation for any reference implementation for a SRFI; it has a
large test suite and is super compliant with the RFC spec; more so than
any other library I've come across in any language.  This is one library
I am extremely proud of being a co-maintainer for.

You can find the implementation in the CHICKEN subversion repo at [1].
You can also browse it online at [2].

Note that there are several alternative implementations using different
parser generators inside the "alternatives" directory.  The main one
still uses "matchable" and the implementation is a bit fiddly (but fast
as hell). There's one in irregex too (which could be easily ported to
SRFI-115) which comes close, performance-wise, and is a lot easier to
understand and maintain.

[1] https://code.call-cc.org/svn/chicken-eggs/release/5/uri-generic/trunk
[2] https://bugs.call-cc.org/browser/project/release/5/uri-generic/trunk

Cheers,
Peter