Re: URI/URL handling - Simplelists

Re: URI/URL handling Peter Bex 07 Apr 2019 10:56 UTC

On Sun, Apr 07, 2019 at 01:11:45PM +0300, Lassi Kortela wrote:
> Thanks for the great comments Peter! I for one love working with people who
> care about getting things right at this level of detail.

This is what I like about the Scheme community; we care about getting
things right :)

> > It would be nice if we can come up with cleaner API for this.
>
> In the archive file interface, I do this:
>
>     (archive-entry-path entry)     => safe normalized pathname as list
>     (archive-entry-raw-path entry) => raw unsafe pathname as bytevector
>
> I've generally had good experiences with this kind of API. I.e. the procedure
> with the short and obvious name returns the thing people usually want, and
> there's a separate procedure to get the raw/unsafe/complex version.
>
> We could have something like:
>
>     (uri-path     "/foo%3Abar/qux/") => (/ "foo:bar"   "qux")
>     (uri-raw-path "/foo%3Abar/qux/") => (/ "foo%3Abar" "qux")

I think this will work.  If you update the path, it will clobber the raw
path, presumably?  Or should the code try hard to maintain components
that weren't changed?

In uri-common, (update-uri uri path: '(/ "foo:bar" "mooh")) will cause
the raw path to always be "/foo%3Abar/mooh", even if it was originally
/foo:bar/qux (because the colon is or MAY BE treated as special by a
receiving server, and we don't want it to be treated specially).  If we tried hard
we could detect that the prefix is unchanged (after normalization) and
not touch it, but I think that's probably too much magic.
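
To make the clobbering behaviour concrete, here is a minimal sketch in
Python (not uri-common's code; the class SketchUri and its method names
are hypothetical, and it assumes an absolute path).  Updating the path
re-encodes every segment from scratch, so the original raw spelling is
lost even for segments that did not change:

```python
from urllib.parse import quote, unquote


class SketchUri:
    """Hypothetical URI path holder; not uri-common's actual data structure."""

    def __init__(self, raw_path):
        self.raw_path = raw_path  # exactly as received, e.g. "/foo:bar/qux"

    @property
    def path(self):
        # Split on literal slashes first, then decode each segment, so an
        # encoded slash (%2F) stays inside its segment.
        return [unquote(seg) for seg in self.raw_path.split("/")[1:]]

    def update_path(self, segments):
        # Re-encode every segment (safe="" also encodes ":" and "/"), so the
        # old raw spelling is discarded even for unchanged segments.
        raw = "/" + "/".join(quote(seg, safe="") for seg in segments)
        return SketchUri(raw)


uri = SketchUri("/foo:bar/qux")
updated = uri.update_path(["foo:bar", "mooh"])
print(updated.raw_path)  # /foo%3Abar/mooh
```

An override for servers that insist on raw characters could be as
simple as letting the caller pass a different "safe" set to the encoder.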

Ideally there's a way to override this, because there are some servers
out there which don't allow percent-encoded characters everywhere and
insist on having the raw characters, even if those are not treated
specially.

Also note that encoding of query strings is a whole topic unto itself.
The W3C recommendation (in the HTML spec!) is that & is no longer used
to separate query arguments.  Instead, servers should use ;.  The reasons
behind that are pretty inane (because apparently for many people it's too
hard to get the &amp; encoding right inside HTML), but the reality is
that now many servers accept both & and ;, some still only accept & and
there are probably servers that only accept ; too.  Search the uri-common
code [1] for "application/x-www-form-urlencoded" for the gory details.

In any case, uri-common opts to default to accepting both, but emitting
semicolons by default.  However, this _must_ be overridable, because like
I said, there are servers in the wild that don't accept semicolon-
separated query strings.  It's a total shit show.  This is more of a
client issue than a server issue, but a generic URI handling library
needs to take it into account.
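
As an illustration of "accept both, emit one, make it overridable",
here is a small Python sketch.  The function names parse_query and
build_query are made up for this example and are not uri-common's API;
only the standard library's quote_plus/unquote_plus are real:

```python
import re
from urllib.parse import quote_plus, unquote_plus


def parse_query(query):
    """Split on either '&' or ';' and percent-decode each name and value."""
    pairs = []
    for field in re.split(r"[&;]", query):
        if not field:
            continue  # skip empty fields such as in "a=1&&b=2"
        name, _, value = field.partition("=")
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs


def build_query(pairs, separator=";"):
    """Join encoded pairs; the separator is overridable for '&'-only servers."""
    return separator.join(
        f"{quote_plus(name)}={quote_plus(value)}" for name, value in pairs
    )


pairs = parse_query("a=1&b=two%20words;c=x%3By")
print(pairs)                    # [('a', '1'), ('b', 'two words'), ('c', 'x;y')]
print(build_query(pairs))       # a=1;b=two+words;c=x%3By
print(build_query(pairs, "&"))  # a=1&b=two+words&c=x%3By
```

Splitting happens before decoding, so an encoded %3B inside a value is
not mistaken for a separator.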

> By the way, what about paths that contain more than one consecutive slash:
> e.g. (uri-path "///")? And relative paths that don't start with a slash?
> What happens when a URI path contains a backslash?

Note that in uri-generic, we encode /foo/bar/ as '(/ "foo" "bar" "").
The empty string at the end indicates the trailing slash.  This makes a
difference when running the relative uri resolution algorithm, so it is
important to keep it.
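
The effect of the trailing slash on relative resolution is easy to see
with any RFC 3986 style resolver.  The sketch below uses Python's
standard urljoin purely as a stand-in for uri-generic's resolver; the
helper split_path and the example.com URLs are made up for the example:

```python
from urllib.parse import urljoin


def split_path(path):
    """Split an absolute path into segments, keeping empty ones."""
    return path.split("/")[1:]


print(split_path("/foo/bar/"))  # ['foo', 'bar', '']
print(split_path("/foo/bar"))   # ['foo', 'bar']

with_slash = urljoin("http://example.com/foo/bar/", "baz")
without_slash = urljoin("http://example.com/foo/bar", "baz")
print(with_slash)     # http://example.com/foo/bar/baz
print(without_slash)  # http://example.com/foo/baz
```

Without the final empty segment the two base paths would be
indistinguishable, and one of these two results would be wrong.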

Something like "///" is also kept as a path consisting of three empty
components in uri-generic.  Backslashes are not treated specially (but
on Windows, you have to take care when converting such components to
file system paths.  You have to do this for drive letters too, of course,
and in UNIX you also have to remember you can have %2F-encoded slashes
inside path components, which we keep.  Therefore, I think this is an
orthogonal problem which could potentially be solved by a file system
path library; a different SRFI altogether).
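
To show why the conversion to file system paths needs its own care,
here is a hedged Python sketch.  The function name and its list of
checks are assumptions for illustration only; the checks are not
complete (drive letters, for instance, are not handled):

```python
from urllib.parse import unquote


def segments_to_relative_file_path(raw_path):
    """Decode URI path segments and refuse any that are unsafe as file names."""
    segments = [unquote(seg) for seg in raw_path.split("/") if seg]
    for seg in segments:
        # After decoding, a segment may contain a slash (from %2F) or a
        # backslash, either of which could escape the intended directory.
        if "/" in seg or "\\" in seg or seg in (".", ".."):
            raise ValueError(f"unsafe path segment: {seg!r}")
    return "/".join(segments)


print(segments_to_relative_file_path("/docs/read%20me.txt"))  # docs/read me.txt
```

A path such as "/a%2Fb/c" raises ValueError here, because the decoded
segment "a/b" is one URI component but would become two directory
levels on disk.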

For consecutive slashes, one could imagine a normalisation procedure
which can optionally be called by the user.  But I don't think this
should be done automatically, because one could have a path containing
some identifier or tag, like /post/foo/edit, and it's not inconceivable
that the path /post//edit would edit the post identified by the empty
string.
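
Such an opt-in normalisation step could look roughly like the Python
sketch below.  The name collapse_empty_segments is hypothetical, and
the choice to keep a final empty segment (the trailing slash marker) is
an assumption of this example:

```python
def collapse_empty_segments(segments):
    """Drop empty segments except a final one, which marks a trailing slash."""
    last = len(segments) - 1
    return [seg for i, seg in enumerate(segments) if seg or i == last]


print(collapse_empty_segments(["post", "", "edit"]))    # ['post', 'edit']
print(collapse_empty_segments(["foo", "", "bar", ""]))  # ['foo', 'bar', '']
```

Because the caller has to invoke it explicitly, an application that
gives /post//edit a meaning of its own simply never calls it.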

> From your description, it sounds like you did exactly the right thing on all
> counts.

We really try!

> > Note that there are several alternative implementations using different
> > parser generators inside the "alternatives" directory.  The main one
> > still uses "matchable" and the implementation is a bit fiddly (but fast
> > as hell). There's one in irregex too (which could be easily ported to
> > SRFI-115) which comes close, performance-wise, and is a lot easier to
> > understand and maintain.
>
> Could we specify a common interface for these implementations (or do they
> already have the same interface)? This means they can also share the same
> test suite, which ensures they are interchangeable (except for speed and
> compatibility).

Indeed, they already have the same test suite and interface.  I started
out with the alternatives because I did not like the current
implementation due to it being hard to maintain (it's a direct port of a
Haskell library which relies heavily on pattern matching/list
destructuring which doesn't work so well in Scheme).  Then, I got hooked
on trying out all the parser generators we have in CHICKEN for their
convenience and performance.  I wrote up some notes about those too, but
in the end I think I prefer the irregex-based one, which is a good
tradeoff between the two.

> The request abstraction could be specified so that it just gets the raw URL
> as a string from the HTTP server. Then the application could parse it before
> passing it on to the router/dispatcher (or the r/d could call the library to
> parse it). But is it more convenient if the request object already contains
> the parsed URL? Do e.g. Apache or Nginx modules get pre-parsed URLs from
> those web servers? In that case it would probably not make sense to parse it
> again ourselves.

Most servers do _not_ parse anything.  IMO this causes problems because
different components in the stack have to implement their own parsing,
and of course every component does it differently, leading to
inconsistencies which may be exploitable as security problems, and it
also seems to me to be less performant, because different components
have to be re-parsing all the time.

[1] http://bugs.call-cc.org/browser/release/5/uri-common/trunk/uri-common.scm#L237

Cheers,
Peter