
Re: URI/URL handling Lassi Kortela 07 Apr 2019 12:02 UTC

>> We could have something like:
>>      (uri-path     "/foo%3Abar/qux/") => (/ "foo:bar"   "qux")
>>      (uri-raw-path "/foo%3Abar/qux/") => (/ "foo%3Abar" "qux")
> I think this will work.  If you update the path, it will clobber the raw
> path, presumably?  Or should the code try hard to maintain components
> that weren't changed?

My intuition would suggest immutable URL objects. Things are much
simpler when transformations can go one way only (from generic URI to
common URI) and not the other way around. If it's a two-way street then
we have to make sure every kind of URL is correctly round-tripped. I
wouldn't do all that work unless it's specifically needed for some use case.

So users would build URIs from components by calling something like this:

     (make-generic-uri ...)
     (make-http-uri ...)
     (make-ftp-uri ...)
     (make-mailto-uri ...)

And those procedures would check that the caller gave syntactically
correct / unambiguously encodable URI components. The `make-generic-uri`
procedure would return a generic URI object, the others would return a
common URI object (from which you could still access the generic
versions of things via special accessor procedures).
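To make the checking concrete, here is a rough sketch of what one such
constructor might look like. The record shape, field names and
validation rules are all assumptions for illustration, not an existing
egg's API:

```scheme
;; Illustrative sketch only.  Because the record has no mutators,
;; every <http-uri> that exists has passed validation once and stays
;; valid -- the one-way-street property described above.
(define-record-type <http-uri>
  (%make-http-uri host port path query)
  http-uri?
  (host http-uri-host)
  (port http-uri-port)
  (path http-uri-path)
  (query http-uri-query))

(define (make-http-uri host port path query)
  ;; Reject components up front so they are unambiguously encodable.
  (unless (and (string? host) (> (string-length host) 0))
    (error "host must be a non-empty string" host))
  (unless (and (exact-integer? port) (<= 1 port 65535))
    (error "port must be an integer in 1..65535" port))
  (%make-http-uri host port path query))
```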

> In uri-common, (uri-update uri path: '(/ "foo:bar" "mooh")) will cause
> the raw path to always be "/foo%3Abar/mooh", even if it was originally
> /foo:bar/qux (because the colon is or MAY BE treated as special by a receiving
> server, and we don't want it to be treated specially).  If we tried hard
> we could detect that the prefix is unchanged (after normalization) and
> not touch it, but I think that's probably too much magic.
> Ideally there's a way to override this, because there are some servers
> out there which don't allow percent-encoded characters everywhere and
> insist on having the raw characters, even if those are not treated
> specially.

Great :D We should probably specify a conservative normalization form in
which differences like this don't matter...
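One conservative normalization step (RFC 3986, section 6.2.2) decodes
percent-escapes only for unreserved characters and uppercases the hex
digits of the escapes it keeps, so "%7efoo" and "~foo" normalize to the
same string while reserved escapes such as %3A stay encoded. A rough
sketch, with illustrative names:

```scheme
;; ASCII letters, digits and -._~ are "unreserved" per RFC 3986.
(define (unreserved? c)
  (or (char<=? #\a c #\z)
      (char<=? #\A c #\Z)
      (char<=? #\0 c #\9)
      (memv c '(#\- #\. #\_ #\~))))

(define (normalize-escapes s)
  (let loop ((i 0) (out '()))
    (cond ((>= i (string-length s))
           (list->string (reverse out)))
          ;; A well-formed %XX escape: the hex value, or #f otherwise.
          ((and (char=? (string-ref s i) #\%)
                (< (+ i 2) (string-length s))
                (string->number (substring s (+ i 1) (+ i 3)) 16))
           => (lambda (n)
                (if (unreserved? (integer->char n))
                    ;; Decode escapes for unreserved characters.
                    (loop (+ i 3) (cons (integer->char n) out))
                    ;; Keep reserved escapes, uppercasing the hex.
                    (loop (+ i 3)
                          (append (reverse
                                   (string->list
                                    (string-upcase
                                     (substring s i (+ i 3)))))
                                  out)))))
          (else (loop (+ i 1) (cons (string-ref s i) out))))))
```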

> Also note that encoding of query strings is a whole topic unto itself.
> The W3C recommendation (in the HTML spec!) is that & is no longer used
> to separate query arguments.  Instead, servers should use ;.  The reasons
> behind that are pretty inane (because apparently for many people it's too
> hard to get the & encoding right inside HTML), but the reality is
> that now many servers accept both & and ;, some still only accept & and
> there are probably servers that only accept ; too.  Search the URI-common
> code [1] for "application/x-www-form-urlencoded" for the gory details.
> In any case, uri-common opts to default to accepting both, but emitting
> semicolons by default.  However, this _must_ be overridable, because like
> I said, there are servers in the wild that don't accept semicolon-
> separated query strings.  It's a total shit show.  This is more of a
> client issue than a server issue, but a generic URI handling library
> needs to take it into account.

Ugh, good points once again :D I didn't even realize that URIs and query
strings are specified by different standards organizations.

Do you think we should supply fully parsed query strings to the URL
dispatcher? Does anyone actually dispatch on the query string in
practice? I've never thought about that. At face value it seems too brittle.

Once again we could have the accessor procedure with the friendly and
obvious name give the conservatively pre-decoded query parameters, and
special accessors would give the raw query string or some other decoding.
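As a sketch of the raw side of that, a splitter that accepts both & and
; separators (as uri-common does by default) could look like this.
Names are illustrative; %XX decoding is left out, and empty segments
are dropped:

```scheme
;; Split a raw query string on both "&" and ";".
(define (split-query q)
  (let loop ((chars (string->list q)) (cur '()) (pairs '()))
    ;; Flush the current segment, dropping it if empty.
    (define (finish)
      (if (null? cur)
          pairs
          (cons (list->string (reverse cur)) pairs)))
    (cond ((null? chars) (reverse (finish)))
          ((memv (car chars) '(#\& #\;))
           (loop (cdr chars) '() (finish)))
          (else (loop (cdr chars) (cons (car chars) cur) pairs)))))

;; (split-query "a=1;b=2&c=3") => ("a=1" "b=2" "c=3")
```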

> Note that in URI-generic, we encode /foo/bar/ as '(/ "foo" "bar" "").
> The empty string at the end indicates the trailing slash.  This makes a
> difference when running the relative uri resolution algorithm, so it is
> important to keep it.
> Something like "///" is also kept as a path consisting of three empty
> components in uri-generic.  Backslashes are not treated specially (but
> on Windows, you have to take care when converting such components to
> file system paths.  You have to do this for drive letters too, of course,
> and in UNIX you also have to remember you can have %2F-encoded slashes
> inside path components, which we keep.  Therefore, I think this is an
> orthogonal problem which could potentially be solved by a file system
> path library; a different SRFI altogether).
> For consecutive slashes, one could imagine a normalisation procedure
> which can optionally be called by the user.  But I don't think this
> should be done automatically, because one could have a path containing
> some identifier or tag, like /post/foo/edit, and it's not inconceivable
> that the path /post//edit would edit the post identified by the empty
> string.

Thanks for the detailed explanation. Turning consecutive slashes into
empty components ("") makes sense.
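A sketch of a path splitter with that behaviour, producing the list
representation quoted above (a leading `/` symbol, empty strings kept
for consecutive and trailing slashes); the procedure name is
illustrative:

```scheme
(define (split-path p)
  (let ((absolute? (and (> (string-length p) 0)
                        (char=? (string-ref p 0) #\/))))
    (let loop ((chars (if absolute?
                          (cdr (string->list p))
                          (string->list p)))
               (cur '())
               (parts '()))
      (cond ((null? chars)
             ;; Flush the last component; "" marks a trailing slash.
             (let ((parts (cons (list->string (reverse cur)) parts)))
               (if absolute?
                   (cons '/ (reverse parts))
                   (reverse parts))))
            ((char=? (car chars) #\/)
             ;; Consecutive slashes yield "" components; keep them.
             (loop (cdr chars) '()
                   (cons (list->string (reverse cur)) parts)))
            (else (loop (cdr chars) (cons (car chars) cur) parts))))))

;; (split-path "/foo/bar/") => (/ "foo" "bar" "")
;; (split-path "///")       => (/ "" "" "")
```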

I think we should normalize the URLs that go into the router/dispatcher
because most people will not realize that they should think about these
edge cases. We can still let people access the non-normalized URL via
special accessor procedures if they want it.

By the same token, if the URL dispatcher captures URL components into
variables, the obvious way to write the URL specifications should be a
conservative one -- for example, in this URL specification:

     ("document" int "comment" int "edit")

The `int` parser should not permit negative numbers or leading zeros
because many people will not realize they should consider the issue.

If these URL-component-into-variable parsers are strict and
conservative, that will also help catch errors due to (lack of) URL
normalization. E.g. if the "string" parser rejects blank strings (as it
probably should -- if a website's URL layout uses blank strings in URLs,
then someone is doing something too fancy with URLs) then it doesn't
matter if the URL parser keeps empty components.
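A strict `int` parser in that spirit might look like this sketch (the
name is illustrative): it accepts only non-empty strings of decimal
digits with no sign and no leading zero, except for "0" itself, and
returns #f for everything else.

```scheme
(define (parse-strict-int s)
  (define (digit? c) (char<=? #\0 c #\9))
  (and (> (string-length s) 0)
       ;; Every character must be a decimal digit (no sign, no blanks).
       (let loop ((i 0))
         (or (= i (string-length s))
             (and (digit? (string-ref s i)) (loop (+ i 1)))))
       ;; No leading zero, except for "0" itself.
       (or (string=? s "0")
           (not (char=? (string-ref s 0) #\0)))
       (string->number s)))

;; (parse-strict-int "42")  => 42
;; (parse-strict-int "042") => #f
;; (parse-strict-int "-7")  => #f
```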

In all parts of the API, the boring, safe and ordinary way to do things
should also be the obvious way, IMHO :)

> Indeed, they already have the same test suite and interface.  I started
> out with the alternatives because I did not like the current
> implementation due to it being hard to maintain (it's a direct port of a
> Haskell library which relies heavily on pattern matching/list
> destructuring which doesn't work so well in Scheme).  Then, I got hooked
> at trying out all the parser generators we have in CHICKEN for their
> convenience and performance.  I wrote up some notes about those too, but
> in the end I think I prefer the irregex-based one, which is a good
> tradeoff between the two.

That's excellent. There is a _lot_ of good work done in the Scheme
community. People just keep quiet about it :)

> Most servers do _not_ parse anything.  IMO this causes problems because
> different components in the stack have to implement their own parsing,
> and of course every component does it differently, leading to
> inconsistencies which may be exploitable as security problems, and it
> also seems to me to be less performant, because different components
> have to be re-parsing all the time.

These are very good points.

I compiled a wiki page with some links to what other languages are doing.

It seems several of them take inspiration from CGI and split the URL
into server, port, path and query string. But they do not parse the path
and query string.

Your description convinces me that we should pass a fully parsed URL
object into the request handler (i.e. similar to what the uri-common
Chicken egg now gives). It would still have:

* the server and port parts intact, so handlers can tell which virtual
host they are serving

* the username and password parts (is this a dumb idea?)

* the path as parsed by uri-generic, accessible with a separate
procedure from the "friendly path", i.e. without decoding colons and the