What libraries we need Lassi Kortela (07 Apr 2019 08:55 UTC)
Re: What libraries we need Peter Bex (07 Apr 2019 09:31 UTC)
URI/URL handling Lassi Kortela (07 Apr 2019 10:11 UTC)
Re: URI/URL handling Peter Bex (07 Apr 2019 10:56 UTC)
Re: URI/URL handling Lassi Kortela (07 Apr 2019 12:03 UTC)
Re: URI/URL handling Lassi Kortela (07 Apr 2019 12:46 UTC)
Re: URI/URL handling Peter Bex (07 Apr 2019 14:20 UTC)
Re: URI/URL handling Lassi Kortela (07 Apr 2019 15:06 UTC)
Re: URI/URL handling Peter Bex (07 Apr 2019 15:39 UTC)
Re: URI/URL handling Lassi Kortela (07 Apr 2019 15:52 UTC)
Re: URI/URL handling Peter Bex (07 Apr 2019 16:03 UTC)
Re: URI/URL handling Lassi Kortela (07 Apr 2019 16:30 UTC)
Re: URI/URL handling Arthur A. Gleckler (09 Apr 2019 21:06 UTC)
Re: What libraries we need Arthur A. Gleckler (09 Apr 2019 20:49 UTC)
On Sun, Apr 07, 2019 at 01:11:45PM +0300, Lassi Kortela wrote:
> Thanks for the great comments Peter! I for one love working with people
> who care about getting things right at this level of detail.

This is what I like about the Scheme community; we care about getting
things right :)

> > It would be nice if we can come up with a cleaner API for this.
>
> In the archive file interface, I do this:
>
> (archive-entry-path entry) => safe normalized pathname as list
> (archive-entry-raw-path entry) => raw unsafe pathname as bytevector
>
> I've generally had good experiences with this kind of API. I.e. the
> procedure with the short and obvious name returns the thing people
> usually want, and there's a separate procedure to get the
> raw/unsafe/complex version.
>
> We could have something like:
>
> (uri-path "/foo%3Abar/qux/") => (/ "foo:bar" "qux")
> (uri-raw-path "/foo%3Abar/qux/") => (/ "foo%3Abar" "qux")

I think this will work. If you update the path, it will clobber the raw
path, presumably? Or should the code try hard to maintain components
that weren't changed?

In uri-common, (uri-update uri path: '(/ "foo:bar" "mooh")) will cause
the raw path to always be "/foo%3Abar/mooh", even if it was originally
/foo:bar/qux (because the colon is, or MAY BE, treated as special by a
receiving server, and we don't want it to be treated specially). If we
tried hard we could detect that the prefix is unchanged (after
normalization) and not touch it, but I think that's probably too much
magic.

Ideally there's a way to override this, because there are some servers
out there which don't allow percent-encoded characters everywhere and
insist on having the raw characters, even if those are not treated
specially.

Also note that encoding of query strings is a whole topic unto itself.
The W3C recommendation (in the HTML spec!) is that & is no longer used
to separate query arguments. Instead, servers should use ;.
The reasons behind that are pretty inane (because apparently for many
people it's too hard to get the & encoding right inside HTML), but the
reality is that now many servers accept both & and ;, some still only
accept &, and there are probably servers that only accept ; too. Search
the uri-common code [1] for "application/x-www-form-urlencoded" for the
gory details.

In any case, uri-common opts to default to accepting both, but emitting
semicolons by default. However, this _must_ be overridable, because
like I said, there are servers in the wild that don't accept
semicolon-separated query strings. It's a total shit show. This is more
of a client issue than a server issue, but a generic URI handling
library needs to take it into account.

> By the way, what about paths that contain more than one consecutive
> slash: e.g. (uri-path "///")? And relative paths that don't start with
> a slash? What happens when a URI path contains a backslash?

Note that in uri-generic, we encode /foo/bar/ as '(/ "foo" "bar" "").
The empty string at the end indicates the trailing slash. This makes a
difference when running the relative URI resolution algorithm, so it is
important to keep it. Something like "///" is also kept as a path
consisting of three empty components in uri-generic.

Backslashes are not treated specially, but on Windows you have to take
care when converting such components to file system paths. You have to
do this for drive letters too, of course, and on UNIX you also have to
remember you can have %2F-encoded slashes inside path components, which
we keep. Therefore, I think this is an orthogonal problem which could
potentially be solved by a file system path library; a different SRFI
altogether.

For consecutive slashes, one could imagine a normalisation procedure
which can optionally be called by the user.
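
To make the list representation and the optional normalisation step concrete, here is a minimal sketch in portable R7RS Scheme. It is not code from uri-generic: the names split-on, path-string->list and collapse-empty-components are hypothetical, the representation of relative paths as a plain list of strings is an assumption, and percent-decoding is left out entirely.

```scheme
(import (scheme base))

;; Split STR at every occurrence of the character SEP.
;; Empty pieces are kept, so "///" yields four empty strings.
(define (split-on str sep)
  (let loop ((chars (reverse (string->list str)))
             (piece '())
             (pieces '()))
    (cond ((null? chars)
           (cons (list->string piece) pieces))
          ((char=? (car chars) sep)
           (loop (cdr chars) '() (cons (list->string piece) pieces)))
          (else
           (loop (cdr chars) (cons (car chars) piece) pieces)))))

;; Hypothetical helper: turn a raw path string into a component list.
;; A leading slash becomes the symbol / at the head of the list; a
;; trailing slash shows up as a final "" component.
;; Assumption: relative paths are returned as a plain list of strings.
(define (path-string->list path)
  (let ((pieces (split-on path #\/)))
    (if (and (pair? (cdr pieces)) (string=? (car pieces) ""))
        (cons '/ (cdr pieces))
        pieces)))

;; Optional normalisation, only run when the caller asks for it:
;; drop empty components, except a final one (the trailing slash).
(define (collapse-empty-components path)
  (let loop ((rest path) (acc '()))
    (cond ((null? rest) (reverse acc))
          ((and (string? (car rest))
                (string=? (car rest) "")
                (pair? (cdr rest)))
           (loop (cdr rest) acc))
          (else (loop (cdr rest) (cons (car rest) acc))))))

(path-string->list "/foo/bar/")   ; => (/ "foo" "bar" "")
(path-string->list "///")         ; => (/ "" "" "")
(collapse-empty-components
 (path-string->list "/post//edit")) ; => (/ "post" "edit")
```

Note how the last call changes which resource the path names: the empty component between "post" and "edit" is gone after collapsing.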
But I don't think this should be done automatically, because one could
have a path containing some identifier or tag, like /post/foo/edit, and
it's not inconceivable that the path /post//edit would edit the post
identified by the empty string.

> From your description, it sounds like you did exactly the right thing
> on all counts.

We really try!

> > Note that there are several alternative implementations using
> > different parser generators inside the "alternatives" directory. The
> > main one still uses "matchable" and the implementation is a bit
> > fiddly (but fast as hell). There's one in irregex too (which could be
> > easily ported to SRFI-115) which comes close, performance-wise, and
> > is a lot easier to understand and maintain.
>
> Could we specify a common interface for these implementations (or do
> they already have the same interface)? This means they can also share
> the same test suite, which ensures they are interchangeable (except for
> speed and compatibility).

Indeed, they already have the same test suite and interface. I started
out with the alternatives because I did not like the current
implementation due to it being hard to maintain (it's a direct port of a
Haskell library which relies heavily on pattern matching/list
destructuring, which doesn't work so well in Scheme). Then I got hooked
on trying out all the parser generators we have in CHICKEN for their
convenience and performance. I wrote up some notes about those too, but
in the end I think I prefer the irregex-based one, which is a good
tradeoff between the two.

> The request abstraction could be specified so that it just gets the raw
> URL as a string from the HTTP server. Then the application could parse
> it before passing it on to the router/dispatcher (or the r/d could call
> the library to parse it). But is it more convenient if the request
> object already contains the parsed URL? Do e.g. Apache or Nginx modules
> get pre-parsed URLs from those web servers?
> In that case it would probably not make sense to parse it again
> ourselves.

Most servers do _not_ parse anything. IMO this causes problems because
different components in the stack have to implement their own parsing,
and of course every component does it differently, leading to
inconsistencies which may be exploitable as security problems. It also
seems to me to be less performant, because different components have to
re-parse all the time.

[1] http://bugs.call-cc.org/browser/release/5/uri-common/trunk/uri-common.scm#L237

Cheers,
Peter