Patchwork D2776: hgweb: use a multidict for holding query string parameters

login
register
mail settings
Submitter phabricator
Date March 10, 2018, 4:53 a.m.
Message ID <differential-rev-PHID-DREV-pcpj3jhyxdrgmsvfat24-req@phab.mercurial-scm.org>
Download mbox | patch
Permalink /patch/29238/
State Superseded
Headers show

Comments

phabricator - March 10, 2018, 4:53 a.m.
indygreg created this revision.
Herald added a subscriber: mercurial-devel.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  My intention with refactoring the WSGI code was to make it easier
  to read. I initially wanted to vendor and use WebOb, because it seems
  to be a pretty reasonable abstraction layer for WSGI. However, it isn't
  using relative imports and I didn't want to deal with the hassle of
  patching it. But that doesn't mean we can't use good ideas from WebOb.
  
  WebOb has a "multidict" data structure for holding parsed query string
  and POST form data. It quacks like a dict but allows you to store
  multiple values for each key. It offers mechanisms to return just one
  value, all values, or return 1 value asserting that only 1 value is
  set. I quite like its API.
  
  This commit implements a read-only "multidict" in the spirit of
  WebOb's multidict.
  
  We replace the query string attributes of our parsed request with
  an instance of it.
  
  For the record, I'm not a huge fan of the method to convert instances
  to a dict of lists. But this is needed to support the wsgirequest.form
  API.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2776

AFFECTED FILES
  mercurial/hgweb/request.py
  mercurial/wireprotoserver.py

CHANGE DETAILS




To: indygreg, #hg-reviewers
Cc: mercurial-devel
phabricator - March 12, 2018, 9:04 p.m.
durin42 added inline comments.

INLINE COMMENTS

> request.py:31
>  
> +class multidict(object):
> +    """A dict like object that can store multiple values for a key.

I'm a little uncomfortable with this not advertising that it's not a hash table, and htat lookups are O(N). Maybe send a docstring followup?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2776

To: indygreg, #hg-reviewers
Cc: durin42, mercurial-devel
phabricator - March 14, 2018, 4:33 a.m.
martinvonz added inline comments.

INLINE COMMENTS

> request.py:39-41
> +        # Stores (key, value) 2-tuples. This isn't the most efficient. But we
> +        # don't rely on parameters that much, so it shouldn't be a perf issue.
> +        # we can always add dict for fast lookups.

Sure, that's probably fine, but why? Wouldn't it be easier to write it as dict of lists anyway?

> request.py:45
> +    def __getitem__(self, key):
> +        """Returns the last set value for a key."""
> +        for k, v in reversed(self._items):

Would it make sense to make this an error if there isn't exactly one value?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2776

To: indygreg, #hg-reviewers, durin42
Cc: martinvonz, durin42, mercurial-devel
phabricator - March 14, 2018, 9:39 a.m.
av6 added inline comments.

INLINE COMMENTS

> martinvonz wrote in request.py:45
> Would it make sense to make this an error if there isn't exactly one value?

I don't think it would. AIUI, that would make handling query strings like ?style=foo&style=bar just more difficult. I'm not aware of any web tools where default access method wouldn't simply return that last value.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2776

To: indygreg, #hg-reviewers, durin42
Cc: av6, martinvonz, durin42, mercurial-devel
phabricator - March 14, 2018, 2:51 p.m.
indygreg added inline comments.

INLINE COMMENTS

> durin42 wrote in request.py:31
> I'm a little uncomfortable with this not advertising that it's not a hash table, and htat lookups are O(N). Maybe send a docstring followup?

I agree with the sentiment about this being a list in disguise. One reason I didn't bother to optimize it is because I don't think we do any `qsparams` lookups in loops and I don't believe we have any more than ~10 arguments to any single request. So even if we have an `O(n^2)` situation, n is so small that it doesn't matter.

I can add a follow-up comment easily enough. Or we could just index the fields by key. That doesn't seem too difficult.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2776

To: indygreg, #hg-reviewers, durin42
Cc: av6, martinvonz, durin42, mercurial-devel
phabricator - March 15, 2018, 3:38 p.m.
yuja added inline comments.

INLINE COMMENTS

> indygreg wrote in request.py:31
> I agree with the sentiment about this being a list in disguise. One reason I didn't bother to optimize it is because I don't think we do any `qsparams` lookups in loops and I don't believe we have any more than ~10 arguments to any single request. So even if we have an `O(n^2)` situation, n is so small that it doesn't matter.
> 
> I can add a follow-up comment easily enough. Or we could just index the fields by key. That doesn't seem too difficult.

Isn't the `n` controllable by a malicious user?

I agree with @martinvonz in that it would be probably easier to
write as a dict of lists (or dict + lists).

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2776

To: indygreg, #hg-reviewers, durin42
Cc: yuja, av6, martinvonz, durin42, mercurial-devel
phabricator - March 15, 2018, 6:27 p.m.
indygreg added a comment.


  I may have a shot at rewriting this class today. I do want to prioritize on adding missing components to the new wire protocol so reviewers have better context, however. So if someone else wants to do the work, I'd happily review it. Otherwise, I should get around to it sometime in the next few days.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2776

To: indygreg, #hg-reviewers, durin42
Cc: yuja, av6, martinvonz, durin42, mercurial-devel

Patch

diff --git a/mercurial/wireprotoserver.py b/mercurial/wireprotoserver.py
--- a/mercurial/wireprotoserver.py
+++ b/mercurial/wireprotoserver.py
@@ -170,10 +170,10 @@ 
     # HTTP version 1 wire protocol requests are denoted by a "cmd" query
     # string parameter. If it isn't present, this isn't a wire protocol
     # request.
-    if 'cmd' not in req.querystringdict:
+    if 'cmd' not in req.qsparams:
         return False
 
-    cmd = req.querystringdict['cmd'][0]
+    cmd = req.qsparams['cmd']
 
     # The "cmd" request parameter is used by both the wire protocol and hgweb.
     # While not all wire protocol commands are available for all transports,
diff --git a/mercurial/hgweb/request.py b/mercurial/hgweb/request.py
--- a/mercurial/hgweb/request.py
+++ b/mercurial/hgweb/request.py
@@ -28,6 +28,84 @@ 
     util,
 )
 
+class multidict(object):
+    """A dict like object that can store multiple values for a key.
+
+    Used to store parsed request parameters.
+
+    This is inspired by WebOb's class of the same name.
+    """
+    def __init__(self):
+        # Stores (key, value) 2-tuples. This isn't the most efficient. But we
+        # don't rely on parameters that much, so it shouldn't be a perf issue.
+        # we can always add dict for fast lookups.
+        self._items = []
+
+    def __getitem__(self, key):
+        """Returns the last set value for a key."""
+        for k, v in reversed(self._items):
+            if k == key:
+                return v
+
+        raise KeyError(key)
+
+    def __setitem__(self, key, value):
+        """Replace a values for a key with a new value."""
+        try:
+            del self[key]
+        except KeyError:
+            pass
+
+        self._items.append((key, value))
+
+    def __delitem__(self, key):
+        """Delete all values for a key."""
+        oldlen = len(self._items)
+
+        self._items[:] = [(k, v) for k, v in self._items if k != key]
+
+        if oldlen == len(self._items):
+            raise KeyError(key)
+
+    def __contains__(self, key):
+        return any(k == key for k, v in self._items)
+
+    def __len__(self):
+        return len(self._items)
+
+    def add(self, key, value):
+        """Add a new value for a key. Does not replace existing values."""
+        self._items.append((key, value))
+
+    def getall(self, key):
+        """Obtains all values for a key."""
+        return [v for k, v in self._items if k == key]
+
+    def getone(self, key):
+        """Obtain a single value for a key.
+
+        Raises KeyError if key not defined or it has multiple values set.
+        """
+        vals = self.getall(key)
+
+        if not vals:
+            raise KeyError(key)
+
+        if len(vals) > 1:
+            raise KeyError('multiple values for %r' % key)
+
+        return vals[0]
+
+    def asdictoflists(self):
+        d = {}
+        for k, v in self._items:
+            if k in d:
+                d[k].append(v)
+            else:
+                d[k] = [v]
+
+        return d
+
 @attr.s(frozen=True)
 class parsedrequest(object):
     """Represents a parsed WSGI request.
@@ -56,10 +134,8 @@ 
     havepathinfo = attr.ib()
     # Raw query string (part after "?" in URL).
     querystring = attr.ib()
-    # List of 2-tuples of query string arguments.
-    querystringlist = attr.ib()
-    # Dict of query string arguments. Values are lists with at least 1 item.
-    querystringdict = attr.ib()
+    # multidict of query string parameters.
+    qsparams = attr.ib()
     # wsgiref.headers.Headers instance. Operates like a dict with case
     # insensitive keys.
     headers = attr.ib()
@@ -157,14 +233,9 @@ 
 
     # We store as a list so we have ordering information. We also store as
     # a dict to facilitate fast lookup.
-    querystringlist = util.urlreq.parseqsl(querystring, keep_blank_values=True)
-
-    querystringdict = {}
-    for k, v in querystringlist:
-        if k in querystringdict:
-            querystringdict[k].append(v)
-        else:
-            querystringdict[k] = [v]
+    qsparams = multidict()
+    for k, v in util.urlreq.parseqsl(querystring, keep_blank_values=True):
+        qsparams.add(k, v)
 
     # HTTP_* keys contain HTTP request headers. The Headers structure should
     # perform case normalization for us. We just rewrite underscore to dash
@@ -197,8 +268,7 @@ 
                          dispatchparts=dispatchparts, dispatchpath=dispatchpath,
                          havepathinfo='PATH_INFO' in env,
                          querystring=querystring,
-                         querystringlist=querystringlist,
-                         querystringdict=querystringdict,
+                         qsparams=qsparams,
                          headers=headers,
                          bodyfh=bodyfh)
 
@@ -328,7 +398,7 @@ 
         self.run_once = wsgienv[r'wsgi.run_once']
         self.env = wsgienv
         self.req = parserequestfromenv(wsgienv, inp)
-        self.form = self.req.querystringdict
+        self.form = self.req.qsparams.asdictoflists()
         self.res = wsgiresponse(self.req, start_response)
         self._start_response = start_response
         self.server_write = None