Patchwork [2,of,2,V4] hgweb: teach archive how to handle file patterns

login
register
mail settings
Submitter Angel Ezquerra
Date Feb. 10, 2013, 10:56 a.m.
Message ID <fb655ad16f6675265da9.1360493806@Angel-PC.localdomain>
Download mbox | patch
Permalink /patch/917/
State Superseded, archived
Headers show

Comments

Angel Ezquerra - Feb. 10, 2013, 10:56 a.m.
# HG changeset patch
# User Angel Ezquerra <angel.ezquerra@gmail.com>
# Date 1360493525 -3600
# Node ID fb655ad16f6675265da9d472ded7140a223fb283
# Parent  be3e96a41d0f4b7a1f1dd443f5261d6eeb66626a
hgweb: teach archive how to handle file patterns

The archive web command now takes into account the "file" request entry, if one
is provided.

The provided "file" is processed as a "path" pattern by default, which makes it
easy to only archive a certain file or directory. However, it is possible to
specify a different type of pattern, such as relglob by specifying it
explicitly on the query URL. Note that only "safe" patterns are allowed. Safe
patterns are 'path', 'relpath', 'glog' and 'relglob'. Other pattern types are
not allowed because they could be expensive to calculate.

With this change hgweb can to process requests such as:

1. http://mercurial.selenic.com/hg/archive/tip.zip/mercurial/templates

    This will download all files on the mercurial/templates directory as a zip
    file

2. http://mercurial.selenic.com/hg/archive/tip.tar.gz/relglob:*.py

    This will download all *.py files in the repository into a tar.gz file.

An so forth.

Note that this is a first step to add support for downloading directories from
the web interface. Currently the only way to use this feature is by manually
constructing the URL that you want to download. We will have to modify the
archiveentry map entry on the different templates so that it adds the current
folder path to the archive links.

This revision also adds a two tests for this feature to test-archive.t. The
first tests the selective archive feature and the second tests that the server
rejects "unsafe" patterns.
Mads Kiilerich - Feb. 14, 2013, 1:30 a.m.
Angel Ezquerra wrote, On 02/10/2013 11:56 AM:
> # HG changeset patch
> # User Angel Ezquerra <angel.ezquerra@gmail.com>
> # Date 1360493525 -3600
> # Node ID fb655ad16f6675265da9d472ded7140a223fb283
> # Parent  be3e96a41d0f4b7a1f1dd443f5261d6eeb66626a
> hgweb: teach archive how to handle file patterns
>
> The archive web command now takes into account the "file" request entry, if one
> is provided.
>
> The provided "file" is processed as a "path" pattern by default, which makes it
> easy to only archive a certain file or directory. However, it is possible to
> specify a different type of pattern, such as relglob by specifying it
> explicitly on the query URL. Note that only "safe" patterns are allowed. Safe
> patterns are 'path', 'relpath', 'glog' and 'relglob'. Other pattern types are
> not allowed because they could be expensive to calculate.
>
> With this change hgweb can to process requests such as:
>
> 1. http://mercurial.selenic.com/hg/archive/tip.zip/mercurial/templates
>
>      This will download all files on the mercurial/templates directory as a zip
>      file
>
> 2. http://mercurial.selenic.com/hg/archive/tip.tar.gz/relglob:*.py
>
>      This will download all *.py files in the repository into a tar.gz file.
>
> An so forth.
>
> Note that this is a first step to add support for downloading directories from
> the web interface. Currently the only way to use this feature is by manually
> constructing the URL that you want to download. We will have to modify the
> archiveentry map entry on the different templates so that it adds the current
> folder path to the archive links.
>
> This revision also adds a two tests for this feature to test-archive.t. The
> first tests the selective archive feature and the second tests that the server
> rejects "unsafe" patterns.
>
> diff --git a/mercurial/hgweb/webcommands.py b/mercurial/hgweb/webcommands.py
> --- a/mercurial/hgweb/webcommands.py
> +++ b/mercurial/hgweb/webcommands.py
> @@ -803,6 +803,17 @@
>       if cnode == key or key == 'tip':
>           arch_version = short(cnode)
>       name = "%s-%s" % (reponame, arch_version)
> +
> +    ctx = webutil.changectx(web.repo, req)
> +    pats = []
> +    file = req.form.get('file', None)
> +    defaultpat = 'path'
> +    if file:
> +        pats = [req.form['file'][0]]
> +        if not scmutil.patsaresafe(pats, defaultpat):
> +            msg = 'Archive pattern not allowed: %s' % pats[0]
> +            raise ErrorResponse(HTTP_FORBIDDEN, msg)
> +
>       mimetype, artype, extension, encoding = web.archive_specs[type_]
>       headers = [
>           ('Content-Disposition', 'attachment; filename=%s%s' % (name, extension))
> @@ -812,9 +823,9 @@
>       req.headers.extend(headers)
>       req.respond(HTTP_OK, mimetype)
>   
> -    ctx = webutil.changectx(web.repo, req)
> +    matchfn = scmutil.match(ctx, pats, default=defaultpat)
>       archival.archive(web.repo, req, cnode, artype, prefix=name,
> -                     matchfn=scmutil.match(ctx, []),
> +                     matchfn=matchfn,
>                        subrepos=web.configbool("web", "archivesubrepos"))
>       return []
>   
> diff --git a/mercurial/scmutil.py b/mercurial/scmutil.py
> --- a/mercurial/scmutil.py
> +++ b/mercurial/scmutil.py
> @@ -682,6 +682,15 @@
>   
>       return l
>   
> +def patsaresafe(pats, defaultpattype):
> +    for pat in pats:
> +        pattype = defaultpattype
> +        if ':' in pat:
> +            pattype = pat.split(':')[0]
> +        if pattype.lower() not in ('path', 'relpath', 'glog', 'relglob'):

(btw: relpath and relglob are completely undocumented in the patterns help.)

'glog' seems to be a typo. That indicates that the feature doesn't have 
good test coverage and also haven't been fully tested manually.

But both kinds of globs are in in my opinion not sufficiently safe. 
Consider for example the execution time for
   hg locate "glob:*********************x"
and how something like that can be used to denial of service attacks in 
hgweb.


I must say that I am no big fan of this feature as it is.
* It is conceptually too complex compared to the value it adds.
* Patterns can not be made explorable in hgweb and it is undocumented 
and there is no good place to document it.
* URLs thus has to be constructed manually ... and it is not obvious how 
to encode for instance globs with '?' in a url.
* It doesn't have the full power of specifying multiple patterns with -X 
and -I as we are used to when using patterns.
* It is not obvious which subset of patterns that can be used.

If something in this area is needed then I would suggest focusing on 
just making it possible to download a single directory as tar file. 
There is no need for a pattern - we only need a path after the archive, 
for instance .../archive/REV.tar.bz2/sub/dir .

/Mads
Angel Ezquerra - Feb. 14, 2013, 8:22 a.m.
On Thu, Feb 14, 2013 at 2:30 AM, Mads Kiilerich <mads@kiilerich.com> wrote:
> Angel Ezquerra wrote, On 02/10/2013 11:56 AM:
>
>> # HG changeset patch
>> # User Angel Ezquerra <angel.ezquerra@gmail.com>
>> # Date 1360493525 -3600
>> # Node ID fb655ad16f6675265da9d472ded7140a223fb283
>> # Parent  be3e96a41d0f4b7a1f1dd443f5261d6eeb66626a
>> hgweb: teach archive how to handle file patterns
>>
>> The archive web command now takes into account the "file" request entry,
>> if one
>> is provided.
>>
>> The provided "file" is processed as a "path" pattern by default, which
>> makes it
>> easy to only archive a certain file or directory. However, it is possible
>> to
>> specify a different type of pattern, such as relglob by specifying it
>> explicitly on the query URL. Note that only "safe" patterns are allowed.
>> Safe
>> patterns are 'path', 'relpath', 'glog' and 'relglob'. Other pattern types
>> are
>> not allowed because they could be expensive to calculate.
>>
>> With this change hgweb can to process requests such as:
>>
>> 1. http://mercurial.selenic.com/hg/archive/tip.zip/mercurial/templates
>>
>>      This will download all files on the mercurial/templates directory as
>> a zip
>>      file
>>
>> 2. http://mercurial.selenic.com/hg/archive/tip.tar.gz/relglob:*.py
>>
>>      This will download all *.py files in the repository into a tar.gz
>> file.
>>
>> An so forth.
>>
>> Note that this is a first step to add support for downloading directories
>> from
>> the web interface. Currently the only way to use this feature is by
>> manually
>> constructing the URL that you want to download. We will have to modify the
>> archiveentry map entry on the different templates so that it adds the
>> current
>> folder path to the archive links.
>>
>> This revision also adds a two tests for this feature to test-archive.t.
>> The
>> first tests the selective archive feature and the second tests that the
>> server
>> rejects "unsafe" patterns.
>>
>> diff --git a/mercurial/hgweb/webcommands.py
>> b/mercurial/hgweb/webcommands.py
>> --- a/mercurial/hgweb/webcommands.py
>> +++ b/mercurial/hgweb/webcommands.py
>> @@ -803,6 +803,17 @@
>>       if cnode == key or key == 'tip':
>>           arch_version = short(cnode)
>>       name = "%s-%s" % (reponame, arch_version)
>> +
>> +    ctx = webutil.changectx(web.repo, req)
>> +    pats = []
>> +    file = req.form.get('file', None)
>> +    defaultpat = 'path'
>> +    if file:
>> +        pats = [req.form['file'][0]]
>> +        if not scmutil.patsaresafe(pats, defaultpat):
>> +            msg = 'Archive pattern not allowed: %s' % pats[0]
>> +            raise ErrorResponse(HTTP_FORBIDDEN, msg)
>> +
>>       mimetype, artype, extension, encoding = web.archive_specs[type_]
>>       headers = [
>>           ('Content-Disposition', 'attachment; filename=%s%s' % (name,
>> extension))
>> @@ -812,9 +823,9 @@
>>       req.headers.extend(headers)
>>       req.respond(HTTP_OK, mimetype)
>>   -    ctx = webutil.changectx(web.repo, req)
>> +    matchfn = scmutil.match(ctx, pats, default=defaultpat)
>>       archival.archive(web.repo, req, cnode, artype, prefix=name,
>> -                     matchfn=scmutil.match(ctx, []),
>> +                     matchfn=matchfn,
>>                        subrepos=web.configbool("web", "archivesubrepos"))
>>       return []
>>   diff --git a/mercurial/scmutil.py b/mercurial/scmutil.py
>> --- a/mercurial/scmutil.py
>> +++ b/mercurial/scmutil.py
>> @@ -682,6 +682,15 @@
>>         return l
>>   +def patsaresafe(pats, defaultpattype):
>> +    for pat in pats:
>> +        pattype = defaultpattype
>> +        if ':' in pat:
>> +            pattype = pat.split(':')[0]
>> +        if pattype.lower() not in ('path', 'relpath', 'glog', 'relglob'):

Mads, thanks for your review. You raise some very valid concerns.
Please see my comments below.

> (btw: relpath and relglob are completely undocumented in the patterns help.)
>
> 'glog' seems to be a typo. That indicates that the feature doesn't have good
> test coverage and also haven't been fully tested manually.

The error is harmless in the sense that it means that glob patterns
would be rejected as well (i.e. the patsaresafe() method is actually
safer than it should :-)

I could add tests for all safe pattern types but I think that would be
overkill? Anyway I'm thinking this may not be necessary if I reduce
the scope of the pattern feature as I'll explain below.

> But both kinds of globs are in in my opinion not sufficiently safe. Consider
> for example the execution time for
>   hg locate "glob:*********************x"
> and how something like that can be used to denial of service attacks in
> hgweb.

True. The primary use cases for this feature are:

1. Download a single folder (with its subfolders)
2. Download a subset of the files, particularly the files matching a
set of filetypes, from "big" repositories.

The feature as currently implemented covers both use cases, but it
actually goes way beyond that. Maybe we can tight it up? That is,
maybe we can make sure that only those two very specific scenarios are
covered?

> I must say that I am no big fan of this feature as it is.
> * It is conceptually too complex compared to the value it adds.

I think we can simplify it while taking it a bit beyond the "download
a single folder" use case.

> * Patterns can not be made explorable in hgweb and it is undocumented and
> there is no good place to document it.

The obvious place to document them would be on the 403 error message
that we show when an "unsafe" pattern is shown.

> * URLs thus has to be constructed manually ...

True. I don't have a good solution for that. However hgweb already
lets you access URLs that you can construct but to which that there is
no way to get through the web interface (for example, explicit TAG
links, explicit RSS links, etc).

> and it is not obvious how to
> encode for instance globs with '?' in a url.

Also True. I guess that we could urldecode the selected file? In any
case I think that this is another indication that we should limit the
scope of what can be done further.

> * It doesn't have the full power of specifying multiple patterns with -X and
> -I as we are used to when using patterns.

I don't think this is necessary. What I mean is that this is not an
all or nothing proposition. The fact that we cannot do everything that
can be done with a proper mercurial client through the mercurial
command line does not mean that should not provide any of
functionality at all through hgweb!

> * It is not obvious which subset of patterns that can be used.

If we make the valid patterns much more explicit I think we can make
it quite obvious, and the 403 error can help educate the user.

What I have in mind is to completely disallow explicitly specifying
the pattern type. Instead we could just implicitly allow simple
"relglob" patterns when , which basically would behave as command line
shell globs. That means that we would only allow using a single "*" or
"**". If one of these were found we would run a relglob. If more than
one were found we would consider the pattern unsafe.

These are valid concerns. When I started implement this feature my
plan was just to add a way to download a single folder through the web
interface (to do so I still need to do another patch that changes the
templates to add the current path folder to archive links). However I
realized that it would be just as easy and almost free to also allow
to download specific types of files (what I had in mind was to
download all *.py files from a repository, for example). Now you have
convinced me that we must limit the scope of this much more.

> If something in this area is needed then I would suggest focusing on just
> making it possible to download a single directory as tar file. There is no
> need for a pattern - we only need a path after the archive, for instance
> .../archive/REV.tar.bz2/sub/dir .

This will definitely be possible and even accessible through the web
interface. I'd just like to also be able to do:

.../archive/REV.tar.bz2/sub/dir/**.{c,h}

Cheers,

Angel
Mads Kiilerich - Feb. 15, 2013, 1 a.m.
Angel Ezquerra wrote, On 02/14/2013 09:22 AM:
> On Thu, Feb 14, 2013 at 2:30 AM, Mads Kiilerich <mads@kiilerich.com> wrote:
>> Angel Ezquerra wrote, On 02/10/2013 11:56 AM:
>> If something in this area is needed then I would suggest focusing on just
>> making it possible to download a single directory as tar file. There is no
>> need for a pattern - we only need a path after the archive, for instance
>> .../archive/REV.tar.bz2/sub/dir .
> This will definitely be possible and even accessible through the web
> interface. I'd just like to also be able to do:
>
> .../archive/REV.tar.bz2/sub/dir/**.{c,h}

I don't know what Matt has suggested, but I would suggest targeting the 
"download a single folder" use case first. Support for 
".../archive/REV.tar.bz2/sub/dir/" could be exposed in the UI and it 
would be simple and it would be obvious what it did and it would be 
somewhat useful.

Patterns could perhaps be added later. It would be an extension of the 
existing functionality and the syntax would be (sufficiently) backwards 
compatible.

/Mads
Angel Ezquerra - Feb. 15, 2013, 1:23 a.m.
On Fri, Feb 15, 2013 at 2:00 AM, Mads Kiilerich <mads@kiilerich.com> wrote:
> Angel Ezquerra wrote, On 02/14/2013 09:22 AM:
>>
>> On Thu, Feb 14, 2013 at 2:30 AM, Mads Kiilerich <mads@kiilerich.com>
>> wrote:
>>>
>>> Angel Ezquerra wrote, On 02/10/2013 11:56 AM:
>>> If something in this area is needed then I would suggest focusing on just
>>> making it possible to download a single directory as tar file. There is
>>> no
>>> need for a pattern - we only need a path after the archive, for instance
>>> .../archive/REV.tar.bz2/sub/dir .
>>
>> This will definitely be possible and even accessible through the web
>> interface. I'd just like to also be able to do:
>>
>> .../archive/REV.tar.bz2/sub/dir/**.{c,h}
>
>
> I don't know what Matt has suggested, but I would suggest targeting the
> "download a single folder" use case first. Support for
> ".../archive/REV.tar.bz2/sub/dir/" could be exposed in the UI and it would
> be simple and it would be obvious what it did and it would be somewhat
> useful.
>
> Patterns could perhaps be added later. It would be an extension of the
> existing functionality and the syntax would be (sufficiently) backwards
> compatible.
>
> /Mads

OK,

I will reduce the scope of the patch for now, and expose it through
the web interface as I planned. I'll propose the extension to use some
limited sort of patterns on a separate patch.

Cheers,

Angel

Patch

diff --git a/mercurial/hgweb/webcommands.py b/mercurial/hgweb/webcommands.py
--- a/mercurial/hgweb/webcommands.py
+++ b/mercurial/hgweb/webcommands.py
@@ -803,6 +803,17 @@ 
     if cnode == key or key == 'tip':
         arch_version = short(cnode)
     name = "%s-%s" % (reponame, arch_version)
+
+    ctx = webutil.changectx(web.repo, req)
+    pats = []
+    file = req.form.get('file', None)
+    defaultpat = 'path'
+    if file:
+        pats = [req.form['file'][0]]
+        if not scmutil.patsaresafe(pats, defaultpat):
+            msg = 'Archive pattern not allowed: %s' % pats[0]
+            raise ErrorResponse(HTTP_FORBIDDEN, msg)
+
     mimetype, artype, extension, encoding = web.archive_specs[type_]
     headers = [
         ('Content-Disposition', 'attachment; filename=%s%s' % (name, extension))
@@ -812,9 +823,9 @@ 
     req.headers.extend(headers)
     req.respond(HTTP_OK, mimetype)
 
-    ctx = webutil.changectx(web.repo, req)
+    matchfn = scmutil.match(ctx, pats, default=defaultpat)
     archival.archive(web.repo, req, cnode, artype, prefix=name,
-                     matchfn=scmutil.match(ctx, []),
+                     matchfn=matchfn,
                      subrepos=web.configbool("web", "archivesubrepos"))
     return []
 
diff --git a/mercurial/scmutil.py b/mercurial/scmutil.py
--- a/mercurial/scmutil.py
+++ b/mercurial/scmutil.py
@@ -682,6 +682,15 @@ 
 
     return l
 
+def patsaresafe(pats, defaultpattype):
+    for pat in pats:
+        pattype = defaultpattype
+        if ':' in pat:
+            pattype = pat.split(':')[0]
+        if pattype.lower() not in ('path', 'relpath', 'glog', 'relglob'):
+            return False
+    return True
+
 def expandpats(pats):
     if not util.expandglobs:
         return list(pats)
diff --git a/tests/test-archive.t b/tests/test-archive.t
--- a/tests/test-archive.t
+++ b/tests/test-archive.t
@@ -100,6 +100,13 @@ 
       testing: test-archive-2c0277f05ed4/baz/bletch   OK
       testing: test-archive-2c0277f05ed4/foo   OK
   No errors detected in compressed data of archive.zip.
+  $ python getarchive.py "$TIP" gz baz | gunzip | tar tf - 2>/dev/null
+  test-archive-2c0277f05ed4/baz/bletch
+
+test that we reject unsafe patterns
+
+  $ python getarchive.py "$TIP" gz relre:baz
+  HTTP Error 403: Archive pattern not allowed: relre:baz
 
   $ "$TESTDIR/killdaemons.py" $DAEMON_PIDS