Patchwork [1,of,2,V2] hgweb: add a hook for processing LFS Batch API requests

login
register
mail settings
Submitter Matt Harbison
Date March 8, 2018, 4:49 a.m.
Message ID <89382cb20bb19e089513.1520484540@Envy>
Download mbox | patch
Permalink /patch/29114/
State New
Headers show

Comments

Matt Harbison - March 8, 2018, 4:49 a.m.
# HG changeset patch
# User Matt Harbison <matt_harbison@yahoo.com>
# Date 1519274700 18000
#      Wed Feb 21 23:45:00 2018 -0500
# Node ID 89382cb20bb19e089513b2ce69ef8acfa1f523fd
# Parent  6dab3bdb1f00932a80ffa07f80ba240c3a4b48df
hgweb: add a hook for processing LFS Batch API requests

There really isn't a clean way to give LFS a crack at intercepting the requests
without hardcoding some LFS knowledge in the core.  The rationale for this URI
is that the spec for the Batch API[1] defines the URL as the LFS server url +
'/objects/batch'.  The default git URLs are:

    Git remote: https://git-server.com/foo/bar
    LFS server: https://git-server.com/foo/bar.git/info/lfs
    Batch API: https://git-server.com/foo/bar.git/info/lfs/objects/batch

'.git/' seems like it's not something a user would normally track.  If we adhere
to how git defines the URLs, then the hg-git extension should be able to talk to
a git based server without any additional work.

I'm not sure if checking the User-Agent is a good idea, but this needs a
specialized client, and it seems like everyone else is doing it (3d48ae1aaa5e,
e7bb5fc4570c).  We can always back this off if it becomes a nuisance.  A web
browser will see "400: bad method", the same as it would before this change.

I'm not sure if None or 'pull' is the proper permission check, but the only
difference is whether or not `hgweb.allowpull` is checked.  Since nothing of
particular interest is transferred here, and the next phase handles the read or
write, treating this like web interface request seems fine.

[1] https://github.com/git-lfs/git-lfs/blob/master/docs/api/batch.md
Gregory Szorc - March 8, 2018, 6:17 a.m.
On Wed, Mar 7, 2018 at 8:49 PM, Matt Harbison <mharbison72@gmail.com> wrote:

> # HG changeset patch
> # User Matt Harbison <matt_harbison@yahoo.com>
> # Date 1519274700 18000
> #      Wed Feb 21 23:45:00 2018 -0500
> # Node ID 89382cb20bb19e089513b2ce69ef8acfa1f523fd
> # Parent  6dab3bdb1f00932a80ffa07f80ba240c3a4b48df
> hgweb: add a hook for processing LFS Batch API requests
>
> There really isn't a clean way to give LFS a crack at intercepting the
> requests
> without hardcoding some LFS knowledge in the core.  The rationale for this
> URI
> is that the spec for the Batch API[1] defines the URL as the LFS server
> url +
> '/objects/batch'.  The default git URLs are:
>
>     Git remote: https://git-server.com/foo/bar
>     LFS server: https://git-server.com/foo/bar.git/info/lfs
>     Batch API: https://git-server.com/foo/bar.git/info/lfs/objects/batch
>
> '.git/' seems like it's not something a user would normally track.  If we
> adhere
> to how git defines the URLs, then the hg-git extension should be able to
> talk to
> a git based server without any additional work.
>
> I'm not sure if checking the User-Agent is a good idea, but this needs a
> specialized client, and it seems like everyone else is doing it
> (3d48ae1aaa5e,
> e7bb5fc4570c).  We can always back this off if it becomes a nuisance.  A
> web
> browser will see "400: bad method", the same as it would before this
> change.
>
> I'm not sure if None or 'pull' is the proper permission check, but the only
> difference is whether or not `hgweb.allowpull` is checked.  Since nothing
> of
> particular interest is transferred here, and the next phase handles the
> read or
> write, treating this like web interface request seems fine.
>
> [1] https://github.com/git-lfs/git-lfs/blob/master/docs/api/batch.md
>
> diff --git a/mercurial/hgweb/hgweb_mod.py b/mercurial/hgweb/hgweb_mod.py
> --- a/mercurial/hgweb/hgweb_mod.py
> +++ b/mercurial/hgweb/hgweb_mod.py
> @@ -90,6 +90,12 @@
>          urlel = os.path.dirname(urlel)
>      return reversed(breadcrumb)
>
> +def _processlfsbatchreq(repo, req):
> +    """A hook for the LFS extension to wrap that handles requests to the
> Batch
> +    API, and returns the appropriate JSON response.
> +    """
> +    raise ErrorResponse(HTTP_NOT_FOUND)
> +
>  class requestcontext(object):
>      """Holds state/context for an individual request.
>
> @@ -376,6 +382,21 @@
>              except ErrorResponse as inst:
>                  return protohandler['handleerror'](inst)
>
> +        # Route LFS Batch API requests to the appropriate handler
> +
> +        if req.env.get(r'HTTP_USER_AGENT', r'').startswith(r'git-lfs/'):
>

I don't like filtering by the user-agent. It is recommended to not do this.
Unless it will cause problems or tons of code complexity to properly lock
out actual Git clients who may get confused by this, my vote is to remove
it and just rely on path filtering.


> +            try:
> +                path = req.env.get(r'PATH_INFO')
> +                if path == '/.git/info/lfs/objects/batch':
>

Relying on PATH_INFO is super annoying. But our code kind of sucks :/

My only suggestion would be to define `parts` in the `else` block of the
`if r'PATH_INFO' in req.env:` above and check `parts == ['.git', 'info',
'lfs', 'objects', 'batch']`. That's still a big ugly though.

Whatever happens, my guess is I will eventually write some patches to clean
the URL parsing code up.


> +                    self.check_perm(rctx, req, None)
>

Shouldn't this be `pull` instead of `None`? Otherwise, LFS won't honor
web.* to restrict data access.


> +                    return _processlfsbatchreq(rctx.repo, req)
> +                else:
> +                    raise ErrorResponse(HTTP_NOT_FOUND)
> +            except ErrorResponse as inst:
> +                req.respond(inst, 'text/plain; charset=utf-8')
> +                # No body, since only lfs clients are allowed here
> +                return ['']
> +
>          # translate user-visible url structure to internal structure
>
>          args = query.split('/', 2)
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
Matt Harbison - March 8, 2018, 1:06 p.m.
> On Mar 8, 2018, at 1:17 AM, Gregory Szorc <gregory.szorc@gmail.com> wrote:
> 
>> On Wed, Mar 7, 2018 at 8:49 PM, Matt Harbison <mharbison72@gmail.com> wrote:
>> # HG changeset patch
>> # User Matt Harbison <matt_harbison@yahoo.com>
>> # Date 1519274700 18000
>> #      Wed Feb 21 23:45:00 2018 -0500
>> # Node ID 89382cb20bb19e089513b2ce69ef8acfa1f523fd
>> # Parent  6dab3bdb1f00932a80ffa07f80ba240c3a4b48df
>> hgweb: add a hook for processing LFS Batch API requests
>> 
>> There really isn't a clean way to give LFS a crack at intercepting the requests
>> without hardcoding some LFS knowledge in the core.  The rationale for this URI
>> is that the spec for the Batch API[1] defines the URL as the LFS server url +
>> '/objects/batch'.  The default git URLs are:
>> 
>>     Git remote: https://git-server.com/foo/bar
>>     LFS server: https://git-server.com/foo/bar.git/info/lfs
>>     Batch API: https://git-server.com/foo/bar.git/info/lfs/objects/batch
>> 
>> '.git/' seems like it's not something a user would normally track.  If we adhere
>> to how git defines the URLs, then the hg-git extension should be able to talk to
>> a git based server without any additional work.
>> 
>> I'm not sure if checking the User-Agent is a good idea, but this needs a
>> specialized client, and it seems like everyone else is doing it (3d48ae1aaa5e,
>> e7bb5fc4570c).  We can always back this off if it becomes a nuisance.  A web
>> browser will see "400: bad method", the same as it would before this change.
>> 
>> I'm not sure if None or 'pull' is the proper permission check, but the only
>> difference is whether or not `hgweb.allowpull` is checked.  Since nothing of
>> particular interest is transferred here, and the next phase handles the read or
>> write, treating this like web interface request seems fine.
>> 
>> [1] https://github.com/git-lfs/git-lfs/blob/master/docs/api/batch.md
>> 
>> diff --git a/mercurial/hgweb/hgweb_mod.py b/mercurial/hgweb/hgweb_mod.py
>> --- a/mercurial/hgweb/hgweb_mod.py
>> +++ b/mercurial/hgweb/hgweb_mod.py
>> @@ -90,6 +90,12 @@
>>          urlel = os.path.dirname(urlel)
>>      return reversed(breadcrumb)
>> 
>> +def _processlfsbatchreq(repo, req):
>> +    """A hook for the LFS extension to wrap that handles requests to the Batch
>> +    API, and returns the appropriate JSON response.
>> +    """
>> +    raise ErrorResponse(HTTP_NOT_FOUND)
>> +
>>  class requestcontext(object):
>>      """Holds state/context for an individual request.
>> 
>> @@ -376,6 +382,21 @@
>>              except ErrorResponse as inst:
>>                  return protohandler['handleerror'](inst)
>> 
>> +        # Route LFS Batch API requests to the appropriate handler
>> +
>> +        if req.env.get(r'HTTP_USER_AGENT', r'').startswith(r'git-lfs/'):
> 
> I don't like filtering by the user-agent. It is recommended to not do this. Unless it will cause problems or tons of code complexity to properly lock out actual Git clients who may get confused by this, my vote is to remove it and just rely on path filtering.

Ok, I can ditch it.  It was more about locking out web browsers and other things not POSTing a JSON request, without changing the current behavior they see.  But I guess there are other ways to remotely figure out the minimum server version.

>> +            try:
>> +                path = req.env.get(r'PATH_INFO')
>> +                if path == '/.git/info/lfs/objects/batch':
> 
> Relying on PATH_INFO is super annoying. But our code kind of sucks :/
> 
> My only suggestion would be to define `parts` in the `else` block of the `if r'PATH_INFO' in req.env:` above and check `parts == ['.git', 'info', 'lfs', 'objects', 'batch']`. That's still a big ugly though.

OoC, what’s the point of this?  I saw it in the code above this, but it seems like string handling is easier than list-of-string handling.  This looks like an easy enough change though.

> Whatever happens, my guess is I will eventually write some patches to clean the URL parsing code up.
>  
>> +                    self.check_perm(rctx, req, None)
> 
> Shouldn't this be `pull` instead of `None`? Otherwise, LFS won't honor web.* to restrict data access.

I went back and forth, but landed on this because it doesn't seem like a pull, and still checks web.deny_read and web.allow_read.  It looks like None just doesn’t check web.allowpull.  This would be checked prior to the actual file transfer in the next step.

Is there a concept of a write-only repo?  Like a try server where you can only push stuff, but not pull all of the anonymous heads.  If not, then I think I can change it.

>> +                    return _processlfsbatchreq(rctx.repo, req)
>> +                else:
>> +                    raise ErrorResponse(HTTP_NOT_FOUND)
>> +            except ErrorResponse as inst:
>> +                req.respond(inst, 'text/plain; charset=utf-8')
>> +                # No body, since only lfs clients are allowed here
>> +                return ['']
>> +
>>          # translate user-visible url structure to internal structure
>> 
>>          args = query.split('/', 2)
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel@mercurial-scm.org
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
Matt Harbison - March 8, 2018, 2:30 p.m.
> On Mar 8, 2018, at 1:17 AM, Gregory Szorc <gregory.szorc@gmail.com> wrote:
> 
>> On Wed, Mar 7, 2018 at 8:49 PM, Matt Harbison <mharbison72@gmail.com> wrote:
>> # HG changeset patch
>> # User Matt Harbison <matt_harbison@yahoo.com>
>> # Date 1519274700 18000
>> #      Wed Feb 21 23:45:00 2018 -0500
>> # Node ID 89382cb20bb19e089513b2ce69ef8acfa1f523fd
>> # Parent  6dab3bdb1f00932a80ffa07f80ba240c3a4b48df
>> hgweb: add a hook for processing LFS Batch API requests
>> 
>> There really isn't a clean way to give LFS a crack at intercepting the requests
>> without hardcoding some LFS knowledge in the core.  The rationale for this URI
>> is that the spec for the Batch API[1] defines the URL as the LFS server url +
>> '/objects/batch'.  The default git URLs are:
>> 
>>     Git remote: https://git-server.com/foo/bar
>>     LFS server: https://git-server.com/foo/bar.git/info/lfs
>>     Batch API: https://git-server.com/foo/bar.git/info/lfs/objects/batch
>> 
>> '.git/' seems like it's not something a user would normally track.  If we adhere
>> to how git defines the URLs, then the hg-git extension should be able to talk to
>> a git based server without any additional work.
>> 
>> I'm not sure if checking the User-Agent is a good idea, but this needs a
>> specialized client, and it seems like everyone else is doing it (3d48ae1aaa5e,
>> e7bb5fc4570c).  We can always back this off if it becomes a nuisance.  A web
>> browser will see "400: bad method", the same as it would before this change.
>> 
>> I'm not sure if None or 'pull' is the proper permission check, but the only
>> difference is whether or not `hgweb.allowpull` is checked.  Since nothing of
>> particular interest is transferred here, and the next phase handles the read or
>> write, treating this like web interface request seems fine.
>> 
>> [1] https://github.com/git-lfs/git-lfs/blob/master/docs/api/batch.md
>> 
>> diff --git a/mercurial/hgweb/hgweb_mod.py b/mercurial/hgweb/hgweb_mod.py
>> --- a/mercurial/hgweb/hgweb_mod.py
>> +++ b/mercurial/hgweb/hgweb_mod.py
>> @@ -90,6 +90,12 @@
>>          urlel = os.path.dirname(urlel)
>>      return reversed(breadcrumb)
>> 
>> +def _processlfsbatchreq(repo, req):
>> +    """A hook for the LFS extension to wrap that handles requests to the Batch
>> +    API, and returns the appropriate JSON response.
>> +    """
>> +    raise ErrorResponse(HTTP_NOT_FOUND)
>> +
>>  class requestcontext(object):
>>      """Holds state/context for an individual request.
>> 
>> @@ -376,6 +382,21 @@
>>              except ErrorResponse as inst:
>>                  return protohandler['handleerror'](inst)
>> 
>> +        # Route LFS Batch API requests to the appropriate handler
>> +
>> +        if req.env.get(r'HTTP_USER_AGENT', r'').startswith(r'git-lfs/'):
> 
> I don't like filtering by the user-agent. It is recommended to not do this. Unless it will cause problems or tons of code complexity to properly lock out actual Git clients who may get confused by this, my vote is to remove it and just rely on path filtering.
>  
>> +            try:
>> +                path = req.env.get(r'PATH_INFO')
>> +                if path == '/.git/info/lfs/objects/batch':
> 
> Relying on PATH_INFO is super annoying. But our code kind of sucks :/
> 
> My only suggestion would be to define `parts` in the `else` block of the `if r'PATH_INFO' in req.env:` above and check `parts == ['.git', 'info', 'lfs', 'objects', 'batch']`. That's still a big ugly though.

Also, what’s the str.startswith() equivalent for this list, to work with the next patch?

> Whatever happens, my guess is I will eventually write some patches to clean the URL parsing code up.
>  
>> +                    self.check_perm(rctx, req, None)
> 
> Shouldn't this be `pull` instead of `None`? Otherwise, LFS won't honor web.* to restrict data access.
>  
>> +                    return _processlfsbatchreq(rctx.repo, req)
>> +                else:
>> +                    raise ErrorResponse(HTTP_NOT_FOUND)
>> +            except ErrorResponse as inst:
>> +                req.respond(inst, 'text/plain; charset=utf-8')
>> +                # No body, since only lfs clients are allowed here
>> +                return ['']
>> +
>>          # translate user-visible url structure to internal structure
>> 
>>          args = query.split('/', 2)
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel@mercurial-scm.org
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
Gregory Szorc - March 8, 2018, 6:38 p.m.
On Thu, Mar 8, 2018 at 5:06 AM, Matt Harbison <mharbison72@gmail.com> wrote:

>
> On Mar 8, 2018, at 1:17 AM, Gregory Szorc <gregory.szorc@gmail.com> wrote:
>
> On Wed, Mar 7, 2018 at 8:49 PM, Matt Harbison <mharbison72@gmail.com>
> wrote:
>
>> # HG changeset patch
>> # User Matt Harbison <matt_harbison@yahoo.com>
>> # Date 1519274700 18000
>> #      Wed Feb 21 23:45:00 2018 -0500
>> # Node ID 89382cb20bb19e089513b2ce69ef8acfa1f523fd
>> # Parent  6dab3bdb1f00932a80ffa07f80ba240c3a4b48df
>> hgweb: add a hook for processing LFS Batch API requests
>>
>> There really isn't a clean way to give LFS a crack at intercepting the
>> requests
>> without hardcoding some LFS knowledge in the core.  The rationale for
>> this URI
>> is that the spec for the Batch API[1] defines the URL as the LFS server
>> url +
>> '/objects/batch'.  The default git URLs are:
>>
>>     Git remote: https://git-server.com/foo/bar
>>     LFS server: https://git-server.com/foo/bar.git/info/lfs
>>     Batch API: https://git-server.com/foo/bar.git/info/lfs/objects/batch
>>
>> '.git/' seems like it's not something a user would normally track.  If we
>> adhere
>> to how git defines the URLs, then the hg-git extension should be able to
>> talk to
>> a git based server without any additional work.
>>
>> I'm not sure if checking the User-Agent is a good idea, but this needs a
>> specialized client, and it seems like everyone else is doing it
>> (3d48ae1aaa5e,
>> e7bb5fc4570c).  We can always back this off if it becomes a nuisance.  A
>> web
>> browser will see "400: bad method", the same as it would before this
>> change.
>>
>> I'm not sure if None or 'pull' is the proper permission check, but the
>> only
>> difference is whether or not `hgweb.allowpull` is checked.  Since nothing
>> of
>> particular interest is transferred here, and the next phase handles the
>> read or
>> write, treating this like web interface request seems fine.
>>
>> [1] https://github.com/git-lfs/git-lfs/blob/master/docs/api/batch.md
>>
>> diff --git a/mercurial/hgweb/hgweb_mod.py b/mercurial/hgweb/hgweb_mod.py
>> --- a/mercurial/hgweb/hgweb_mod.py
>> +++ b/mercurial/hgweb/hgweb_mod.py
>> @@ -90,6 +90,12 @@
>>          urlel = os.path.dirname(urlel)
>>      return reversed(breadcrumb)
>>
>> +def _processlfsbatchreq(repo, req):
>> +    """A hook for the LFS extension to wrap that handles requests to the
>> Batch
>> +    API, and returns the appropriate JSON response.
>> +    """
>> +    raise ErrorResponse(HTTP_NOT_FOUND)
>> +
>>  class requestcontext(object):
>>      """Holds state/context for an individual request.
>>
>> @@ -376,6 +382,21 @@
>>              except ErrorResponse as inst:
>>                  return protohandler['handleerror'](inst)
>>
>> +        # Route LFS Batch API requests to the appropriate handler
>> +
>> +        if req.env.get(r'HTTP_USER_AGENT', r'').startswith(r'git-lfs/'):
>>
>
> I don't like filtering by the user-agent. It is recommended to not do
> this. Unless it will cause problems or tons of code complexity to properly
> lock out actual Git clients who may get confused by this, my vote is to
> remove it and just rely on path filtering.
>
>
> Ok, I can ditch it.  It was more about locking out web browsers and other
> things not POSTing a JSON request, without changing the current behavior
> they see.  But I guess there are other ways to remotely figure out the
> minimum server version.
>
> +            try:
>> +                path = req.env.get(r'PATH_INFO')
>> +                if path == '/.git/info/lfs/objects/batch':
>>
>
> Relying on PATH_INFO is super annoying. But our code kind of sucks :/
>
> My only suggestion would be to define `parts` in the `else` block of the
> `if r'PATH_INFO' in req.env:` above and check `parts == ['.git', 'info',
> 'lfs', 'objects', 'batch']`. That's still a big ugly though.
>
>
> OoC, what’s the point of this?  I saw it in the code above this, but it
> seems like string handling is easier than list-of-string handling.  This
> looks like an easy enough change though.
>

It turns out my HTTP protocol work has spurred me to author a proper fix
for this madness. Expect some patches from me today to completely overhaul
this URL code.


>
> Whatever happens, my guess is I will eventually write some patches to
> clean the URL parsing code up.
>
>
>> +                    self.check_perm(rctx, req, None)
>>
>
> Shouldn't this be `pull` instead of `None`? Otherwise, LFS won't honor
> web.* to restrict data access.
>
>
> I went back and forth, but landed on this because it doesn't seem like a
> pull, and still checks web.deny_read and web.allow_read.  It looks like
> None just doesn’t check web.allowpull.  This would be checked prior to the
> actual file transfer in the next step.
>
> Is there a concept of a write-only repo?  Like a try server where you can
> only push stuff, but not pull all of the anonymous heads.  If not, then I
> think I can change it.
>

No there isn't. I'm not sure how that would be implemented, since commands
used for discovery (known, heads, etc) are called on both read and write
operations. The "push" and "pull" names aren't that good :/


>
> +                    return _processlfsbatchreq(rctx.repo, req)
>> +                else:
>> +                    raise ErrorResponse(HTTP_NOT_FOUND)
>> +            except ErrorResponse as inst:
>> +                req.respond(inst, 'text/plain; charset=utf-8')
>> +                # No body, since only lfs clients are allowed here
>> +                return ['']
>> +
>>          # translate user-visible url structure to internal structure
>>
>>          args = query.split('/', 2)
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel@mercurial-scm.org
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>>
>
>

Patch

diff --git a/mercurial/hgweb/hgweb_mod.py b/mercurial/hgweb/hgweb_mod.py
--- a/mercurial/hgweb/hgweb_mod.py
+++ b/mercurial/hgweb/hgweb_mod.py
@@ -90,6 +90,12 @@ 
         urlel = os.path.dirname(urlel)
     return reversed(breadcrumb)
 
+def _processlfsbatchreq(repo, req):
+    """A hook for the LFS extension to wrap that handles requests to the Batch
+    API, and returns the appropriate JSON response.
+    """
+    raise ErrorResponse(HTTP_NOT_FOUND)
+
 class requestcontext(object):
     """Holds state/context for an individual request.
 
@@ -376,6 +382,21 @@ 
             except ErrorResponse as inst:
                 return protohandler['handleerror'](inst)
 
+        # Route LFS Batch API requests to the appropriate handler
+
+        if req.env.get(r'HTTP_USER_AGENT', r'').startswith(r'git-lfs/'):
+            try:
+                path = req.env.get(r'PATH_INFO')
+                if path == '/.git/info/lfs/objects/batch':
+                    self.check_perm(rctx, req, None)
+                    return _processlfsbatchreq(rctx.repo, req)
+                else:
+                    raise ErrorResponse(HTTP_NOT_FOUND)
+            except ErrorResponse as inst:
+                req.respond(inst, 'text/plain; charset=utf-8')
+                # No body, since only lfs clients are allowed here
+                return ['']
+
         # translate user-visible url structure to internal structure
 
         args = query.split('/', 2)