Patchwork [4,of,6] hgweb: introduce staticimmutable web command

login
register
mail settings
Submitter Gregory Szorc
Date April 1, 2017, 7:29 a.m.
Message ID <0b8be3d244585f5a2874.1491031748@ubuntu-vm-main>
Download mbox | patch
Permalink /patch/19890/
State Changes Requested
Headers show

Comments

Gregory Szorc - April 1, 2017, 7:29 a.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1491021501 25200
#      Fri Mar 31 21:38:21 2017 -0700
# Node ID 0b8be3d244585f5a2874821418fce41bf7631f6c
# Parent  4cec8e88d09775ee6478e307e9dde94af5b9fcfd
hgweb: introduce staticimmutable web command

Currently, hgweb serves HTTP responses with an Etag header
whose value is "W/<mtime>" where <mtime> is the modification time
of the repo being accessed. The "W" means it is a "Weak validator" -
something that is an approximation of content but not an identifier
of unique content. Assuming the client is smart and can cache
responses (like web browsers), a subsequent HTTP request for this
URL will send an If-None-Match request header with the value from
the previous Etag response header. hgweb will compare the INM
header against what it would serve for Etag and if they match, the
server fast issues an HTTP 304 Not Modified to prevent serving
data to the client.

This is a cache hit and is better than no cache hit. But there is
overhead from the client sending the HTTP request and waiting for
the 304 response. Furthermore, the "weak validator" isn't precise
and can lead to weirdness. For example, https://hg.mozilla.org/
consists of a pool of servers behind a load balancer. Each server
has an independent clone of repositories (there is no shared
filesystem). Therefore, the mtimes on files within repositories
may be different. This can result in the Etag value differing
between servers. This can result in HTTP 304 not being issued
and an HTTP 200 being served instead. And this translates to longer
page loads. This isn't a theoretical problem: HTTP requests for
static assets against hg.mozilla.org result in HTTP 200 instead of
HTTP 304 most of the time.

There are more effective ways to perform content caching with HTTP.

One trick is to use "immutable" URLs. Essentially, the content at
a URL is constant for the lifetime of the URL. When this URL is
served, the HTTP response basically says "cache me forever." There's
no conditional HTTP request on next access: if the client has that
URL cached, it just uses it no questions asked. This avoids the
network round trip and can result in drastically faster page loads.

This commit introduces a new web command for processing requests
for "immutable" static assets. Where the URL path pattern for a
"static" web command is simply <path>, the URL path pattern for a
"staticimmutable" web command is "<hash>/<path>" where <hash> is
derived from content. This makes <hash> a "strong" validator in the
HTTP RFC sense and allows us to leverage aggressive caching
techniques. Strictly speaking, the order in the URL violates
REST best practices: ideally <hash> would come after <path>.
However, a lot of tools (including browser devtools) use the
final path component to label a resource. A hash is not a useful
label. So, we invert the order so the final URL path component
is friendly. This makes no difference to caching behavior. But it
will upset REST zealots in their ivory towers. To them, I saw
practically wins over purity.

The behavior of "staticimmutable" after this commit is basically
the same as "static." Future commits will implement the content
hashing and caching features.
Matt Harbison - April 4, 2017, 1:13 a.m.
On Sat, 01 Apr 2017 03:29:08 -0400, Gregory Szorc  
<gregory.szorc@gmail.com> wrote:

> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1491021501 25200
> #      Fri Mar 31 21:38:21 2017 -0700
> # Node ID 0b8be3d244585f5a2874821418fce41bf7631f6c
> # Parent  4cec8e88d09775ee6478e307e9dde94af5b9fcfd
> hgweb: introduce staticimmutable web command

Should these caching related patches be sensitive to '--config  
web.cache=False'?  I can't imagine why you wouldn't want to cache in a  
production environment.  It has been useful for development testing, but  
it's not documented as experimental/developer only.
Gregory Szorc - April 4, 2017, 1:25 a.m.
On Mon, Apr 3, 2017 at 6:13 PM, Matt Harbison <mharbison72@gmail.com> wrote:

> On Sat, 01 Apr 2017 03:29:08 -0400, Gregory Szorc <gregory.szorc@gmail.com>
> wrote:
>
> # HG changeset patch
>> # User Gregory Szorc <gregory.szorc@gmail.com>
>> # Date 1491021501 25200
>> #      Fri Mar 31 21:38:21 2017 -0700
>> # Node ID 0b8be3d244585f5a2874821418fce41bf7631f6c
>> # Parent  4cec8e88d09775ee6478e307e9dde94af5b9fcfd
>> hgweb: introduce staticimmutable web command
>>
>
> Should these caching related patches be sensitive to '--config
> web.cache=False'?  I can't imagine why you wouldn't want to cache in a
> production environment.  It has been useful for development testing, but
> it's not documented as experimental/developer only.
>

Good question.

Since immutable content is, well, immutable, caching is like an intrinsic
property that follows from being immutable. So I can make the argument that
Cache-Control for those URLs should always be emitted.

What we could do is have the templating layer respect web.cache. e.g. if
web.cache=false, the template filter for resolving the URL/path would emit
/static/<path> instead of /staticimmutable/...

Also, with a little work, we could likely improve the web caching to the
point it is useful in a dev environment too. The caching strategy today
uses the repo mtime for cache invalidation. Since template files are
independent of repositories, this caching strategy is just plain wrong. We
need to take the actual template source files into consideration when
checking cache validity. We're lucky more people aren't complaining about
this bug. Come to think of it, this is likely responsible for some weird
behavior I've seen on hg.mozilla.org over the years! So maybe I'll just fix
it...
Yuya Nishihara - April 5, 2017, 3:33 p.m.
On Sat, 01 Apr 2017 00:29:08 -0700, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1491021501 25200
> #      Fri Mar 31 21:38:21 2017 -0700
> # Node ID 0b8be3d244585f5a2874821418fce41bf7631f6c
> # Parent  4cec8e88d09775ee6478e307e9dde94af5b9fcfd
> hgweb: introduce staticimmutable web command
> 
> Currently, hgweb serves HTTP responses with an Etag header
> whose value is "W/<mtime>" where <mtime> is the modification time
> of the repo being accessed. The "W" means it is a "Weak validator" -
> something that is an approximation of content but not an identifier
> of unique content. Assuming the client is smart and can cache
> responses (like web browsers), a subsequent HTTP request for this
> URL will send an If-None-Match request header with the value from
> the previous Etag response header. hgweb will compare the INM
> header against what it would serve for Etag and if they match, the
> server fast issues an HTTP 304 Not Modified to prevent serving
> data to the client.
> 
> This is a cache hit and is better than no cache hit. But there is
> overhead from the client sending the HTTP request and waiting for
> the 304 response. Furthermore, the "weak validator" isn't precise
> and can lead to weirdness. For example, https://hg.mozilla.org/
> consists of a pool of servers behind a load balancer. Each server
> has an independent clone of repositories (there is no shared
> filesystem). Therefore, the mtimes on files within repositories
> may be different. This can result in the Etag value differing
> between servers. This can result in HTTP 304 not being issued
> and an HTTP 200 being served instead. And this translates to longer
> page loads. This isn't a theoretical problem: HTTP requests for
> static assets against hg.mozilla.org result in HTTP 200 instead of
> HTTP 304 most of the time.
> 
> There are more effective ways to perform content caching with HTTP.
> 
> One trick is to use "immutable" URLs. Essentially, the content at
> a URL is constant for the lifetime of the URL. When this URL is
> served, the HTTP response basically says "cache me forever." There's
> no conditional HTTP request on next access: if the client has that
> URL cached, it just uses it no questions asked. This avoids the
> network round trip and can result in drastically faster page loads.
> 
> This commit introduces a new web command for processing requests
> for "immutable" static assets. Where the URL path pattern for a
> "static" web command is simply <path>, the URL path pattern for a
> "staticimmutable" web command is "<hash>/<path>" where <hash> is
> derived from content. This makes <hash> a "strong" validator in the
> HTTP RFC sense and allows us to leverage aggressive caching
> techniques. Strictly speaking, the order in the URL violates
> REST best practices: ideally <hash> would come after <path>.
> However, a lot of tools (including browser devtools) use the
> final path component to label a resource. A hash is not a useful
> label. So, we invert the order so the final URL path component
> is friendly. This makes no difference to caching behavior. But it
> will upset REST zealots in their ivory towers. To them, I saw
> practically wins over purity.

If I understand your proposal, a common trick is to append "?<hash>" to
the URL so we can still offload static contents to the frontend server.

Can't we compute the <hash> from Mercurial version (and maybe some config
values)?
Yuya Nishihara - April 5, 2017, 3:43 p.m.
On Thu, 6 Apr 2017 00:33:23 +0900, Yuya Nishihara wrote:
> On Sat, 01 Apr 2017 00:29:08 -0700, Gregory Szorc wrote:
> > # HG changeset patch
> > # User Gregory Szorc <gregory.szorc@gmail.com>
> > # Date 1491021501 25200
> > #      Fri Mar 31 21:38:21 2017 -0700
> > # Node ID 0b8be3d244585f5a2874821418fce41bf7631f6c
> > # Parent  4cec8e88d09775ee6478e307e9dde94af5b9fcfd
> > hgweb: introduce staticimmutable web command

> > One trick is to use "immutable" URLs. Essentially, the content at
> > a URL is constant for the lifetime of the URL. When this URL is
> > served, the HTTP response basically says "cache me forever." There's
> > no conditional HTTP request on next access: if the client has that
> > URL cached, it just uses it no questions asked. This avoids the
> > network round trip and can result in drastically faster page loads.
> > 
> > This commit introduces a new web command for processing requests
> > for "immutable" static assets. Where the URL path pattern for a
> > "static" web command is simply <path>, the URL path pattern for a
> > "staticimmutable" web command is "<hash>/<path>" where <hash> is
> > derived from content. This makes <hash> a "strong" validator in the
> > HTTP RFC sense and allows us to leverage aggressive caching
> > techniques. Strictly speaking, the order in the URL violates
> > REST best practices: ideally <hash> would come after <path>.
> > However, a lot of tools (including browser devtools) use the
> > final path component to label a resource. A hash is not a useful
> > label. So, we invert the order so the final URL path component
> > is friendly. This makes no difference to caching behavior. But it
> > will upset REST zealots in their ivory towers. To them, I saw
> > practically wins over purity.
> 
> If I understand your proposal, a common trick is to append "?<hash>" to
> the URL so we can still offload static contents to the frontend server.

Never mind. /<hash>/<path> can be mapped by regexp location, and there might
be a browser that ignores cache-control headers if ?query string exists.

> Can't we compute the <hash> from Mercurial version (and maybe some config
> values)?
Gregory Szorc - April 13, 2017, 1:50 a.m.
On Wed, Apr 5, 2017 at 8:43 AM, Yuya Nishihara <yuya@tcha.org> wrote:

> On Thu, 6 Apr 2017 00:33:23 +0900, Yuya Nishihara wrote:
> > On Sat, 01 Apr 2017 00:29:08 -0700, Gregory Szorc wrote:
> > > # HG changeset patch
> > > # User Gregory Szorc <gregory.szorc@gmail.com>
> > > # Date 1491021501 25200
> > > #      Fri Mar 31 21:38:21 2017 -0700
> > > # Node ID 0b8be3d244585f5a2874821418fce41bf7631f6c
> > > # Parent  4cec8e88d09775ee6478e307e9dde94af5b9fcfd
> > > hgweb: introduce staticimmutable web command
>
> > > One trick is to use "immutable" URLs. Essentially, the content at
> > > a URL is constant for the lifetime of the URL. When this URL is
> > > served, the HTTP response basically says "cache me forever." There's
> > > no conditional HTTP request on next access: if the client has that
> > > URL cached, it just uses it no questions asked. This avoids the
> > > network round trip and can result in drastically faster page loads.
> > >
> > > This commit introduces a new web command for processing requests
> > > for "immutable" static assets. Where the URL path pattern for a
> > > "static" web command is simply <path>, the URL path pattern for a
> > > "staticimmutable" web command is "<hash>/<path>" where <hash> is
> > > derived from content. This makes <hash> a "strong" validator in the
> > > HTTP RFC sense and allows us to leverage aggressive caching
> > > techniques. Strictly speaking, the order in the URL violates
> > > REST best practices: ideally <hash> would come after <path>.
> > > However, a lot of tools (including browser devtools) use the
> > > final path component to label a resource. A hash is not a useful
> > > label. So, we invert the order so the final URL path component
> > > is friendly. This makes no difference to caching behavior. But it
> > > will upset REST zealots in their ivory towers. To them, I saw
> > > practically wins over purity.
> >
> > If I understand your proposal, a common trick is to append "?<hash>" to
> > the URL so we can still offload static contents to the frontend server.
>
> Never mind. /<hash>/<path> can be mapped by regexp location, and there
> might
> be a browser that ignores cache-control headers if ?query string exists.
>

I think browsers are better about this these days. But it used to be common
for caching to be disabled on URLs with query strings because servers and
clients were so buggy. Never mind that the HTTP specification says the
query string is part of the URL and therefore part of the cache key. Things
do get a bit ambiguous when the order of arguments in the query string
varies. Is that the same entity or not? These days, I try to not use query
strings so these uncertainties don't come into play.


> > Can't we compute the <hash> from Mercurial version (and maybe some config
> > values)?
>

Patch

diff --git a/mercurial/hgweb/common.py b/mercurial/hgweb/common.py
--- a/mercurial/hgweb/common.py
+++ b/mercurial/hgweb/common.py
@@ -10,6 +10,7 @@  from __future__ import absolute_import
 
 import base64
 import errno
+import hashlib
 import mimetypes
 import os
 import uuid
@@ -146,7 +147,7 @@  def ispathsafe(path):
 
     return True
 
-def staticfile(directory, fname, req):
+def staticfile(directory, fname, req, immutablehash=None):
     """return a file inside directory with guessed Content-Type header
 
     fname always uses '/' as directory separator and isn't allowed to
@@ -171,6 +172,15 @@  def staticfile(directory, fname, req):
         with open(path, 'rb') as fh:
             data = fh.read()
 
+        # Immutable request but the hash does not match found content.
+        # This should not happen. In the rare case it does, 404 the request
+        # because the requested content does not match what we have.
+        if immutablehash:
+            hash = hashlib.sha1(data).hexdigest()
+
+            if hash != immutablehash:
+                raise ErrorResponse(HTTP_NOT_FOUND)
+
         req.respond(HTTP_OK, ct, body=data)
     except TypeError:
         raise ErrorResponse(HTTP_SERVER_ERROR, 'illegal filename')
diff --git a/mercurial/hgweb/hgweb_mod.py b/mercurial/hgweb/hgweb_mod.py
--- a/mercurial/hgweb/hgweb_mod.py
+++ b/mercurial/hgweb/hgweb_mod.py
@@ -389,7 +389,7 @@  class hgweb(object):
             if util.safehasattr(webcommands, cmd):
                 req.form['cmd'] = [cmd]
 
-            if cmd == 'static':
+            if cmd in ('static', 'staticimmutable'):
                 req.form['file'] = ['/'.join(args)]
             else:
                 if args and args[0]:
@@ -418,7 +418,7 @@  class hgweb(object):
             ctype = templater.stringify(ctype)
 
             # check read permissions non-static content
-            if cmd != 'static':
+            if cmd not in ('static', 'staticimmutable'):
                 self.check_perm(rctx, req, None)
 
             if cmd == '':
diff --git a/mercurial/hgweb/webcommands.py b/mercurial/hgweb/webcommands.py
--- a/mercurial/hgweb/webcommands.py
+++ b/mercurial/hgweb/webcommands.py
@@ -1139,6 +1139,23 @@  def static(web, req, tmpl):
     staticfile(static, fname, req)
     return []
 
+@webcommand('staticimmutable')
+def staticimmutable(web, req, tmpl):
+    parts = req.form['file'][0].split('/', 1)
+
+    # Requested path is <path>/<hash>
+    if len(parts) != 2:
+        raise ErrorResponse(HTTP_NOT_FOUND)
+
+    hash, fname = parts
+
+    tp = web.templatepath or templater.templatepaths()
+    if isinstance(tp, str):
+        tp = [tp]
+    static = [os.path.join(p, 'static') for p in tp]
+    staticfile(static, fname, req, immutablehash=hash)
+    return []
+
 @webcommand('graph')
 def graph(web, req, tmpl):
     """
diff --git a/tests/test-hgweb.t b/tests/test-hgweb.t
--- a/tests/test-hgweb.t
+++ b/tests/test-hgweb.t
@@ -106,6 +106,136 @@  should give a 404 - static file that doe
   
   [1]
 
+staticimmutable without all path components should give 404
+
+  $ get-with-headers.py localhost:$HGPORT 'staticimmutable/bogus'
+  404 Not Found
+  
+  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
+  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">
+  <head>
+  <link rel="icon" href="/static/hgicon.png" type="image/png" />
+  <meta name="robots" content="index, nofollow" />
+  <link rel="stylesheet" href="/static/style-paper.css" type="text/css" />
+  <script type="text/javascript" src="/static/mercurial.js"></script>
+  
+  <title>test: error</title>
+  </head>
+  <body>
+  
+  <div class="container">
+  <div class="menu">
+  <div class="logo">
+  <a href="https://mercurial-scm.org/">
+  <img src="/static/hglogo.png" width=75 height=90 border=0 alt="mercurial" /></a>
+  </div>
+  <ul>
+  <li><a href="/shortlog">log</a></li>
+  <li><a href="/graph">graph</a></li>
+  <li><a href="/tags">tags</a></li>
+  <li><a href="/bookmarks">bookmarks</a></li>
+  <li><a href="/branches">branches</a></li>
+  </ul>
+  <ul>
+  <li><a href="/help">help</a></li>
+  </ul>
+  </div>
+  
+  <div class="main">
+  
+  <h2 class="breadcrumb"><a href="/">Mercurial</a> </h2>
+  <h3>error</h3>
+  
+  <form class="search" action="/log">
+  
+  <p><input name="rev" id="search1" type="text" size="30"></p>
+  <div id="hint">Find changesets by keywords (author, files, the commit message), revision
+  number or hash, or <a href="/help/revsets">revset expression</a>.</div>
+  </form>
+  
+  <div class="description">
+  <p>
+  An error occurred while processing your request:
+  </p>
+  <p>
+  Not Found
+  </p>
+  </div>
+  </div>
+  </div>
+  
+  
+  
+  </body>
+  </html>
+  
+  [1]
+
+staticimmutable with a bogus hash should give 404
+
+  $ get-with-headers.py localhost:$HGPORT 'staticimmutable/badhash/mercurial.js'
+  404 Not Found
+  
+  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
+  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">
+  <head>
+  <link rel="icon" href="/static/hgicon.png" type="image/png" />
+  <meta name="robots" content="index, nofollow" />
+  <link rel="stylesheet" href="/static/style-paper.css" type="text/css" />
+  <script type="text/javascript" src="/static/mercurial.js"></script>
+  
+  <title>test: error</title>
+  </head>
+  <body>
+  
+  <div class="container">
+  <div class="menu">
+  <div class="logo">
+  <a href="https://mercurial-scm.org/">
+  <img src="/static/hglogo.png" width=75 height=90 border=0 alt="mercurial" /></a>
+  </div>
+  <ul>
+  <li><a href="/shortlog">log</a></li>
+  <li><a href="/graph">graph</a></li>
+  <li><a href="/tags">tags</a></li>
+  <li><a href="/bookmarks">bookmarks</a></li>
+  <li><a href="/branches">branches</a></li>
+  </ul>
+  <ul>
+  <li><a href="/help">help</a></li>
+  </ul>
+  </div>
+  
+  <div class="main">
+  
+  <h2 class="breadcrumb"><a href="/">Mercurial</a> </h2>
+  <h3>error</h3>
+  
+  <form class="search" action="/log">
+  
+  <p><input name="rev" id="search1" type="text" size="30"></p>
+  <div id="hint">Find changesets by keywords (author, files, the commit message), revision
+  number or hash, or <a href="/help/revsets">revset expression</a>.</div>
+  </form>
+  
+  <div class="description">
+  <p>
+  An error occurred while processing your request:
+  </p>
+  <p>
+  Not Found
+  </p>
+  </div>
+  </div>
+  </div>
+  
+  
+  
+  </body>
+  </html>
+  
+  [1]
+
 should give a 404 - bad revision
 
   $ get-with-headers.py localhost:$HGPORT 'file/spam/foo?style=raw'
@@ -331,7 +461,7 @@  stop and restart
 Test the access/error files are opened in append mode
 
   $ $PYTHON -c "print len(file('access.log').readlines()), 'log lines written'"
-  14 log lines written
+  16 log lines written
 
 static file