Patchwork [01 of 21 V2] hgext: `speedy` - the history query accelerator

Submitter Tomasz Kleczek
Date Dec. 14, 2012, 2:52 a.m.
Message ID <136696ee699f3cfa85fd.1355453533@dev408.prn1.facebook.com>
Permalink /patch/84/
State Deferred, archived

Comments

Tomasz Kleczek - Dec. 14, 2012, 2:52 a.m.
# HG changeset patch
# User Tomasz Kleczek <tkleczek at fb.com>
# Date 1355434337 28800
# Node ID 136696ee699f3cfa85fd9e43a95f3ec9f1bc1c74
# Parent  2b79504469cd9448e0e2768acd1a057a527d30aa
hgext: `speedy` - the history query accelerator

This is the first in a series of patches that address the problem
of hg log being slow on a big repo in most cases.


Many history queries have performance linear in the number of commits,
or even worse, in the size of the manifest.

As a result, almost every hg log query with a non-trivial rev range or
with a directory specified is painfully slow on big repositories,
and its poor performance doesn't depend on the actual output size.

Here are some frequently used commands that just print a handful of
log messages:
  * hg log dir_with_few_commits
  * hg log --rev "user(MyLazyFriend)"

They may take more than 30 seconds on a big repo (with a couple
hundred thousand commits).

This extension addresses this problem by introducing a server component
that maintains indices over the history and uses them to respond to queries
in an efficient manner.
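The patch doesn't spell out what the server's indices look like; as a rough
sketch (the class and structure here are hypothetical, not from the patch), an
author index built in one pass over history lets `author()` queries run in time
proportional to their output rather than to the whole changelog:

```python
# Hypothetical sketch of a server-side author index: one linear pass
# over history at startup, then each author() query becomes a dictionary
# lookup instead of a scan over every changeset.
from collections import defaultdict

class AuthorIndex(object):
    def __init__(self, commits):
        # commits: iterable of (rev, author) pairs
        self._byauthor = defaultdict(list)
        for rev, author in commits:
            self._byauthor[author.lower()].append(rev)

    def lookup(self, substring):
        """Return revs whose author name contains substring (case-insensitive)."""
        s = substring.lower()
        revs = []
        for author, authorrevs in self._byauthor.items():
            if s in author:
                revs.extend(authorrevs)
        return sorted(revs)

index = AuthorIndex([(0, 'testuser1'), (1, 'testuser2'), (2, 'testuser1')])
print(index.lookup('testuser1'))  # [0, 2]
```

The one-time indexing cost is what makes a long-lived server process attractive
here: paying it per `hg log` invocation would defeat the purpose.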

What is going on:

* the client component forwards certain history queries to the server
  and waits for a response
* it takes into account that its history may have diverged from
  the server's, and still gives correct answers
* if the server doesn't respond fast enough or crashes, the client falls
  back to computing the answer locally using normal code path
* extension setup time is negligible and there is no overhead for queries
  that cannot be accelerated by the server
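The fallback behavior described above can be sketched roughly as follows (the
server API, exception handling, and timeout value are made up for illustration;
the actual protocol arrives in later patches of the series):

```python
# Rough sketch of the client-side fallback, with hypothetical names:
# ask the index server first, and on timeout or error recompute the
# answer locally using the normal code path.

def acceleratedquery(server, querylocally, query, timeout=1.0):
    """Try the index server; fall back to local computation on failure."""
    try:
        return server.ask(query, timeout=timeout)
    except Exception:  # timeout, crash, connection refused, ...
        return querylocally(query)

class DownServer(object):
    """Stand-in for a server that is unreachable."""
    def ask(self, query, timeout):
        raise IOError('server down')

print(acceleratedquery(DownServer(), lambda q: ['local:' + q], 'author(x)'))
# ['local:author(x)']
```

Since the local path is always available as a fallback, the server only ever
improves latency; it is never required for correctness.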

The server can be run:
* locally in the same process as client or
* remotely, using a custom protocol over http to communicate
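Based on the `uisetup` hook in this patch, which checks `ui.configbool('speedy',
'server', False)`, selecting the two roles in an hgrc would look roughly like
this (client behavior is the default when the flag is unset):

```
[extensions]
speedy =

[speedy]
# set on the process acting as the server; omit (or set to False) on clients
server = True
```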

A bunch of commands and their performance with extension enabled:

hg log somedir -l1
  -> 1.8 sec local, 0.8 sec remote

hg log --rev "author(someuser)" -l1
  -> 1.4 sec local, 0.2 sec remote

hg log --rev "date(10/1/2012)" -l1
  -> 1.7 sec local, 0.8 sec remote

hg log "relglob:**.html" -l1
  -> 2.1 sec local, 1.3 sec remote

hg log . -l1
  -> 5.7 sec local, 10.6 sec remote

All commands are run with the -l1 option so that displaying output in the
terminal doesn't affect the measurements.  Each command takes roughly
30 seconds to execute without the extension enabled.


This patch introduces support for `author` revset query. More queries
will be added in subsequent patches.
Matt Mackall - Dec. 28, 2012, 11:16 p.m.
On Thu, 2012-12-13 at 18:52 -0800, Tomasz Kleczek wrote:
> # HG changeset patch
> # User Tomasz Kleczek <tkleczek at fb.com>
> # Date 1355434337 28800
> # Node ID 136696ee699f3cfa85fd9e43a95f3ec9f1bc1c74
> # Parent  2b79504469cd9448e0e2768acd1a057a527d30aa
> hgext: `speedy` - the history query accelerator

So I've been kicking this and related ideas around for a while and my
conclusion is that this is not nearly ambitious enough, especially for a
Facebook-scale deployment.

Instead, I think you need to be thinking in terms of a service that also
saves you most of the cost of cloning and pulling... by also managing
storage!

That probably sounds like heresy: isn't the whole point of a distributed
SCM to have no central server? Well, not exactly. The point is really to
have a workflow where synchronization is optional, where working
detached is possible, and no authority is needed to commit. If done
right (ie not monolithic CVS/SVN/Perforce style), you can have a
distributed service that's faster than local disk and scales to
thousands of users.

Google's implementation of a bigtable-based hg backend for Google Code
is a decent proof-of-concept of this. But we should be able to build a
service that plugs into any key/value store (be it bigtable, hadoop,
MySql, etc.) and/or memory cache service with just a little glue.
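The "little glue" here would amount to implementing one small interface per
backend; a hypothetical sketch of the shape such a pluggable layer could take
(none of these names exist in Mercurial, and real revlog storage is more
involved than opaque blobs):

```python
# Hypothetical shape of a pluggable key/value backend for repository
# storage: Mercurial-side code would talk only to this interface, and
# each store (bigtable, hadoop, MySQL, a memory cache) supplies the glue.

class KVStore(object):
    """Minimal interface every backend would implement."""
    def get(self, key):
        raise NotImplementedError
    def put(self, key, value):
        raise NotImplementedError

class MemoryStore(KVStore):
    """In-memory backend, e.g. standing in for a memcache-style service."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value

store = MemoryStore()
store.put('node:136696ee', 'chunk-of-revlog-data')
print(store.get('node:136696ee'))  # chunk-of-revlog-data
```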

But I think we need to bang out what that looks like a bit more before
committing to including/supporting something like this implementation.
Augie Fackler - Dec. 31, 2012, 10:22 p.m.
On Dec 28, 2012, at 6:16 PM, Matt Mackall <mpm at selenic.com> wrote:

> On Thu, 2012-12-13 at 18:52 -0800, Tomasz Kleczek wrote:
>> # HG changeset patch
>> # User Tomasz Kleczek <tkleczek at fb.com>
>> # Date 1355434337 28800
>> # Node ID 136696ee699f3cfa85fd9e43a95f3ec9f1bc1c74
>> # Parent  2b79504469cd9448e0e2768acd1a057a527d30aa
>> hgext: `speedy` - the history query accelerator
> 
> So I've been kicking this and related ideas around for a while and my
> conclusion is that this is not nearly ambitious enough, especially for a
> Facebook-scale deployment.
> 
> Instead, I think you need to be thinking in terms of a service that also
> saves you most of the cost of cloning and pulling... by also managing
> storage!
> 
> That probably sounds like heresy: isn't the whole point of a distributed
> SCM to have no central server? Well, not exactly. The point is really to
> have a workflow where synchronization is optional, where working
> detached is possible, and no authority is needed to commit. If done
> right (ie not monolithic CVS/SVN/Perforce style), you can have a
> distributed service that's faster than local disk and scales to
> thousands of users.
> 
> Google's implementation of a bigtable-based hg backend for Google Code
> is a decent proof-of-concept of this. But we should be able to build a
> service that plugs into any key/value store (be it bigtable, hadoop,
> MySql, etc.) and/or memory cache service with just a little glue.

I'd be happy to talk to folks about this, and try and help with the design. Ideally, we'd come up with something that'd make the bigtable-backed design easier to support in the future.

> But I think we need to bang out what that looks like a bit more before
> committing to including/supporting something like this implementation.

I like the idea of an index server for doing revset queries - I've actually been hoping to try and build something along those lines in a generic way to have a stupidly fast history browser, but I've been lacking enough round tuits to make that happen.

I think decoupling the revset index from the actual repository storage layer feels right, but I could be convinced that's crazy.

Bryan O'Sullivan - Jan. 7, 2013, 9:04 p.m.
On Fri, Dec 28, 2012 at 3:16 PM, Matt Mackall <mpm at selenic.com> wrote:

> Instead, I think you need to be thinking in terms of a service that also
> saves you most of the cost of cloning and pulling... by also managing
> storage!
>

We actually are thinking about that, but not yet beyond the level of
"wishfully".

Patch

diff --git a/hgext/speedy/__init__.py b/hgext/speedy/__init__.py
new file mode 100644
--- /dev/null
+++ b/hgext/speedy/__init__.py
@@ -0,0 +1,10 @@ 
+# Copyright 2012 Facebook
+#
+# This software may be used and distributed according to the terms of the
+# GNU General Public License version 2 or any later version.
+
+import client
+
+def uisetup(ui):
+    if not ui.configbool('speedy', 'server', False):
+        client.uisetup(ui)
diff --git a/hgext/speedy/client.py b/hgext/speedy/client.py
new file mode 100644
--- /dev/null
+++ b/hgext/speedy/client.py
@@ -0,0 +1,33 @@ 
+# Copyright 2012 Facebook
+#
+# This software may be used and distributed according to the terms of the
+# GNU General Public License version 2 or any later version.
+
+from mercurial import extensions, commands
+from mercurial import revset
+
+def patchedauthor(repo, subset, x):
+    """Return the revisions commited by user whose name match x
+
+    Used to monkey patch revset.author function.
+    """
+    # In the subsequent patches here we are going to forward the query
+    # to the server
+    return revset.author(repo, subset, x)
+
+def _speedysetup(ui, repo):
+    """Initialize speedy client."""
+    revset.symbols['author'] = patchedauthor
+
+def uisetup(ui):
+    # Perform patching and most of the initialization inside log wrapper,
+    # as this is only needed if log command is being used
+    initialized = [False]
+    def logwrapper(cmd, *args, **kwargs):
+        repo = args[1]
+        if not initialized[0]:
+            initialized[0] = True
+            _speedysetup(ui, repo)
+        return cmd(*args, **kwargs)
+
+    extensions.wrapcommand(commands.table, 'log', logwrapper)
diff --git a/tests/test-speedy.t b/tests/test-speedy.t
new file mode 100644
--- /dev/null
+++ b/tests/test-speedy.t
@@ -0,0 +1,44 @@ 
+Global config file
+  $ cat >> $HGRCPATH <<EOF_END
+  > [ui]
+  > logtemplate = "{desc}\n"
+  > 
+  > [extensions]
+  > speedy=
+  > EOF_END
+
+Preparing local repo
+
+  $ hg init localrepo
+  $ cd localrepo
+
+  $ mkdir d1
+  $ echo chg0 > d1/chg0
+  $ hg commit -Am chg0 -u testuser1
+  adding d1/chg0
+  $ echo chg1 > d1/chg1
+  $ hg commit -Am chg1 -u testuser2 --date "10/20/2012"
+  adding d1/chg1
+  $ echo chg2 > d1/chg2
+  $ hg commit -Am chg2 -u testuser1
+  adding d1/chg2
+  $ mkdir d2
+  $ echo chg3 > d2/chg3.py
+  $ hg commit -Am chg3 -u testuser1
+  adding d2/chg3.py
+  $ echo chg4 > d2/chg4
+  $ hg commit -Am chg4 -u testuser1
+  adding d2/chg4
+  $ echo chg5 > chg5.py
+  $ hg commit -Am chg5 -u testuser1 --date "10/20/2012"
+  adding chg5.py
+
+  $ hg log -r "reverse(user(testuser1))"
+  chg5
+  chg4
+  chg3
+  chg2
+  chg0
+
+  $ hg log -r "author(2)"
+  chg1