Patchwork [01,of,15] hgext: `speedy` - the history query accelerator

Submitter Tomasz Kleczek
Date Dec. 11, 2012, 6:38 p.m.
Message ID <13c6bcb8dd900dc7dbf5.1355251096@dev408.prn1.facebook.com>
Permalink /patch/49/
State Superseded

Comments

Tomasz Kleczek - Dec. 11, 2012, 6:38 p.m.
# HG changeset patch
# User Tomasz Kleczek <tkleczek at fb.com>
# Date 1355250659 28800
# Branch stable
# Node ID 13c6bcb8dd900dc7dbf5e3da9ef68d56fed250b3
# Parent  8973e7dd92d5afdeb82e91d5b66934d53a74e8da
hgext: `speedy` - the history query accelerator

This is the first in a series of patches that address the problem
of hg log being slow on a big repo in most cases.


Many history queries have performance linear in the number of commits
or, even worse, in the size of the manifest.

As a result, almost every hg log query with a non-trivial rev range or
with a directory specified is painfully slow on big repositories,
and its poor performance doesn't depend on the actual output size.

Here are some frequently used commands that just print a handful of
log messages:
  * hg log dir_with_few_commits
  * hg log --rev "user(MyLazyFriend)"

They may take more than 30 secs on a big repo (with a couple hundred
thousand commits).

This extension addresses this problem by introducing a server component
that maintains indices over the history and uses them to respond to queries
in an efficient manner.
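
As a rough illustration of what such an index could look like (this is not
code from the series; the AuthorIndex class and its methods are hypothetical
names), a mapping from author to revision numbers lets an author query cost
roughly the number of distinct authors plus the number of matches, instead of
a scan over every commit. The sketch assumes a case-insensitive substring
match on the author name:

```python
from collections import defaultdict


class AuthorIndex:
    """Hypothetical in-memory index from author name to revision numbers."""

    def __init__(self):
        self._revs_by_author = defaultdict(list)

    def add(self, rev, author):
        # Record one commit; called once per revision when building the index.
        self._revs_by_author[author].append(rev)

    def query(self, needle):
        """Return sorted revs whose author contains `needle`, ignoring case."""
        needle = needle.lower()
        revs = []
        for author, author_revs in self._revs_by_author.items():
            if needle in author.lower():
                revs.extend(author_revs)
        return sorted(revs)


# Build the index once, then answer queries without walking the history.
index = AuthorIndex()
for rev, author in enumerate(["testuser1", "testuser2", "testuser1", "testuser1"]):
    index.add(rev, author)

matches = index.query("testuser1")
```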

What is going on:

* the client component forwards certain history queries to the server
  and waits for a response
* it takes into account that its history may have diverged from
  the server's, and still gives correct answers
* if the server doesn't respond fast enough or crashes, the client falls
  back to computing the answer locally using the normal code path
* extension setup time is negligible and there is no overhead for queries
  that cannot be accelerated by the server
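
The fallback behaviour in the list above can be sketched roughly as follows.
This is only an illustration under assumptions: query_with_fallback and the
stand-in server and local functions are hypothetical, the remote call is
assumed to raise an OSError (which includes TimeoutError in Python 3) when the
server is unreachable or too slow, and handling of diverged history is not
covered here.

```python
def query_with_fallback(remote_query, local_query, expr):
    """Ask the server first; compute locally if the server fails."""
    try:
        return remote_query(expr)
    except OSError:
        # Connection failures and timeouts end up here, and the client
        # silently takes the normal (slow) local code path instead.
        return local_query(expr)


def local_author_query(expr):
    # Stand-in for the unaccelerated local computation.
    return [0, 2, 3]


def broken_server(expr):
    # Stand-in for a server that does not answer in time.
    raise TimeoutError("server did not answer in time")


def healthy_server(expr):
    # Stand-in for a server that answers from its index.
    return [0, 2, 3]


from_fallback = query_with_fallback(broken_server, local_author_query, "author(x)")
from_server = query_with_fallback(healthy_server, local_author_query, "author(x)")
```

Either way the caller sees the same answer; only the latency differs.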

The server can be run:
* locally, in the same process as the client, or
* remotely, using a custom protocol over http to communicate

Sample performance gains (while running a remote history server).
All commands are run with the -l1 option so that displaying output in the
terminal doesn't affect the measurements.

Each command takes roughly 30 secs without the extension enabled.

  hg log somedir -l1
    -> time reduced to 2.4 sec

  hg log --rev "author(someuser)" -l1
    -> time reduced to 1.6 sec

  hg log --rev "date(10/1/2012)" -l1
    -> time reduced to 1.7 sec

  hg log "relglob:**.html" -l1
    -> time reduced to 2.4 sec

  hg log . -l1
    -> time reduced to 11.9 sec

This patch introduces support for `author` revset query. More queries
will be added in subsequent patches.
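
For readers unfamiliar with the pattern, here is a self-contained sketch of
the two ideas the patch combines: replacing an entry in a symbol table with a
wrapped version, and deferring that replacement until the wrapped command
first runs. It does not import Mercurial; the symbols dict, log function and
helper names below are stand-ins invented for the illustration.

```python
# Stand-in for a revset-style symbol table: name -> predicate function.
symbols = {"author": lambda subset, name: [r for r, a in subset if a == name]}
calls = []


def patched_author(original):
    def wrapper(subset, name):
        calls.append(name)  # a real client would consult the server here
        return original(subset, name)
    return wrapper


def setup():
    # Monkey patch the table entry, keeping the original as the fallback.
    symbols["author"] = patched_author(symbols["author"])


def make_lazy_wrapper(setup_func):
    """Wrap a command so that `setup_func` runs once, on first invocation."""
    initialized = [False]

    def wrapper(orig, *args, **kwargs):
        if not initialized[0]:
            initialized[0] = True
            setup_func()
        # Pass the command's return value back to the caller.
        return orig(*args, **kwargs)

    return wrapper


def log(subset, name):
    # Stand-in for the log command: looks the predicate up at call time.
    return symbols["author"](subset, name)


lazy_log = make_lazy_wrapper(setup)
history = [(0, "testuser1"), (1, "testuser2"), (2, "testuser1")]
first = lazy_log(log, history, "testuser2")
second = lazy_log(log, history, "testuser1")
```

Commands other than the wrapped one never trigger setup, which is why the
extension adds no cost to them.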
Tomasz Kleczek - Dec. 11, 2012, 11:28 p.m.
I accidentally based my patches on stable, I'll fix it in V2.


Idan Kamara - Dec. 12, 2012, 9:51 p.m.
On Tue, Dec 11, 2012 at 8:38 PM, Tomasz Kleczek <tkleczek at fb.com> wrote:
>
> # HG changeset patch
> # User Tomasz Kleczek <tkleczek at fb.com>
> # Date 1355250659 28800
> # Branch stable
> # Node ID 13c6bcb8dd900dc7dbf5e3da9ef68d56fed250b3
> # Parent  8973e7dd92d5afdeb82e91d5b66934d53a74e8da
> hgext: `speedy` - the history query accelerator

Sounds like this thing will be useful to maybe 5% of Mercurial users,
any special reason this needs to go into the main repo? There's enough
going on here to make this an independent project imo.

I haven't looked at what's going on inside but you seem to have put a
lot of effort into this, nice stuff!
Bryan O'Sullivan - Dec. 12, 2012, 9:54 p.m.
On Wed, Dec 12, 2012 at 1:51 PM, Idan Kamara <idankk86 at gmail.com> wrote:

> Sounds like this thing will be useful to maybe 5% of Mercurial users,
> any special reason this needs to go into the main repo?
>

Because that makes it easiest both to use and to improve? There are plenty
of large Mercurial users that can benefit from this, and making it an
out-of-tree extension guarantees that a much smaller proportion of them
ever will.
Augie Fackler - Dec. 13, 2012, 3:18 a.m.
On Dec 11, 2012, at 12:38 PM, Tomasz Kleczek <tkleczek at fb.com> wrote:

> # HG changeset patch
> # User Tomasz Kleczek <tkleczek at fb.com>
> # Date 1355250659 28800
> # Branch stable
> # Node ID 13c6bcb8dd900dc7dbf5e3da9ef68d56fed250b3
> # Parent  8973e7dd92d5afdeb82e91d5b66934d53a74e8da
> hgext: `speedy` - the history query accelerator

I think I'd name it something more descriptive than 'speedy'. What does it make fast? inotify is 'status'. I'd call it 'fastlog' or something. I don't know that fastlog is /good/, but I'm definitely -1 on the name speedy, as it's just too generic.

That's all I have to say on this patch - I'll keep looking at the rest of the series.
Tomasz Kleczek - Dec. 13, 2012, 7:15 a.m.
+1 for changing the name. I like fastlog; if there are no better suggestions,
I'll use this one.


Pierre-Yves David - Dec. 13, 2012, 8:07 p.m.
On Tue, Dec 11, 2012 at 10:38:16AM -0800, Tomasz Kleczek wrote:
> The server can be run:
> * locally in the same process as client or

Could you include some numbers about this local version?

It sounds like it could have interesting applications within graphical history
viewers like TortoiseHG or hgview. Those are long-lived processes that can absorb
the setup time and benefit from the speedup.
Tomasz Kleczek - Dec. 13, 2012, 9:23 p.m.
Here are the updated results for the local/remote mode  (I am also going
to include them in the commit message).

  hg log somedir -l1
    -> 1.8 sec local, 0.8 sec remote

  hg log --rev "author(someuser)" -l1
    -> 1.4 sec local, 0.2 sec remote

  hg log --rev "date(10/1/2012)" -l1
    -> 1.7 sec local, 0.8 sec remote

  hg log "relglob:**.html" -l1
    -> 2.1 sec local, 1.3 sec remote

  hg log . -l1
    -> 5.7 sec local, 10.6 sec remote



Patch

diff --git a/hgext/speedy/__init__.py b/hgext/speedy/__init__.py
new file mode 100644
--- /dev/null
+++ b/hgext/speedy/__init__.py
@@ -0,0 +1,10 @@ 
+# Copyright 2012 Facebook
+#
+# This software may be used and distributed according to the terms of the
+# GNU General Public License version 2 or any later version.
+
+import client
+
+def uisetup(ui):
+    if ui.configbool('speedy', 'client', False):
+        client.uisetup(ui)
diff --git a/hgext/speedy/client.py b/hgext/speedy/client.py
new file mode 100644
--- /dev/null
+++ b/hgext/speedy/client.py
@@ -0,0 +1,33 @@ 
+# Copyright 2012 Facebook
+#
+# This software may be used and distributed according to the terms of the
+# GNU General Public License version 2 or any later version.
+
+from mercurial import extensions, commands
+from mercurial import revset
+
+def patchedauthor(repo, subset, x):
+    """Return the revisions committed by users whose name matches x
+
+    Used to monkey patch revset.author function.
+    """
+    # In the subsequent patches here we are going to forward the query
+    # to the server
+    return revset.author(repo, subset, x)
+
+def _speedysetup(ui, repo):
+    """Initialize speedy client."""
+    revset.symbols['author'] = patchedauthor
+
+def uisetup(ui):
+    # Perform patching and most of the initialization inside log wrapper,
+    # as this is only needed if log command is being used
+    initialized = [False]
+    def logwrapper(cmd, *args, **kwargs):
+        repo = args[1]
+        if not initialized[0]:
+            initialized[0] = True
+            _speedysetup(ui, repo)
+        return cmd(*args, **kwargs)
+
+    extensions.wrapcommand(commands.table, 'log', logwrapper)
diff --git a/tests/test-speedy.t b/tests/test-speedy.t
new file mode 100644
--- /dev/null
+++ b/tests/test-speedy.t
@@ -0,0 +1,44 @@ 
+Global config file
+  $ cat >> $HGRCPATH <<EOF_END
+  > [ui]
+  > logtemplate = "{desc}\n"
+  > 
+  > [extensions]
+  > speedy=
+  > EOF_END
+
+Preparing local repo
+
+  $ hg init localrepo
+  $ cd localrepo
+
+  $ mkdir d1
+  $ echo chg0 > d1/chg0
+  $ hg commit -Am chg0 -u testuser1
+  adding d1/chg0
+  $ echo chg1 > d1/chg1
+  $ hg commit -Am chg1 -u testuser2 --date "10/20/2012"
+  adding d1/chg1
+  $ echo chg2 > d1/chg2
+  $ hg commit -Am chg2 -u testuser1
+  adding d1/chg2
+  $ mkdir d2
+  $ echo chg3 > d2/chg3.py
+  $ hg commit -Am chg3 -u testuser1
+  adding d2/chg3.py
+  $ echo chg4 > d2/chg4
+  $ hg commit -Am chg4 -u testuser1
+  adding d2/chg4
+  $ echo chg5 > chg5.py
+  $ hg commit -Am chg5 -u testuser1 --date "10/20/2012"
+  adding chg5.py
+
+  $ hg log -r "reverse(user(testuser1))"
+  chg5
+  chg4
+  chg3
+  chg2
+  chg0
+
+  $ hg log -r "reverse(author(testuser2))"
+  chg1