Patchwork log: make file log slow path usable on large repos

login
register
mail settings
Submitter Durham Goode
Date Sept. 11, 2013, 2:52 a.m.
Message ID <73303bff81aaa188c011.1378867929@dev350.prn1.facebook.com>
Download mbox | patch
Permalink /patch/2421/
State Accepted
Commit d184bae667e41d953648aa3c2641a3608c27f45a
Headers show

Comments

Durham Goode - Sept. 11, 2013, 2:52 a.m.
# HG changeset patch
# User Durham Goode <durham@fb.com>
# Date 1378867774 25200
#      Tue Sep 10 19:49:34 2013 -0700
# Node ID 73303bff81aaa188c0113791f0147f9e13a9eef8
# Parent  c821473150d2cc0c75c74a0292e9a060eb19bc36
log: make file log slow path usable on large repos

Running "hg log <pattern or directory>" on large repos took a very, very long
time because it first read ctx.files() for every commit before even starting to
process the results.

This change makes the ctx.files() check lazy, which makes the command start
producing results immediately.
Nikolaj Sjujskij - Sept. 11, 2013, 6:25 a.m.
> # HG changeset patch
> # User Durham Goode <durham@fb.com>
> # Date 1378867774 25200
> #      Tue Sep 10 19:49:34 2013 -0700
> # Node ID 73303bff81aaa188c0113791f0147f9e13a9eef8
> # Parent  c821473150d2cc0c75c74a0292e9a060eb19bc36
> log: make file log slow path usable on large repos
>
> Running "hg log <pattern or directory>" on large repos took a very, very  
> long
> time because it first read ctx.files() for every commit before even  
> starting to
> process the results.
>
> This change makes the ctx.files() check lazy, which makes the command  
> start producing results immediately.
Sounds like somehow related to issue 4002 ("hg log FILE is slow") [1]
Probably would help issue 3948 ("`log --patch` is very slow")[2] as well

[1]: http://bz.selenic.com/show_bug.cgi?id=4002
[2]: http://bz.selenic.com/show_bug.cgi?id=3948
Durham Goode - Sept. 11, 2013, 5:46 p.m.
On 9/10/13 11:25 PM, "Nikolaj Sjujskij" <sterkrig@myopera.com> wrote:

>> # HG changeset patch
>> # User Durham Goode <durham@fb.com>
>> # Date 1378867774 25200
>> #      Tue Sep 10 19:49:34 2013 -0700
>> # Node ID 73303bff81aaa188c0113791f0147f9e13a9eef8
>> # Parent  c821473150d2cc0c75c74a0292e9a060eb19bc36
>> log: make file log slow path usable on large repos
>>
>> Running "hg log <pattern or directory>" on large repos took a very,
>>very  
>> long
>> time because it first read ctx.files() for every commit before even
>> starting to
>> process the results.
>>
>> This change makes the ctx.files() check lazy, which makes the command
>> start producing results immediately.
>Sounds like somehow related to issue 4002 ("hg log FILE is slow") [1]
>Probably would help issue 3948 ("`log --patch` is very slow")[2] as well
>
>[1]: http://bz.selenic.com/show_bug.cgi?id=4002
>[2]: http://bz.selenic.com/show_bug.cgi?id=3948
>

The code I changed only affects hg log when a directory or pattern is
passed.  I don't think those bugs will be helped by this change
unfortunately.
Matt Mackall - Sept. 12, 2013, 2:43 a.m.
On Tue, 2013-09-10 at 19:52 -0700, Durham Goode wrote:
> # HG changeset patch
> # User Durham Goode <durham@fb.com>
> # Date 1378867774 25200
> #      Tue Sep 10 19:49:34 2013 -0700
> # Node ID 73303bff81aaa188c0113791f0147f9e13a9eef8
> # Parent  c821473150d2cc0c75c74a0292e9a060eb19bc36
> log: make file log slow path usable on large repos
>
> Running "hg log <pattern or directory>" on large repos took a very, very long
> time because it first read ctx.files() for every commit before even starting to
> process the results.
> 
> This change makes the ctx.files() check lazy, which makes the command start
> producing results immediately.

For future reference, I desperately want to kill ALL of the old log code
in favor of the faster, cleaner revset-based version that's in the
graphlog code. So continuing to collect legacy log code tweaks make me
sad.

As it happens, the thing blocking dropping the old code is making revset
computation lazy, which would bring this and many other wins.
Alexander Plavin - Sept. 12, 2013, 3:56 p.m.
12.09.2013, 06:44, "Matt Mackall" <mpm@selenic.com>:
> On Tue, 2013-09-10 at 19:52 -0700, Durham Goode wrote:
>
>>  # HG changeset patch
>>  # User Durham Goode <durham@fb.com>
>>  # Date 1378867774 25200
>>  #      Tue Sep 10 19:49:34 2013 -0700
>>  # Node ID 73303bff81aaa188c0113791f0147f9e13a9eef8
>>  # Parent  c821473150d2cc0c75c74a0292e9a060eb19bc36
>>  log: make file log slow path usable on large repos
>>
>>  Running "hg log <pattern or directory>" on large repos took a very, very long
>>  time because it first read ctx.files() for every commit before even starting to
>>  process the results.
>>
>>  This change makes the ctx.files() check lazy, which makes the command start
>>  producing results immediately.
>
> For future reference, I desperately want to kill ALL of the old log code
> in favor of the faster, cleaner revset-based version that's in the
> graphlog code. So continuing to collect legacy log code tweaks make me
> sad.
>
> As it happens, the thing blocking dropping the old code is making revset
> computation lazy, which would bring this and many other wins.

Is it really possible to make all revsets lazy? There are functions like sort, reverse and so on, which need to get the whole input at first. Or you meant only some specific ones?

>
> --
> Mathematics is the supreme nostalgia of our time.
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
Matt Mackall - Sept. 12, 2013, 6:45 p.m.
On Thu, 2013-09-12 at 19:56 +0400, Alexander Plavin wrote:
> 
> 12.09.2013, 06:44, "Matt Mackall" <mpm@selenic.com>:
> > On Tue, 2013-09-10 at 19:52 -0700, Durham Goode wrote:
> >
> >>  # HG changeset patch
> >>  # User Durham Goode <durham@fb.com>
> >>  # Date 1378867774 25200
> >>  #      Tue Sep 10 19:49:34 2013 -0700
> >>  # Node ID 73303bff81aaa188c0113791f0147f9e13a9eef8
> >>  # Parent  c821473150d2cc0c75c74a0292e9a060eb19bc36
> >>  log: make file log slow path usable on large repos
> >>
> >>  Running "hg log <pattern or directory>" on large repos took a very, very long
> >>  time because it first read ctx.files() for every commit before even starting to
> >>  process the results.
> >>
> >>  This change makes the ctx.files() check lazy, which makes the command start
> >>  producing results immediately.
> >
> > For future reference, I desperately want to kill ALL of the old log code
> > in favor of the faster, cleaner revset-based version that's in the
> > graphlog code. So continuing to collect legacy log code tweaks make me
> > sad.
> >
> > As it happens, the thing blocking dropping the old code is making revset
> > computation lazy, which would bring this and many other wins.
> 
> Is it really possible to make all revsets lazy? 

No. Some past discussion:

http://markmail.org/message/krsiby5f375ynssu

Most of the interesting parts of revsets can in fact be lazy.
Durham Goode - Sept. 12, 2013, 7:38 p.m.
On 9/11/13 7:43 PM, "Matt Mackall" <mpm@selenic.com> wrote:

>On Tue, 2013-09-10 at 19:52 -0700, Durham Goode wrote:
>> # HG changeset patch
>> # User Durham Goode <durham@fb.com>
>> # Date 1378867774 25200
>> #      Tue Sep 10 19:49:34 2013 -0700
>> # Node ID 73303bff81aaa188c0113791f0147f9e13a9eef8
>> # Parent  c821473150d2cc0c75c74a0292e9a060eb19bc36
>> log: make file log slow path usable on large repos
>>
>> Running "hg log <pattern or directory>" on large repos took a very,
>>very long
>> time because it first read ctx.files() for every commit before even
>>starting to
>> process the results.
>> 
>> This change makes the ctx.files() check lazy, which makes the command
>>start
>> producing results immediately.
>
>For future reference, I desperately want to kill ALL of the old log code
>in favor of the faster, cleaner revset-based version that's in the
>graphlog code. So continuing to collect legacy log code tweaks make me
>sad.
>
>As it happens, the thing blocking dropping the old code is making revset
>computation lazy, which would bring this and many other wins.

I'll take a look at that next time I'm in the area. This is patch is a
pretty significant win for us though, so I'd like to get it in anyway.
Matt Mackall - Sept. 16, 2013, 7:57 p.m.
On Tue, 2013-09-10 at 19:52 -0700, Durham Goode wrote:
> # HG changeset patch
> # User Durham Goode <durham@fb.com>
> # Date 1378867774 25200
> #      Tue Sep 10 19:49:34 2013 -0700
> # Node ID 73303bff81aaa188c0113791f0147f9e13a9eef8
> # Parent  c821473150d2cc0c75c74a0292e9a060eb19bc36
> log: make file log slow path usable on large repos

Queued for default, thanks.

Patch

diff --git a/mercurial/cmdutil.py b/mercurial/cmdutil.py
--- a/mercurial/cmdutil.py
+++ b/mercurial/cmdutil.py
@@ -1172,12 +1172,34 @@ 
                                'filenames'))
 
         # The slow path checks files modified in every changeset.
-        for i in sorted(revs):
-            ctx = change(i)
-            matches = filter(match, ctx.files())
-            if matches:
-                fncache[i] = matches
-                wanted.add(i)
+        # This is really slow on large repos, so compute the set lazily.
+        class lazywantedset(object):
+            def __init__(self):
+                self.set = set()
+                self.revs = set(revs)
+
+            # No need to worry about locality here because it will be accessed
+            # in the same order as the increasing window below.
+            def __contains__(self, value):
+                if value in self.set:
+                    return True
+                elif not value in self.revs:
+                    return False
+                else:
+                    self.revs.discard(value)
+                    ctx = change(value)
+                    matches = filter(match, ctx.files())
+                    if matches:
+                        fncache[value] = matches
+                        self.set.add(value)
+                        return True
+                    return False
+
+            def discard(self, value):
+                self.revs.discard(value)
+                self.set.discard(value)
+
+        wanted = lazywantedset()
 
     class followfilter(object):
         def __init__(self, onlyfirst=False):