Patchwork [3,of,5,V2] ancestor: add a class to generate ancestors lazily

login
register
mail settings
Submitter Siddharth Agarwal
Date Dec. 15, 2012, 12:13 a.m.
Message ID <a00028bfa9eeb2e39b22.1355530433@sid0x220>
Download mbox | patch
Permalink /patch/115/
State Superseded
Headers show

Comments

Siddharth Agarwal - Dec. 15, 2012, 12:13 a.m.
# HG changeset patch
# User Siddharth Agarwal <sid0 at fb.com>
# Date 1355528919 28800
# Node ID a00028bfa9eeb2e39b2256e25792695463aa22d8
# Parent  35eb8d05863f3907b5bfd5f023a4746f91ef97d8
ancestor: add a class to generate ancestors lazily

In a linear repository with over 400,000 commits, hg perfancestorset takes
0.80 seconds no matter how far behind we're looking. With this patch, hg
perfancestorset -- X takes:

    Rev X       Time
       -1      0.00s
    -4000      0.01s
   -20000      0.04s
   -80000      0.17s
  -200000      0.43s
  -300000      0.69s
        0      0.88s

Thus, for revisions close to tip, we're up to several orders of magnitude
faster. At 0 we're around 10% slower.
Siddharth Agarwal - Dec. 15, 2012, 1:25 a.m.
On 12/14/2012 04:13 PM, Siddharth Agarwal wrote:
> +class lazyancestorset(object):
> +    def __init__(self, repo, revs, increvs, stoprev=0):

I just realized Kevin asked me to pass in a changelog here, and I passed 
in a repo instead. Passing in repos seems to be a bit more common in 
general, but either way's fine with me.

The docstring refers to pfunc instead of repo -- I've fixed that locally.
Kevin Bullock - Dec. 15, 2012, 4:26 a.m.
On 14 Dec 2012, at 7:25 PM, Siddharth Agarwal wrote:

> On 12/14/2012 04:13 PM, Siddharth Agarwal wrote:
>> +class lazyancestorset(object):
>> +    def __init__(self, repo, revs, increvs, stoprev=0):
> 
> I just realized Kevin asked me to pass in a changelog here, and I passed in a repo instead. Passing in repos seems to be a bit more common in general, but either way's fine with me.
> 
> The docstring refers to pfunc instead of repo -- I've fixed that locally.

I'd lean toward passing in the changelog directly, but I'll also defer to those who've been in this codebase longer than I have. Others have opinions?

If others are cool with this, I'd take a pull request with the docstring fixed up and changing repo -> changelog, instead of having to patchbomb the list again.

pacem in terris / ??? / ?????? / ????????? / ??
Kevin R. Bullock

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://selenic.com/pipermail/mercurial-devel/attachments/20121214/6f514ece/attachment.html>
Augie Fackler - Dec. 17, 2012, 1:02 a.m.
On Dec 14, 2012, at 11:26 PM, Kevin Bullock <kbullock+mercurial at ringworld.org> wrote:

> On 14 Dec 2012, at 7:25 PM, Siddharth Agarwal wrote:
> 
>> On 12/14/2012 04:13 PM, Siddharth Agarwal wrote:
>>> +class lazyancestorset(object):
>>> +    def __init__(self, repo, revs, increvs, stoprev=0):
>> 
>> I just realized Kevin asked me to pass in a changelog here, and I passed in a repo instead. Passing in repos seems to be a bit more common in general, but either way's fine with me.
>> 
>> The docstring refers to pfunc instead of repo -- I've fixed that locally.
> 
> I'd lean toward passing in the changelog directly, but I'll also defer to those who've been in this codebase longer than I have. Others have opinions?
> 

I think I'd lean that way too, but only slightly.
Pierre-Yves David - Dec. 17, 2012, 6:31 p.m.
On Fri, Dec 14, 2012 at 04:13:53PM -0800, Siddharth Agarwal wrote:
> # HG changeset patch
> # User Siddharth Agarwal <sid0 at fb.com>
> # Date 1355528919 28800
> # Node ID a00028bfa9eeb2e39b2256e25792695463aa22d8
> # Parent  35eb8d05863f3907b5bfd5f023a4746f91ef97d8
> ancestor: add a class to generate ancestors lazily
> 
> In a linear repository with over 400,000 commits, hg perfancestorset takes
> 0.80 seconds no matter how far behind we're looking. With this patch, hg
> perfancestorset -- X takes:
> 
>     Rev X       Time
>        -1      0.00s
>     -4000      0.01s
>    -20000      0.04s
>    -80000      0.17s
>   -200000      0.43s
>   -300000      0.69s
>         0      0.88s
> 
> Thus, for revisions close to tip, we're up to several orders of magnitude
> faster. At 0 we're around 10% slower.

Looks promising.

But I'm not very happy to see that as "yes another different function".

Isn't it possible to alter the original ancestors function for the same purpose?

Can't we apply this kind of of speedup to the revset version too an use
it every where?

> diff -r 35eb8d05863f -r a00028bfa9ee mercurial/ancestor.py
> --- a/mercurial/ancestor.py	Fri Dec 14 10:23:18 2012 -0800
> +++ b/mercurial/ancestor.py	Fri Dec 14 15:48:39 2012 -0800
> @@ -316,3 +316,49 @@
>  
>      missing.reverse()
>      return missing
> +
> +class lazyancestorset(object):
> +    def __init__(self, repo, revs, increvs, stoprev=0):
> +        """Create a new set-like object generating ancestors for the given
> +        revs.
> +
> +        This is computed lazily starting from tip and working in descending rev
> +        order. The set only supports the membership operation.

You are starting from <revs> not tip isn't it ?

> +        revs should be an iterable. pfunc must return a list of parent revs for
> +        a given rev. increvs is a boolean that indicates whether revs should be
> +        included in the set. 

increvs a bit misleading. I read it a "incremental revs".
I would name it "inclusive".

I would also use a default to True to simplify signature.

>+                               Revs lower than stoprev will not be generated."""

<stoprev> is never used in the code actual user code. Is that not
possible.

> +        # Python's heap is a min-heap. Multiply all values by -1 to convert it
> +        # into a max-heap.
> +        self._parentrevs = repo.changelog.parentrevs

A changelog argument would be better here in my opinion.

You should highlight why you are using a heapq instead of a deque here.


> +        self._visit = [-rev for rev in revs]
> +        heapq.heapify(self._visit)
> +        if increvs:
> +            self._seen = set(revs)
> +        else:
> +            self._seen = set()
> +        self._stoprev = stoprev
> +
> +    def __contains__(self, target):
> +        seen = self._seen
> +        if target in seen:
> +            return True
> +
> +        parentrevs = self._parentrevs
> +        visit = self._visit
> +        stoprev = self._stoprev
> +        heappop = heapq.heappop
> +        heappush = heapq.heappush
> +
> +        targetseen = False
> +
> +        while visit and -visit[0] >= target and not targetseen:

could be -visit[0] > target every part of visit already have been
checked agains target.

> +            for parent in parentrevs(-heappop(visit)):
> +                if parent in seen or parent < stoprev:

(switching the two conditional sound like a possible micro optimistation.)

> +                    continue
> +                heappush(visit, -parent)
> +                seen.add(parent)

add a comment stating that pushing parent is mandatory if we want futur
call to containts to be correct.

> +                if parent == target:
> +                    targetseen = True
> +
> +        return targetseen

I would use a break instead of a boolean check

while visit and -visit[0] > target:
    for parent in parentrevs(-heappop(visit))
        if parent < stoprev or parent in seen:
            continue
        heappush(visit, -parent)
        seen.add(parent)
        if parent == target:
            break
else:
    return False
return True

(yop, python, while else?)
Siddharth Agarwal - Dec. 18, 2012, 5:25 a.m.
On 12/17/2012 10:31 AM, Pierre-Yves David wrote:
> Looks promising.
>
> But I'm not very happy to see that as "yes another different function".
>
> Isn't it possible to alter the original ancestors function for the same purpose?

I discussed this with Pierre and we came to a solution for this.

> Can't we apply this kind of of speedup to the revset version too an use
> it every where?

That will be much more work than this.

> You are starting from <revs> not tip isn't it ? 

Yes, fixed.

> increvs a bit misleading. I read it a "incremental revs". I would name 
> it "inclusive". I would also use a default to True to simplify signature. 

Fixed (with default False though because that's how revlog.ancestors was).

> <stoprev> is never used in the code actual user code. Is that not 
> possible.

This is now moot.

> A changelog argument would be better here in my opinion. You should 
> highlight why you are using a heapq instead of a deque here. 

Fixed and fixed.

> could be -visit[0] > target every part of visit already have been 
> checked agains target.

Fixed.

> (switching the two conditional sound like a possible micro 
> optimistation.) 

Fixed.

> add a comment stating that pushing parent is mandatory if we want 
> futur call to containts to be correct

Done.

> I would use a break instead of a boolean check
>
> while visit and -visit[0] > target:
>      for parent in parentrevs(-heappop(visit))
>          if parent < stoprev or parent in seen:
>              continue
>          heappush(visit, -parent)
>          seen.add(parent)
>          if parent == target:
>              break
> else:
>      return False
> return True
>
> (yop, python, while else?)
>

This code is wrong because it can drop parents on the floor while 
exiting. It is possible to fix this with two for loops over the parents 
instead of one, but at that point, why not just use a flag?
Pierre-Yves David - Dec. 18, 2012, 10:04 a.m.
On Mon, Dec 17, 2012 at 09:25:05PM -0800, Siddharth Agarwal wrote:
> On 12/17/2012 10:31 AM, Pierre-Yves David wrote:
> >Looks promising.
> >
> >But I'm not very happy to see that as "yes another different function".
> >
> >Isn't it possible to alter the original ancestors function for the same purpose?
> 
> I discussed this with Pierre and we came to a solution for this.

That is still Pierre-Yves ;-)

Patch

diff -r 35eb8d05863f -r a00028bfa9ee contrib/perf.py
--- a/contrib/perf.py	Fri Dec 14 10:23:18 2012 -0800
+++ b/contrib/perf.py	Fri Dec 14 15:48:39 2012 -0800
@@ -1,7 +1,7 @@ 
 # perf.py - performance test routines
 '''helper extension to measure performance'''
 
-from mercurial import cmdutil, scmutil, util, match, commands
+from mercurial import cmdutil, scmutil, util, match, commands, ancestor
 import time, os, sys
 
 def timer(func, title=None):
@@ -82,7 +82,7 @@ 
     revs = repo.revs(revset)
     heads = repo.changelog.headrevs()
     def d():
-        s = set(repo.changelog.ancestors(heads))
+        s = ancestor.lazyancestorset(repo, heads, False)
         for rev in revs:
             rev in s
     timer(d)
diff -r 35eb8d05863f -r a00028bfa9ee mercurial/ancestor.py
--- a/mercurial/ancestor.py	Fri Dec 14 10:23:18 2012 -0800
+++ b/mercurial/ancestor.py	Fri Dec 14 15:48:39 2012 -0800
@@ -316,3 +316,49 @@ 
 
     missing.reverse()
     return missing
+
+class lazyancestorset(object):
+    def __init__(self, repo, revs, increvs, stoprev=0):
+        """Create a new set-like object generating ancestors for the given
+        revs.
+
+        This is computed lazily starting from tip and working in descending rev
+        order. The set only supports the membership operation.
+
+        revs should be an iterable. pfunc must return a list of parent revs for
+        a given rev. increvs is a boolean that indicates whether revs should be
+        included in the set. Revs lower than stoprev will not be generated."""
+        # Python's heap is a min-heap. Multiply all values by -1 to convert it
+        # into a max-heap.
+        self._parentrevs = repo.changelog.parentrevs
+        self._visit = [-rev for rev in revs]
+        heapq.heapify(self._visit)
+        if increvs:
+            self._seen = set(revs)
+        else:
+            self._seen = set()
+        self._stoprev = stoprev
+
+    def __contains__(self, target):
+        seen = self._seen
+        if target in seen:
+            return True
+
+        parentrevs = self._parentrevs
+        visit = self._visit
+        stoprev = self._stoprev
+        heappop = heapq.heappop
+        heappush = heapq.heappush
+
+        targetseen = False
+
+        while visit and -visit[0] >= target and not targetseen:
+            for parent in parentrevs(-heappop(visit)):
+                if parent in seen or parent < stoprev:
+                    continue
+                heappush(visit, -parent)
+                seen.add(parent)
+                if parent == target:
+                    targetseen = True
+
+        return targetseen
diff -r 35eb8d05863f -r a00028bfa9ee tests/test-ancestor.py
--- a/tests/test-ancestor.py	Fri Dec 14 10:23:18 2012 -0800
+++ b/tests/test-ancestor.py	Fri Dec 14 15:48:39 2012 -0800
@@ -34,6 +34,10 @@ 
          13: [8]}
 pfunc = graph.get
 
+class mockrepo(object):
+    class changelog(object):
+        parentrevs = graph.get
+
 def runmissingancestors(revs, bases):
     print "%% ancestors of %s and not of %s" % (revs, bases)
     print ancestor.missingancestors(revs, bases, pfunc)
@@ -70,5 +74,32 @@ 
     runmissingancestors([10, 11, 12], [13])
     runmissingancestors([13], [10, 11, 12])
 
+def genlazyancestorset(revs, increvs, stoprev=0):
+    print "%% lazy ancestor set for %s, including revs = %s" % (revs, increvs)
+    return ancestor.lazyancestorset(mockrepo, revs, increvs, stoprev=stoprev)
+
+def isinlazyset(s, l):
+    print [n for n in l if n in s]
+
+def test_lazyancestorset():
+    # Empty revs
+    s = genlazyancestorset([], False)
+    isinlazyset(s, [3, 0, -1])
+
+    # Standard example
+    s = genlazyancestorset([11, 13], False)
+    isinlazyset(s, [11, 13, 7, 9, 8, 3, 6, 4, 1, -1, 0])
+
+    # Including revs
+    s = genlazyancestorset([11, 13], True)
+    isinlazyset(s, [11, 13, 7, 9, 8, 3, 6, 4, 1, -1, 0])
+
+    # Test with stoprev
+    s = genlazyancestorset([11, 13], False, stoprev=6)
+    isinlazyset(s, [11, 13, 7, 9, 8, 3, 6, 4, 1, -1, 0])
+    s = genlazyancestorset([11, 13], True, stoprev=6)
+    isinlazyset(s, [11, 13, 7, 9, 8, 3, 6, 4, 1, -1, 0])
+
 if __name__ == '__main__':
     test_missingancestors()
+    test_lazyancestorset()
diff -r 35eb8d05863f -r a00028bfa9ee tests/test-ancestor.py.out
--- a/tests/test-ancestor.py.out	Fri Dec 14 10:23:18 2012 -0800
+++ b/tests/test-ancestor.py.out	Fri Dec 14 15:48:39 2012 -0800
@@ -34,3 +34,13 @@ 
 [0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12]
 % ancestors of [13] and not of [10, 11, 12]
 [8, 13]
+% lazy ancestor set for [], including revs = False
+[]
+% lazy ancestor set for [11, 13], including revs = False
+[7, 8, 3, 4, 1, 0]
+% lazy ancestor set for [11, 13], including revs = True
+[11, 13, 7, 8, 3, 4, 1, 0]
+% lazy ancestor set for [11, 13], including revs = False
+[7, 8]
+% lazy ancestor set for [11, 13], including revs = True
+[11, 13, 7, 8]