Patchwork manifest: improve filesnotin performance by using lazymanifest diff

login
register
mail settings
Submitter Tony Tung
Date April 27, 2016, 7:23 a.m.
Message ID <7f80dce78781f5fe691a.1461741827@andromeda.local>
Download mbox | patch
Permalink /patch/14799/
State Superseded
Commit e2178f7d17c01fcd5a60b7c2a3f016f1d464c8cf
Headers show

Comments

Tony Tung - April 27, 2016, 7:23 a.m.
# HG changeset patch
# User Tony Tung <tonytung@merly.org>
# Date 1461740718 25200
#      Wed Apr 27 00:05:18 2016 -0700
# Branch stable
# Node ID 7f80dce78781f5fe691a23f1b7f5a110ed170f32
# Parent  97811ff7964710d32cae951df1da8019b46151a2
manifest: improve filesnotin performance by using lazymanifest diff

lazymanifests can compute diffs significantly faster than taking the set
of two manifests and calculating the delta.
Sean Farley - April 27, 2016, 8:07 a.m.
Tony Tung <ttung@fb.com> writes:

> # HG changeset patch
> # User Tony Tung <tonytung@merly.org>
> # Date 1461740718 25200
> #      Wed Apr 27 00:05:18 2016 -0700
> # Branch stable
> # Node ID 7f80dce78781f5fe691a23f1b7f5a110ed170f32
> # Parent  97811ff7964710d32cae951df1da8019b46151a2
> manifest: improve filesnotin performance by using lazymanifest diff
>
> lazymanifests can compute diffs significantly faster than taking the set
> of two manifests and calculating the delta.

FYI, we're currently in a feature freeze:

https://www.mercurial-scm.org/wiki/TimeBasedReleasePlan

> diff --git a/mercurial/manifest.py b/mercurial/manifest.py
> --- a/mercurial/manifest.py
> +++ b/mercurial/manifest.py
> @@ -211,8 +211,10 @@
>  
>      def filesnotin(self, m2):
>          '''Set of files in this manifest that are not in the other'''
> -        files = set(self)
> -        files.difference_update(m2)
> +        diff = self.diff(m2)
> +        files = set(filepath
> +                    for filepath, hashflags in diff.items()
> +                    if hashflags[1][0] is None)

(for after May 1st) Would it be feasible to have a perf test for this
(and some sweet, sweet performance numbers in the commit message)?
via Mercurial-devel - April 27, 2016, 1:20 p.m.
On Wed, Apr 27, 2016, 01:07 Sean Farley <sean@farley.io> wrote:

>
> Tony Tung <ttung@fb.com> writes:
>
> > # HG changeset patch
> > # User Tony Tung <tonytung@merly.org>
> > # Date 1461740718 25200
> > #      Wed Apr 27 00:05:18 2016 -0700
> > # Branch stable
> > # Node ID 7f80dce78781f5fe691a23f1b7f5a110ed170f32
> > # Parent  97811ff7964710d32cae951df1da8019b46151a2
> > manifest: improve filesnotin performance by using lazymanifest diff
> >
> > lazymanifests can compute diffs significantly faster than taking the set
> > of two manifests and calculating the delta.
>
> FYI, we're currently in a feature freeze:
>
> https://www.mercurial-scm.org/wiki/TimeBasedReleasePlan
>
> > diff --git a/mercurial/manifest.py b/mercurial/manifest.py
> > --- a/mercurial/manifest.py
> > +++ b/mercurial/manifest.py
> > @@ -211,8 +211,10 @@
> >
> >      def filesnotin(self, m2):
> >          '''Set of files in this manifest that are not in the other'''
> > -        files = set(self)
> > -        files.difference_update(m2)
> > +        diff = self.diff(m2)
> > +        files = set(filepath
> > +                    for filepath, hashflags in diff.items()
>

iteritems() may be noticeably faster on large diffs

> +                    if hashflags[1][0] is None)
>
> (for after May 1st) Would it be feasible to have a perf test for this
> (and some sweet, sweet performance numbers in the commit message)?
>

As usual, I will request real-world perf numbers (instead or in addition).
It would be nice to have them for both a good (small diff) and a bad case
(large diff). Thanks.
via Mercurial-devel - April 27, 2016, 3:32 p.m.
Comments inline.

On Apr 27, 2016, at 6:20 AM, Martin von Zweigbergk <martinvonz@google.com<mailto:martinvonz@google.com>> wrote:



On Wed, Apr 27, 2016, 01:07 Sean Farley <sean@farley.io<mailto:sean@farley.io>> wrote:

Tony Tung <ttung@fb.com<mailto:ttung@fb.com>> writes:

> # HG changeset patch

> # User Tony Tung <tonytung@merly.org<mailto:tonytung@merly.org>>

> # Date 1461740718 25200

> #      Wed Apr 27 00:05:18 2016 -0700

> # Branch stable

> # Node ID 7f80dce78781f5fe691a23f1b7f5a110ed170f32

> # Parent  97811ff7964710d32cae951df1da8019b46151a2

> manifest: improve filesnotin performance by using lazymanifest diff

>

> lazymanifests can compute diffs significantly faster than taking the set

> of two manifests and calculating the delta.


FYI, we're currently in a feature freeze:

https://www.mercurial-scm.org/wiki/TimeBasedReleasePlan

Will resubmit.


> diff --git a/mercurial/manifest.py b/mercurial/manifest.py

> --- a/mercurial/manifest.py

> +++ b/mercurial/manifest.py

> @@ -211,8 +211,10 @@

>

>      def filesnotin(self, m2):

>          '''Set of files in this manifest that are not in the other'''

> -        files = set(self)

> -        files.difference_update(m2)

> +        diff = self.diff(m2)

> +        files = set(filepath

> +                    for filepath, hashflags in diff.items()


iteritems() may be noticeably faster on large diffs

I was under the impression that items() had the same performance characteristics as iteritems(), but apparently, that’s only for python 3.  Will fix.


> +                    if hashflags[1][0] is None)


(for after May 1st) Would it be feasible to have a perf test for this
(and some sweet, sweet performance numbers in the commit message)?

As usual, I will request real-world perf numbers (instead or in addition). It would be nice to have them for both a good (small diff) and a bad case (large diff). Thanks.

hg diff -c . on Facebook’s large repo takes 1.5s instead of 2.1s with this change.  In the case of large diffs, I suspect the performance regression would be drowned out by the file system operations.
via Mercurial-devel - April 27, 2016, 3:57 p.m.
On Wed, Apr 27, 2016 at 8:32 AM, Tony Tung <tonytung@instagram.com> wrote:
> Comments inline.
>
> On Apr 27, 2016, at 6:20 AM, Martin von Zweigbergk <martinvonz@google.com>
> wrote:
>
>
>
> On Wed, Apr 27, 2016, 01:07 Sean Farley <sean@farley.io> wrote:
>>
>>
>> Tony Tung <ttung@fb.com> writes:
>>
>> > # HG changeset patch
>> > # User Tony Tung <tonytung@merly.org>
>> > # Date 1461740718 25200
>> > #      Wed Apr 27 00:05:18 2016 -0700
>> > # Branch stable
>> > # Node ID 7f80dce78781f5fe691a23f1b7f5a110ed170f32
>> > # Parent  97811ff7964710d32cae951df1da8019b46151a2
>> > manifest: improve filesnotin performance by using lazymanifest diff
>> >
>> > lazymanifests can compute diffs significantly faster than taking the set
>> > of two manifests and calculating the delta.
>>
>> FYI, we're currently in a feature freeze:
>>
>> https://www.mercurial-scm.org/wiki/TimeBasedReleasePlan
>
>
> Will resubmit.
>
>>
>> > diff --git a/mercurial/manifest.py b/mercurial/manifest.py
>> > --- a/mercurial/manifest.py
>> > +++ b/mercurial/manifest.py
>> > @@ -211,8 +211,10 @@
>> >
>> >      def filesnotin(self, m2):
>> >          '''Set of files in this manifest that are not in the other'''
>> > -        files = set(self)
>> > -        files.difference_update(m2)
>> > +        diff = self.diff(m2)
>> > +        files = set(filepath
>> > +                    for filepath, hashflags in diff.items()
>
>
> iteritems() may be noticeably faster on large diffs
>
>
> I was under the impression that items() had the same performance
> characteristics as iteritems(), but apparently, that’s only for python 3.
> Will fix.
>
>
>> > +                    if hashflags[1][0] is None)
>>
>> (for after May 1st) Would it be feasible to have a perf test for this
>> (and some sweet, sweet performance numbers in the commit message)?
>
>
> As usual, I will request real-world perf numbers (instead or in addition).
> It would be nice to have them for both a good (small diff) and a bad case
> (large diff). Thanks.
>
>
> hg diff -c . on Facebook’s large repo takes 1.5s instead of 2.1s with this
> change.

Wow, that's slow :-( Great improvement, though. I'm guessing you have
--git on by default, because otherwise I don't think this method will
be called (please include the --git in the updated commit message). I
didn't see any significant speedup on the Mozilla repo (118k files at
my revision).

>  In the case of large diffs, I suspect the performance regression
> would be drowned out by the file system operations.

For diff and merge, yes, but not for "status -C" (those are the cases
I can think of where this method is called).

Patch

diff --git a/mercurial/manifest.py b/mercurial/manifest.py
--- a/mercurial/manifest.py
+++ b/mercurial/manifest.py
@@ -211,8 +211,10 @@ 
 
     def filesnotin(self, m2):
         '''Set of files in this manifest that are not in the other'''
-        files = set(self)
-        files.difference_update(m2)
+        diff = self.diff(m2)
+        files = set(filepath
+                    for filepath, hashflags in diff.items()
+                    if hashflags[1][0] is None)
         return files
 
     @propertycache