Patchwork [3,of,4] copies: calculate 'bothnew' from manifestdict.filesnotin()

login
register
mail settings
Submitter Martin von Zweigbergk
Date Feb. 27, 2015, 10:46 p.m.
Message ID <b3bcf58446fdebf3672e.1425077211@martinvonz.mtv.corp.google.com>
Download mbox | patch
Permalink /patch/7855/
State Accepted
Commit 61aadba2396e07b3a6f2b19a65d4d4f0b8a0024f
Headers show

Comments

Martin von Zweigbergk - Feb. 27, 2015, 10:46 p.m.
# HG changeset patch
# User Martin von Zweigbergk <martinvonz@google.com>
# Date 1425074581 28800
#      Fri Feb 27 14:03:01 2015 -0800
# Node ID b3bcf58446fdebf3672edbbc55c24509e549eb22
# Parent  89f810fb00184d3a1dd49412d0c3256a596ddca8
copies: calculate 'bothnew' from manifestdict.filesnotin()

In the same spirit as the previous change, let's now calculate the
'bothnew' variable using manifestdict.filesnotin().5D
Durham Goode - Feb. 28, 2015, 1:08 a.m.
On 2/27/15, 2:46 PM, "Martin von Zweigbergk" <martinvonz@google.com> wrote:

># HG changeset patch
># User Martin von Zweigbergk <martinvonz@google.com>
># Date 1425074581 28800
>#      Fri Feb 27 14:03:01 2015 -0800
># Node ID b3bcf58446fdebf3672edbbc55c24509e549eb22
># Parent  89f810fb00184d3a1dd49412d0c3256a596ddca8
>copies: calculate 'bothnew' from manifestdict.filesnotin()
>
>In the same spirit as the previous change, let's now calculate the
>'bothnew' variable using manifestdict.filesnotin().5D
>
>diff -r 89f810fb0018 -r b3bcf58446fd mercurial/copies.py
>--- a/mercurial/copies.py	Fri Feb 27 14:02:30 2015 -0800
>+++ b/mercurial/copies.py	Fri Feb 27 14:03:01 2015 -0800
>@@ -302,7 +302,9 @@
>         else:
>             diverge2.update(fl) # reverse map for below
> 
>-    bothnew = sorted([d for d in m1 if d in m2 and d not in ma])
>+    addedinm1 = m1.filesnotin(ma)
>+    addedinm2 = m2.filesnotin(ma)
>+    bothnew = sorted(addedinm1 & addedinm2)

I was concerned about perf here (since we¹re constructing sets when we
used to not, and we¹re iterating over both m1 and m2 when we used to not),
but I ran the numbers and the new stuff is actually faster. To diff
million file manifests with 5k files different each, previously it was
0.36s and now it¹s 0.23.


The script for posterity:

#!/bin/env python
import time

fm1 = list(xrange(0, 990000))
fm2 = list(xrange(0, 990000))
fma = list(xrange(0, 990000))
fm1.extend(range(1000000, 1005000))
fm2.extend(range(1005000, 1010000))
fma.extend(range(1010000, 1015000))

m1 = set(fm1)
m2 = set(fm2)
ma = set(fma)

start = time.time()
old = sorted([d for d in fm1 if d in m2 and d not in ma])
print ("Old %s" % str(time.time() - start))

def doit(d1, d2):
    missing = set(d1)
    missing.difference_update(d2)
    return missing

start = time.time()
addedinm1 = doit(fm1, fma)
addedinm2 = doit(fm2, fma)
bothnew = sorted(addedinm1 & addedinm2)
print ("New %s" % str(time.time() - start))

~/local> ./testset.py
Old 0.343271970749
New 0.231050014496
Martin von Zweigbergk - Feb. 28, 2015, 1:15 a.m.
On Fri, Feb 27, 2015 at 5:08 PM Durham Goode <durham@fb.com> wrote:

>
>
> On 2/27/15, 2:46 PM, "Martin von Zweigbergk" <martinvonz@google.com>
> wrote:
>
> ># HG changeset patch
> ># User Martin von Zweigbergk <martinvonz@google.com>
> ># Date 1425074581 28800
> >#      Fri Feb 27 14:03:01 2015 -0800
> ># Node ID b3bcf58446fdebf3672edbbc55c24509e549eb22
> ># Parent  89f810fb00184d3a1dd49412d0c3256a596ddca8
> >copies: calculate 'bothnew' from manifestdict.filesnotin()
> >
> >In the same spirit as the previous change, let's now calculate the
> >'bothnew' variable using manifestdict.filesnotin().5D
> >
> >diff -r 89f810fb0018 -r b3bcf58446fd mercurial/copies.py
> >--- a/mercurial/copies.py      Fri Feb 27 14:02:30 2015 -0800
> >+++ b/mercurial/copies.py      Fri Feb 27 14:03:01 2015 -0800
> >@@ -302,7 +302,9 @@
> >         else:
> >             diverge2.update(fl) # reverse map for below
> >
> >-    bothnew = sorted([d for d in m1 if d in m2 and d not in ma])
> >+    addedinm1 = m1.filesnotin(ma)
> >+    addedinm2 = m2.filesnotin(ma)
> >+    bothnew = sorted(addedinm1 & addedinm2)
>
> I was concerned about perf here (since we¹re constructing sets when we
> used to not, and we¹re iterating over both m1 and m2 when we used to not),
> but I ran the numbers and the new stuff is actually faster. To diff
> million file manifests with 5k files different each, previously it was
> 0.36s and now it¹s 0.23.
>

Great! And thanks for doing my job for me! :-)

Even if it were a little slower, it should hopefully have been outweighed
after 4/4. But even better that it's already after 3/4.

Also, to whomever applies this, feel free to drop (or not) the "5D" at the
end of the message.

Patch

diff -r 89f810fb0018 -r b3bcf58446fd mercurial/copies.py
--- a/mercurial/copies.py	Fri Feb 27 14:02:30 2015 -0800
+++ b/mercurial/copies.py	Fri Feb 27 14:03:01 2015 -0800
@@ -302,7 +302,9 @@ 
         else:
             diverge2.update(fl) # reverse map for below
 
-    bothnew = sorted([d for d in m1 if d in m2 and d not in ma])
+    addedinm1 = m1.filesnotin(ma)
+    addedinm2 = m2.filesnotin(ma)
+    bothnew = sorted(addedinm1 & addedinm2)
     if bothnew:
         repo.ui.debug("  unmatched files new in both:\n   %s\n"
                       % "\n   ".join(bothnew))