Patchwork [V2] copies: optimize forward copy detection logic for rebases

login
register
mail settings
Submitter Durham Goode
Date Feb. 5, 2016, 9:23 p.m.
Message ID <64726d7baee7e6a32780.1454707436@dev8486.prn1.facebook.com>
Download mbox | patch
Permalink /patch/13021/
State Accepted
Headers show

Comments

Durham Goode - Feb. 5, 2016, 9:23 p.m.
# HG changeset patch
# User Durham Goode <durham@fb.com>
# Date 1454707404 28800
#      Fri Feb 05 13:23:24 2016 -0800
# Node ID 64726d7baee7e6a32780a31e861cb5e9af8aefc6
# Parent  01a5143cd25f285f8c745a92986cd7186bb32c90
copies: optimize forward copy detection logic for rebases

Forward copy detection (i.e. detecting what files have been moved/copied in
commit X since ancestor Y) previously required diff'ing the manifests of both X
and Y. This was expensive since it required reading both entire manifests and
doing a set difference (they weren't already in a set because of the
lazymanifest work). This cost almost 1 second on very large repositories, and
happens N times for a rebase of N commits.

This patch optimizes it for the case of rebase. In a rebase, we are comparing a
commit against it's immediate parent, and therefore we can know what files
changed by looking at ctx.files().  This let's us drastically decrease the size
of the set comparison, and makes it O(# of changes) instead of O(size of
manifest). This makes it take 1ms instead of 1000ms.
Augie Fackler - Feb. 5, 2016, 9:43 p.m.
On Fri, Feb 05, 2016 at 01:23:56PM -0800, Durham Goode wrote:
> # HG changeset patch
> # User Durham Goode <durham@fb.com>
> # Date 1454707404 28800
> #      Fri Feb 05 13:23:24 2016 -0800
> # Node ID 64726d7baee7e6a32780a31e861cb5e9af8aefc6
> # Parent  01a5143cd25f285f8c745a92986cd7186bb32c90
> copies: optimize forward copy detection logic for rebases

Dropped v1 and queued this one, many thanks!

>
> Forward copy detection (i.e. detecting what files have been moved/copied in
> commit X since ancestor Y) previously required diff'ing the manifests of both X
> and Y. This was expensive since it required reading both entire manifests and
> doing a set difference (they weren't already in a set because of the
> lazymanifest work). This cost almost 1 second on very large repositories, and
> happens N times for a rebase of N commits.
>
> This patch optimizes it for the case of rebase. In a rebase, we are comparing a
> commit against it's immediate parent, and therefore we can know what files
> changed by looking at ctx.files().  This let's us drastically decrease the size
> of the set comparison, and makes it O(# of changes) instead of O(size of
> manifest). This makes it take 1ms instead of 1000ms.
>
> diff --git a/mercurial/copies.py b/mercurial/copies.py
> --- a/mercurial/copies.py
> +++ b/mercurial/copies.py
> @@ -10,7 +10,9 @@ from __future__ import absolute_import
>  import heapq
>
>  from . import (
> +    node,
>      pathutil,
> +    scmutil,
>      util,
>  )
>
> @@ -175,7 +177,18 @@ def _forwardcopies(a, b, match=None):
>      # we currently don't try to find where old files went, too expensive
>      # this means we can miss a case like 'hg rm b; hg cp a b'
>      cm = {}
> -    missing = _computeforwardmissing(a, b, match=match)
> +
> +    # Computing the forward missing is quite expensive on large manifests, since
> +    # it compares the entire manifests. We can optimize it in the common use
> +    # case of computing what copies are in a commit versus its parent (like
> +    # during a rebase or histedit). Note, we exclude merge commits from this
> +    # optimization, since the ctx.files() for a merge commit is not correct for
> +    # this comparison.
> +    forwardmissingmatch = match
> +    if not match and b.p1() == a and b.p2().node() == node.nullid:
> +        forwardmissingmatch = scmutil.matchfiles(a._repo, b.files())
> +    missing = _computeforwardmissing(a, b, match=forwardmissingmatch)
> +
>      ancestrycontext = a._repo.changelog.ancestors([b.rev()], inclusive=True)
>      for f in missing:
>          fctx = b[f]
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel

Patch

diff --git a/mercurial/copies.py b/mercurial/copies.py
--- a/mercurial/copies.py
+++ b/mercurial/copies.py
@@ -10,7 +10,9 @@  from __future__ import absolute_import
 import heapq
 
 from . import (
+    node,
     pathutil,
+    scmutil,
     util,
 )
 
@@ -175,7 +177,18 @@  def _forwardcopies(a, b, match=None):
     # we currently don't try to find where old files went, too expensive
     # this means we can miss a case like 'hg rm b; hg cp a b'
     cm = {}
-    missing = _computeforwardmissing(a, b, match=match)
+
+    # Computing the forward missing is quite expensive on large manifests, since
+    # it compares the entire manifests. We can optimize it in the common use
+    # case of computing what copies are in a commit versus its parent (like
+    # during a rebase or histedit). Note, we exclude merge commits from this
+    # optimization, since the ctx.files() for a merge commit is not correct for
+    # this comparison.
+    forwardmissingmatch = match
+    if not match and b.p1() == a and b.p2().node() == node.nullid:
+        forwardmissingmatch = scmutil.matchfiles(a._repo, b.files())
+    missing = _computeforwardmissing(a, b, match=forwardmissingmatch)
+
     ancestrycontext = a._repo.changelog.ancestors([b.rev()], inclusive=True)
     for f in missing:
         fctx = b[f]