Patchwork [rfc] convert: optimize convert of files that are unmodified from p2 in merges

login
register
mail settings
Submitter Mads Kiilerich
Date March 17, 2015, 6:53 p.m.
Message ID <3a28bb20a04f461abafa.1426618394@madski>
Download mbox | patch
Permalink /patch/8125/
State Superseded
Commit 216fa1ba999333e86200d2608e3b78670168d516
Headers show

Comments

Mads Kiilerich - March 17, 2015, 6:53 p.m.
# HG changeset patch
# User Mads Kiilerich <madski@unity3d.com>
# Date 1426618256 -3600
#      Tue Mar 17 19:50:56 2015 +0100
# Branch stable
# Node ID 3a28bb20a04f461abafa1843a8987f728de98f94
# Parent  b73a22d1d9bfe3a7f8633340ea75a0ab1526c21b
convert: optimize convert of files that are unmodified from p2 in merges

Conversion of a merge starts with p1 and re-adds the files that were changed in
the merge or came unmodified from p2. Files that are unmodified from p1 will
thus not be touched and take no time. Files that are unmodified from p2 would be
retrieved and rehashed. They would end up getting the same hash as in p2 and end
up reusing the filelog entry and look like the p1 case ... but it is slow.

Instead, make getchanges also return 'files that are unmodified from p2' so the
sink can reuse the existing p2 entry instead of calling getfile.

This do for unfortunate reasons make convert a bit slower for general use but
speeds things up a lot when you have an expensive custom getfile function
(think http://mercurial.selenic.com/wiki/ConvertExtension#Customization with a
code reformatter).

This is so far only implemented for the combination of hg source and hg sink.

This version is verbose when it reuses the p2 file content and thus shows the
test coverage. The message should probably just be a debug message.

TODO: Reusing the p2 filelog entry directly _should_ in principle be faster
than adding a new entry that ends up being a duplicate. Fixing that will
require changes to core Mercurial.
Matt Mackall - March 18, 2015, 8:54 p.m.
On Tue, 2015-03-17 at 19:53 +0100, Mads Kiilerich wrote:
> # HG changeset patch
> # User Mads Kiilerich <madski@unity3d.com>
> # Date 1426618256 -3600
> #      Tue Mar 17 19:50:56 2015 +0100
> # Branch stable
> # Node ID 3a28bb20a04f461abafa1843a8987f728de98f94
> # Parent  b73a22d1d9bfe3a7f8633340ea75a0ab1526c21b
> convert: optimize convert of files that are unmodified from p2 in merges

Seems reasonable.

> This version is verbose when it reuses the p2 file content and thus shows the
> test coverage. The message should probably just be a debug message.

Agreed.

> TODO: Reusing the p2 filelog entry directly _should_ in principle be faster
> than adding a new entry that ends up being a duplicate. Fixing that will
> require changes to core Mercurial.

Yeah.

Patch

diff --git a/hgext/convert/bzr.py b/hgext/convert/bzr.py
--- a/hgext/convert/bzr.py
+++ b/hgext/convert/bzr.py
@@ -143,7 +143,8 @@  class bzr_source(converter_source):
         parentids = self._parentids.pop(version)
         # only diff against first parent id
         prevtree = self.sourcerepo.revision_tree(parentids[0])
-        return self._gettreechanges(self._revtree, prevtree)
+        files, changes = self._gettreechanges(self._revtree, prevtree)
+        return files, changes, set()
 
     def getcommit(self, version):
         rev = self.sourcerepo.get_revision(version)
diff --git a/hgext/convert/common.py b/hgext/convert/common.py
--- a/hgext/convert/common.py
+++ b/hgext/convert/common.py
@@ -94,7 +94,7 @@  class converter_source(object):
         raise NotImplementedError
 
     def getchanges(self, version, full):
-        """Returns a tuple of (files, copies).
+        """Returns a tuple of (files, copies, cleanp2).
 
         files is a sorted list of (filename, id) tuples for all files
         changed between version and its first parent returned by
@@ -102,6 +102,9 @@  class converter_source(object):
         id is the source revision id of the file.
 
         copies is a dictionary of dest: source
+
+        cleanp2 is the set of files filenames that are clean against p2.
+        (files that are clean against p1 are not in files (unless full).)
         """
         raise NotImplementedError
 
@@ -212,7 +215,8 @@  class converter_sink(object):
         mapping equivalent authors identifiers for each system."""
         return None
 
-    def putcommit(self, files, copies, parents, commit, source, revmap, full):
+    def putcommit(self, files, copies, parents, commit, source, revmap, full,
+                  cleanp2):
         """Create a revision with all changed files listed in 'files'
         and having listed parents. 'commit' is a commit object
         containing at a minimum the author, date, and message for this
@@ -222,6 +226,8 @@  class converter_sink(object):
         of source revisions to converted revisions. Only getfile() and
         lookuprev() should be called on 'source'. 'full' means that 'files'
         is complete and all other files should be removed.
+        'cleanp2' is a set of the filenames that are unchanged from p2
+        (only in the common merge case where there two parents).
 
         Note that the sink repository is not told to update itself to
         a particular revision (or even what that revision would be)
diff --git a/hgext/convert/convcmd.py b/hgext/convert/convcmd.py
--- a/hgext/convert/convcmd.py
+++ b/hgext/convert/convcmd.py
@@ -397,7 +397,7 @@  class converter(object):
                 dest = self.map[changes]
             self.map[rev] = dest
             return
-        files, copies = changes
+        files, copies, cleanp2 = changes
         pbranches = []
         if commit.parents:
             for prev in commit.parents:
@@ -413,9 +413,11 @@  class converter(object):
             parents = [self.map.get(p, p) for p in parents]
         except KeyError:
             parents = [b[0] for b in pbranches]
+        if len(pbranches) != 2:
+            cleanp2 = set()
         source = progresssource(self.ui, self.source, len(files))
         newnode = self.dest.putcommit(files, copies, parents, commit,
-                                      source, self.map, full)
+                                      source, self.map, full, cleanp2)
         source.close()
         self.source.converted(rev, newnode)
         self.map[rev] = newnode
diff --git a/hgext/convert/cvs.py b/hgext/convert/cvs.py
--- a/hgext/convert/cvs.py
+++ b/hgext/convert/cvs.py
@@ -262,7 +262,7 @@  class convert_cvs(converter_source):
         if full:
             raise util.Abort(_("convert from cvs do not support --full"))
         self._parse()
-        return sorted(self.files[rev].iteritems()), {}
+        return sorted(self.files[rev].iteritems()), {}, set()
 
     def getcommit(self, rev):
         self._parse()
diff --git a/hgext/convert/darcs.py b/hgext/convert/darcs.py
--- a/hgext/convert/darcs.py
+++ b/hgext/convert/darcs.py
@@ -188,7 +188,7 @@  class darcs_source(converter_source, com
                 changes.append((elt.text.strip(), rev))
         self.pull(rev)
         self.lastrev = rev
-        return sorted(changes), copies
+        return sorted(changes), copies, set()
 
     def getfile(self, name, rev):
         if rev != self.lastrev:
diff --git a/hgext/convert/filemap.py b/hgext/convert/filemap.py
--- a/hgext/convert/filemap.py
+++ b/hgext/convert/filemap.py
@@ -384,12 +384,15 @@  class filemap_source(converter_source):
         # Get the real changes and do the filtering/mapping. To be
         # able to get the files later on in getfile, we hide the
         # original filename in the rev part of the return value.
-        changes, copies = self.base.getchanges(rev, full)
+        changes, copies, cleanp2 = self.base.getchanges(rev, full)
         files = {}
+        ncleanp2 = set(cleanp2)
         for f, r in changes:
             newf = self.filemapper(f)
             if newf and (newf != f or newf not in files):
                 files[newf] = (f, r)
+                if newf != f:
+                    ncleanp2.discard(f)
         files = sorted(files.items())
 
         ncopies = {}
@@ -400,7 +403,7 @@  class filemap_source(converter_source):
                 if newsource:
                     ncopies[newc] = newsource
 
-        return files, ncopies
+        return files, ncopies, ncleanp2
 
     def getfile(self, name, rev):
         realname, realrev = rev
diff --git a/hgext/convert/git.py b/hgext/convert/git.py
--- a/hgext/convert/git.py
+++ b/hgext/convert/git.py
@@ -264,7 +264,7 @@  class convert_git(converter_source):
             else:
                 self.retrievegitmodules(version)
                 changes.append(('.hgsubstate', ''))
-        return (changes, copies)
+        return (changes, copies, set())
 
     def getcommit(self, version):
         c = self.catfile(version, "commit") # read the commit hash
diff --git a/hgext/convert/gnuarch.py b/hgext/convert/gnuarch.py
--- a/hgext/convert/gnuarch.py
+++ b/hgext/convert/gnuarch.py
@@ -171,7 +171,7 @@  class gnuarch_source(converter_source, c
             copies.update(cps)
 
         self.lastrev = rev
-        return sorted(set(changes)), copies
+        return sorted(set(changes)), copies, set()
 
     def getcommit(self, rev):
         changes = self.changes[rev]
diff --git a/hgext/convert/hg.py b/hgext/convert/hg.py
--- a/hgext/convert/hg.py
+++ b/hgext/convert/hg.py
@@ -129,9 +129,14 @@  class mercurial_sink(converter_sink):
             fp.write('%s %s\n' % (revid, s[1]))
         return fp.getvalue()
 
-    def putcommit(self, files, copies, parents, commit, source, revmap, full):
+    def putcommit(self, files, copies, parents, commit, source, revmap, full, cleanp2):
         files = dict(files)
+
         def getfilectx(repo, memctx, f):
+            if p2ctx and f in cleanp2 and f not in copies:
+                self.ui.write('reusing %s from p2\n' % f)
+                return p2ctx[f]
+
             try:
                 v = files[f]
             except KeyError:
@@ -196,6 +201,10 @@  class mercurial_sink(converter_sink):
         while parents:
             p1 = p2
             p2 = parents.pop(0)
+            if p2 == nullid:
+                p2ctx = None
+            else:
+                p2ctx = self.repo[p2]
             fileset = set(files)
             if full:
                 fileset.update(self.repo[p1])
@@ -379,9 +388,12 @@  class mercurial_source(converter_source)
         # getcopies() is also run for roots and before filtering so missing
         # revlogs are detected early
         copies = self.getcopies(ctx, parents, copyfiles)
+        cleanp2 = set()
+        if len(parents) == 2:
+            cleanp2.update(self.repo.status(parents[1].node(), ctx.node(), clean=True).clean)
         changes = [(f, rev) for f in files if f not in self.ignored]
         changes.sort()
-        return changes, copies
+        return changes, copies, cleanp2
 
     def getcopies(self, ctx, parents, files):
         copies = {}
diff --git a/hgext/convert/monotone.py b/hgext/convert/monotone.py
--- a/hgext/convert/monotone.py
+++ b/hgext/convert/monotone.py
@@ -280,7 +280,7 @@  class monotone_source(converter_source, 
             for fromfile in renamed.values():
                 files[fromfile] = rev
 
-        return (files.items(), copies)
+        return (files.items(), copies, set())
 
     def getfile(self, name, rev):
         if not self.mtnisfile(name, rev):
diff --git a/hgext/convert/p4.py b/hgext/convert/p4.py
--- a/hgext/convert/p4.py
+++ b/hgext/convert/p4.py
@@ -195,7 +195,7 @@  class p4_source(converter_source):
     def getchanges(self, rev, full):
         if full:
             raise util.Abort(_("convert from p4 do not support --full"))
-        return self.files[rev], {}
+        return self.files[rev], {}, set()
 
     def getcommit(self, rev):
         return self.changeset[rev]
diff --git a/hgext/convert/subversion.py b/hgext/convert/subversion.py
--- a/hgext/convert/subversion.py
+++ b/hgext/convert/subversion.py
@@ -474,7 +474,7 @@  class svn_source(converter_source):
             (files, copies) = self._getchanges(rev, full)
             # caller caches the result, so free it here to release memory
             del self.paths[rev]
-        return (files, copies)
+        return (files, copies, set())
 
     def getchangedfiles(self, rev, i):
         # called from filemap - cache computed values for reuse in getchanges
@@ -1229,7 +1229,8 @@  class svn_sink(converter_sink, commandli
     def revid(self, rev):
         return u"svn:%s@%s" % (self.uuid, rev)
 
-    def putcommit(self, files, copies, parents, commit, source, revmap, full):
+    def putcommit(self, files, copies, parents, commit, source, revmap, full,
+                  cleanp2):
         for parent in parents:
             try:
                 return self.revid(self.childmap[parent])
diff --git a/tests/test-convert-datesort.t b/tests/test-convert-datesort.t
--- a/tests/test-convert-datesort.t
+++ b/tests/test-convert-datesort.t
@@ -79,6 +79,7 @@  convert with datesort
   2 a7x
   1 b2x
   0 c1
+  reusing b from p2
 
 graph converted repo
 
@@ -130,6 +131,7 @@  convert with datesort (default mode)
   2 a7x
   1 b2x
   0 c1
+  reusing b from p2
 
 graph converted repo
 
@@ -181,6 +183,7 @@  convert with closesort
   2 b2x
   1 c0
   0 c1
+  reusing b from p2
 
 graph converted repo
 
diff --git a/tests/test-convert-filemap.t b/tests/test-convert-filemap.t
--- a/tests/test-convert-filemap.t
+++ b/tests/test-convert-filemap.t
@@ -108,6 +108,7 @@  Test interaction with startrev and verif
   1 7: second merge; change bar
   warning: af455ce4166b3c9c88e6309c2b9332171dcea595 parent 61e22ca76c3b3e93df20338c4e02ce286898e825 is missing
   warning: cf908b3eeedc301c9272ebae931da966d5b326c7 parent 59e1ab45c888289513b7354484dac8a88217beab is missing
+  reusing foo from p2
   0 8: change foo
 
 
@@ -169,6 +170,8 @@  splitrepo tests
   94c1be4dfde2ee8d78db8bbfcf81210813307c3d 644   baz
   $ splitrepo 'we add additional merges when they are interesting' 'foo quux'
   % foo quux: we add additional merges when they are interesting
+  reusing quux from p2
+  reusing foo from p2
   @  8 "8: change foo" files: foo
   |
   o    7 "7: second merge; change bar" files:
@@ -655,6 +658,7 @@  test named branch pruning
   2 changea
   1 changeb
   0 merge
+  reusing a from p2
   $ glog -R namedbranch
   @    3:e1959de76e1b@foo "merge" files:
   |\
diff --git a/tests/test-convert-hg-sink.t b/tests/test-convert-hg-sink.t
--- a/tests/test-convert-hg-sink.t
+++ b/tests/test-convert-hg-sink.t
@@ -204,6 +204,7 @@  Convert excluding rev 0 and dir/ (and th
   3 3: copy a to e, change b
   2 4: change a
   1 5: merge 2 and 3, copy b to dir/d
+  reusing b from p2
   0 6: change a
 
 Verify that conversion skipped rev 2:
diff --git a/tests/test-convert-hg-source.t b/tests/test-convert-hg-source.t
--- a/tests/test-convert-hg-source.t
+++ b/tests/test-convert-hg-source.t
@@ -54,6 +54,7 @@  Test that template can print all file co
   4 change foo
   3 make bar and baz copies of foo
   2 merge local copy
+  reusing foo from p2
   1 merge remote copy
   0 mark baz executable
   updating bookmarks
@@ -135,6 +136,7 @@  break it
   3 changeall
   2 changebagain
   1 merge
+  reusing a from p2
   0 moveb
   $ hg -R fixed verify
   checking changesets
diff --git a/tests/test-convert-hg-startrev.t b/tests/test-convert-hg-startrev.t
--- a/tests/test-convert-hg-startrev.t
+++ b/tests/test-convert-hg-startrev.t
@@ -49,6 +49,9 @@  Convert from null revision
   3 2: copy e from a, change b
   2 3: change a
   1 4: merge 2 and 3
+  reusing b from p2
+  reusing c from p2
+  reusing d from p2
   0 5: change a
 
   $ glog full
@@ -78,6 +81,9 @@  Convert from zero revision
   3 2: copy e from a, change b
   2 3: change a
   1 4: merge 2 and 3
+  reusing b from p2
+  reusing c from p2
+  reusing d from p2
   0 5: change a
 
   $ glog full