Patchwork [RFC] changegroup: leave out all parent file revisions when creating bundles

login
register
mail settings
Submitter Mads Kiilerich
Date Oct. 11, 2019, 5:57 p.m.
Message ID <72d12ee773795edc163f.1570816661@madski>
Download mbox | patch
Permalink /patch/42239/
State New
Headers show

Comments

Mads Kiilerich - Oct. 11, 2019, 5:57 p.m.
# HG changeset patch
# User Mads Kiilerich <madski@unity3d.com>
# Date 1570804632 -7200
#      Fri Oct 11 16:37:12 2019 +0200
# Node ID 72d12ee773795edc163f73b9160e5d29022878dd
# Parent  52781d57313d512efb7150603104bea3ca11d0eb
changegroup: leave out all parent file revisions when creating bundles

When creating a bundle, we get the revisions to bundle, and also the set of
revisions that have been discovered to be common between sender and receiver of
the bundle - which by definition will include all the ancestor revisions. File
revisions where linkrev points at common revisions will be pruned from the
bundle. The common revisions can be represented lazy and efficiently.

That is making the optimistic assumption that the linkrev revision is the
common one. Sometimes it isn't. In that case, the bundle will contain file
revisions that already are common. The increased size of the bundle might be a
problem, but it also makes the bundle size a useless approximation of
information content, and of how much it actually will increase repository size.

The linkrev based pruning does, however, have the advantage that it can leave
out common file revisions, even without being ancestors. Such aggressive
pruning will create bundles that only apply in a specific context, and
effectively make them depend on additional parents than the actual bundle
parents. It is debatable if that is a good feature and how much it actually
gains, but let's leave that out of scope and keep it for now.

We want to avoid that bundles contain filelog entries that are present in
parent revisions of the bundle content. We will do that by enumerating the
files that seem to be changed anywhere in the bundled set, find all parent
revisions of the bundled changesets, for the changed files find the set of file
revisions in the parent changesets, and leave these out when bundling.

To avoid visiting all base revisions for each file, we start by collecting
revisions of all touched files in all parents and keep that in memory. That
will be O(files_changed_in_set * revisions) ... but while bundled changesets
might be complex internally and all could have two unique parents outside the
bundle, they rarely have a lot of parents outside the set. It is also rare to
change a lot of files ... but even then it seems reasonable to be able to hold
the names of all files and some hashes in memory.) (Alternatively, we could
keep all manifests around and look files up on demand ... but manifests might
be big and changes are usually sparse, so it seems more efficient to compute
the sets upfront.)

Testing with the current verbose logging shows where this change makes a
difference ... including a few "because parents, not linkrev" places where this
change will have a positive impact. The changes generally seem reasonable to me
...

TODO:
gather feedback
more testing
investigate the
test-remotefilelog-bgprefetch.t instability
reduce the current logging
Mads Kiilerich - Oct. 14, 2019, 4:24 p.m.
On 10/11/19 7:57 PM, Mads Kiilerich wrote:
> # HG changeset patch
> # User Mads Kiilerich <madski@unity3d.com>
> # Date 1570804632 -7200
> #      Fri Oct 11 16:37:12 2019 +0200
> # Node ID 72d12ee773795edc163f73b9160e5d29022878dd
> # Parent  52781d57313d512efb7150603104bea3ca11d0eb
> changegroup: leave out all parent file revisions when creating bundles

I'm not sure the analysis nailed the root cause. It is tricky to 
reproduce good test cases.

But one problem we could fix by pruning ancestors instead of using linkrevs:

Create a repo with aliasing:

   $ hg init repo1
   $ cd repo1
   $ touch f
   $ hg ci -Aqm 0
   $ echo 1 > f
   $ hg ci -m 1f1
   $ hg up -cqr 0
   $ hg branch -q b
   $ echo 1 > f  # linkrev aliasing to rev 1
   $ hg ci -m 2f1

When bundling rev 2 for a repo that has rev 1, f will be skipped even 
though it
isn't an ancestor:

   $ hg bundle -v -r 2 --base 1 bundle.hg
   1 changesets found
   uncompressed size of bundle content:
        185 (changelog)
        163 (manifests)

A bundle with missing ancestor revisions would fail unbundling with "abort:
00changelog.i@d681519c3ea7: unknown parent!". But if using 
fastpathlinkrev and
the ancestors are present but the file revs are not, the bundle can be 
applied
but will break on use:

   $ hg clone -qU . -r 0 repo2
   $ hg -R repo2 pull bundle.hg
   pulling from bundle.hg
   searching for changes
   adding changesets
   adding manifests
   adding file changes
   added 1 changesets with 0 changes to 0 files
   new changesets 5e690c649d09 (1 drafts)
   (run 'hg update' to get a working copy)
   $ hg -R repo2 up -r tip
   abort: data/f.i@d0c79e1d3309: no match found!
   [255]


/Mads

Patch

diff --git a/hgext/remotefilelog/shallowbundle.py b/hgext/remotefilelog/shallowbundle.py
--- a/hgext/remotefilelog/shallowbundle.py
+++ b/hgext/remotefilelog/shallowbundle.py
@@ -71,7 +71,8 @@  class shallowcg1packer(changegroup.cgpac
         try:
             linknodes, commonrevs, source = args
         except ValueError:
-            commonrevs, source, mfdicts, fastpathlinkrev, fnodes, clrevs = args
+            (commonrevs, source, mfdicts, fastpathlinkrev, fnodes, clrevs,
+             setparents) = args
         if shallowutil.isenabled(self._repo):
             repo = self._repo
             if isinstance(repo, bundlerepo.bundlerepository):
diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py
--- a/mercurial/changegroup.py
+++ b/mercurial/changegroup.py
@@ -931,6 +931,9 @@  class cgpacker(object):
         clrevorder = clstate[b'clrevorder']
         manifests = clstate[b'manifests']
         changedfiles = clstate[b'changedfiles']
+        setparents = sorted(set(clstate[b'parentrevs'])
+                            .difference(clstate[b'changerevs'])
+                            .difference([-1]))
 
         # We need to make sure that the linkrev in the changegroup refers to
         # the first changeset that introduced the manifest or file revision.
@@ -1005,6 +1008,7 @@  class cgpacker(object):
             fastpathlinkrev,
             fnodes,
             clrevs,
+            setparents,
         )
 
         for path, deltas in it:
@@ -1044,12 +1048,16 @@  class cgpacker(object):
         mfl = self._repo.manifestlog
         changedfiles = set()
         clrevtomanifestrev = {}
+        changerevs = set()
+        parentrevs = set()
 
         state = {
             b'clrevorder': clrevorder,
             b'manifests': manifests,
             b'changedfiles': changedfiles,
             b'clrevtomanifestrev': clrevtomanifestrev,
+            b'changerevs': changerevs,
+            b'parentrevs': parentrevs,
         }
 
         if not (generate or self._ellipses):
@@ -1106,6 +1114,9 @@  class cgpacker(object):
                 # Record a complete list of potentially-changed files in
                 # this manifest.
                 changedfiles.update(c.files)
+                rev = cl.rev(x)
+                changerevs.add(rev)
+                parentrevs.update(cl.parentrevs(rev))
 
             return x
 
@@ -1263,12 +1274,18 @@  class cgpacker(object):
         fastpathlinkrev,
         fnodes,
         clrevs,
+        setparents,
     ):
         changedfiles = [
             f
             for f in changedfiles
             if self._matcher(f) and not self._oldmatcher(f)
         ]
+        skipfilenodes = {f: set() for f in changedfiles}
+        for node in setparents:
+            mf = self._repo[node].manifest()
+            for f in changedfiles:
+                skipfilenodes[f].add(mf.get(f))
 
         if not fastpathlinkrev:
 
@@ -1339,8 +1356,19 @@  class cgpacker(object):
             # has. This avoids over-sending files relatively
             # inexpensively, so it's not a problem if we under-filter
             # here.
+            for n in linkrevnodes:
+                if (n in skipfilenodes[fname] and
+                    flr(frev(n)) not in commonrevs
+                ):
+                    repo.ui.error(_(b"skipping %s rev %s because parents, not linkrev\n") % (fname, short(n)))
+                if (n not in skipfilenodes[fname] and
+                    flr(frev(n)) in commonrevs
+                ):
+                    repo.ui.error(_(b"skipping %s rev %s because linkrev, not parents\n") % (fname, short(n)))
             filenodes = [
-                n for n in linkrevnodes if flr(frev(n)) not in commonrevs
+                n for n in linkrevnodes
+                if n not in skipfilenodes[fname]
+                and flr(frev(n)) not in commonrevs
             ]
 
             if not filenodes:
diff --git a/tests/test-bundle-r.t b/tests/test-bundle-r.t
--- a/tests/test-bundle-r.t
+++ b/tests/test-bundle-r.t
@@ -184,6 +184,8 @@  should fail
   2 changesets found
   $ hg -R test bundle --base 2 -r 7 test-bundle-branch2.hg
   4 changesets found
+  skipping afile rev 4c982badb186 because linkrev, not parents
+  skipping afile rev 125144f7e028 because linkrev, not parents
   $ hg -R test bundle --base 2 test-bundle-all.hg
   6 changesets found
   $ hg -R test bundle --base 2 --all test-bundle-all-2.hg
diff --git a/tests/test-bundle.t b/tests/test-bundle.t
--- a/tests/test-bundle.t
+++ b/tests/test-bundle.t
@@ -803,6 +803,7 @@  bundle single branch
   $ hg in -R part --bundle incoming.hg --template "{node}\n" .
   comparing with .
   searching for changes
+  skipping x rev d98b3f566194 because linkrev, not parents
   1a38c1b849e8b70c756d2d80b0b9a3ac0b7ea11a
   057f4db07f61970e1c11e83be79e9d08adc4dc31
 
diff --git a/tests/test-histedit-bookmark-motion.t b/tests/test-histedit-bookmark-motion.t
--- a/tests/test-histedit-bookmark-motion.t
+++ b/tests/test-histedit-bookmark-motion.t
@@ -88,6 +88,7 @@ 
   > fold e860deea161a 4 e
   > pick 652413bf663e 5 f
   > EOF
+  skipping c rev 149da44f2a4e because parents, not linkrev
   saved backup bundle to $TESTTMP/r/.hg/strip-backup/96e494a2d553-45c027ab-histedit.hg
   $ hg log --graph
   @  changeset:   3:cacdfd884a93
diff --git a/tests/test-histedit-fold.t b/tests/test-histedit-fold.t
--- a/tests/test-histedit-fold.t
+++ b/tests/test-histedit-fold.t
@@ -63,6 +63,7 @@  log before edit
   > fold ff2c9fa2018b c
   > pick 532247a8969b d
   > EOF
+  skipping e rev 6b67ccefd5ce because parents, not linkrev
 
 log after edit
   $ hg logt --graph
diff --git a/tests/test-push-race.t b/tests/test-push-race.t
--- a/tests/test-push-race.t
+++ b/tests/test-push-race.t
@@ -1270,6 +1270,7 @@  Pushing
   $ hg -R client-other push -fr 'tip' --new-branch
   pushing to ssh://user@dummy/server
   searching for changes
+  skipping a rev 26657918ef2d because linkrev, not parents
   remote: adding changesets
   remote: adding manifests
   remote: adding file changes
@@ -1620,6 +1621,7 @@  Pushing
   $ hg -R client-other push -fr 'tip' --new-branch
   pushing to ssh://user@dummy/server
   searching for changes
+  skipping a rev 26657918ef2d because linkrev, not parents
   remote: adding changesets
   remote: adding manifests
   remote: adding file changes
diff --git a/tests/test-removeemptydirs.t b/tests/test-removeemptydirs.t
--- a/tests/test-removeemptydirs.t
+++ b/tests/test-removeemptydirs.t
@@ -253,6 +253,7 @@  Histedit doing 'pick, pick, fold':
   > pick ff70a87b588f 0 add foo
   > fold 9992bb0ac0db 2 add baz
   > EOF
+  skipping bar/bar rev b80de5d13875 because parents, not linkrev
   saved backup bundle to $TESTTMP/issue5826_withrm/issue5826_norm/.hg/strip-backup/5c806432464a-cd4c8d86-histedit.hg
 
 Note the lack of a 'cd' being necessary here, and we don't need to 'histedit
diff --git a/tests/test-ssh-clone-r.t b/tests/test-ssh-clone-r.t
--- a/tests/test-ssh-clone-r.t
+++ b/tests/test-ssh-clone-r.t
@@ -163,6 +163,7 @@  clone remote via stream
   $ hg pull -e "\"$PYTHON\" \"$TESTDIR/dummyssh\"" -r 4 ssh://user@dummy/remote
   pulling from ssh://user@dummy/remote
   searching for changes
+  remote: skipping afile rev 125144f7e028 because linkrev, not parents
   adding changesets
   adding manifests
   adding file changes
@@ -189,6 +190,8 @@  clone remote via stream
   $ hg pull -e "\"$PYTHON\" \"$TESTDIR/dummyssh\"" -r 5 ssh://user@dummy/remote
   pulling from ssh://user@dummy/remote
   searching for changes
+  remote: skipping afile rev 4c982badb186 because linkrev, not parents
+  remote: skipping afile rev 125144f7e028 because linkrev, not parents
   adding changesets
   adding manifests
   adding file changes
diff --git a/tests/test-strip.t b/tests/test-strip.t
--- a/tests/test-strip.t
+++ b/tests/test-strip.t
@@ -461,6 +461,7 @@  Failed hook while applying "saveheads" b
 2 different branches: 2 strips
 
   $ hg strip 2 4
+  skipping bar rev 2ee5932da385 because parents, not linkrev
   saved backup bundle to $TESTTMP/test/.hg/strip-backup/*-backup.hg (glob)
   $ hg log -G
   o  changeset:   2:65bd5f99a4a3
diff --git a/tests/test-treemanifest.t b/tests/test-treemanifest.t
--- a/tests/test-treemanifest.t
+++ b/tests/test-treemanifest.t
@@ -881,6 +881,7 @@  other branch
   $ hg pull -r 2
   pulling from $TESTTMP/grafted-dir-repo
   searching for changes
+  skipping dir/file rev bc7ebe2d260c because linkrev, not parents
   adding changesets
   adding manifests
   adding file changes