Patchwork [4,of,4] revlog: add an aggressivemergedelta option

login
register
mail settings
Submitter Durham Goode
Date Aug. 30, 2015, 10:09 p.m.
Message ID <ffaccbf9565ec5f73ded.1440972595@dev2000.prn2.facebook.com>
Download mbox | patch
Permalink /patch/10327/
State Accepted
Headers show

Comments

Durham Goode - Aug. 30, 2015, 10:09 p.m.
# HG changeset patch
# User Durham Goode <durham@fb.com>
# Date 1440968612 25200
#      Sun Aug 30 14:03:32 2015 -0700
# Node ID ffaccbf9565ec5f73ded2ae3f9f611c943baad17
# Parent  03338124de4c33848974b5a512d4e6ef71b4424b
revlog: add an aggressivemergedelta option

This adds an option for delta'ing against both p1 and p2 when applying merge
revisions and picking whichever is smallest.

Some before and after stats on manifest.d size:

internal large repo:
before: 1.2 GB
after: 930 MB

mozilla-central:
before: 261 MB
after: 92 MB
Matt Mackall - Aug. 31, 2015, 10:48 p.m.
On Sun, 2015-08-30 at 15:09 -0700, Durham Goode wrote:
> # HG changeset patch
> # User Durham Goode <durham@fb.com>
> # Date 1440968612 25200
> #      Sun Aug 30 14:03:32 2015 -0700
> # Node ID ffaccbf9565ec5f73ded2ae3f9f611c943baad17
> # Parent  03338124de4c33848974b5a512d4e6ef71b4424b
> revlog: add an aggressivemergedelta option
> 
> This adds an option for delta'ing against both p1 and p2 when applying merge
> revisions and picking whichever is smallest.

I've queued all four of these, thanks.

> Some before and after stats on manifest.d size:
> 
> internal large repo:
> before: 1.2 GB
> after: 930 MB
> 
> mozilla-central:
> before: 261 MB
> after: 92 MB

This is big enough that I'd probably consider just enabling it all the
time. Two considerations here:

- bandwidth is typically scarcer than CPU
- it's a one-time cost at commit, rather than an on-going cost
Durham Goode - Aug. 31, 2015, 10:53 p.m.
On 8/31/15 3:48 PM, Matt Mackall wrote:
> On Sun, 2015-08-30 at 15:09 -0700, Durham Goode wrote:
>> # HG changeset patch
>> # User Durham Goode <durham@fb.com>
>> # Date 1440968612 25200
>> #      Sun Aug 30 14:03:32 2015 -0700
>> # Node ID ffaccbf9565ec5f73ded2ae3f9f611c943baad17
>> # Parent  03338124de4c33848974b5a512d4e6ef71b4424b
>> revlog: add an aggressivemergedelta option
>>
>> This adds an option for delta'ing against both p1 and p2 when applying merge
>> revisions and picking whichever is smallest.
> I've queued all four of these, thanks.
>
>> Some before and after stats on manifest.d size:
>>
>> internal large repo:
>> before: 1.2 GB
>> after: 930 MB
>>
>> mozilla-central:
>> before: 261 MB
>> after: 92 MB
> This is big enough that I'd probably consider just enabling it all the
> time. Two considerations here:
>
> - bandwidth is typically scarcer than CPU
> - it's a one-time cost at commit, rather than an on-going cost
>
It's not actually a one-time cost though.  A user pulling the merge 
commit will build a delta against p1 (hopefully the downloaded delta 
provides this, so it's free), but then he will also compute the delta 
for p2, which involves decompressing the manifest, etc.  So pulling N 
merge commits will incur at least N extra manifest decompressions.

We could probably make it smarter by saying 'if the server gave me a p1 
delta, don't even bother with p2 unless we're going to result in a 
fulltext'.
Gregory Szorc - Aug. 31, 2015, 10:56 p.m.
On Mon, Aug 31, 2015 at 6:53 PM, Durham Goode <durham@fb.com> wrote:

>
>
> On 8/31/15 3:48 PM, Matt Mackall wrote:
>
>> On Sun, 2015-08-30 at 15:09 -0700, Durham Goode wrote:
>>
>>> # HG changeset patch
>>> # User Durham Goode <durham@fb.com>
>>> # Date 1440968612 25200
>>> #      Sun Aug 30 14:03:32 2015 -0700
>>> # Node ID ffaccbf9565ec5f73ded2ae3f9f611c943baad17
>>> # Parent  03338124de4c33848974b5a512d4e6ef71b4424b
>>> revlog: add an aggressivemergedelta option
>>>
>>> This adds an option for delta'ing against both p1 and p2 when applying
>>> merge
>>> revisions and picking whichever is smallest.
>>>
>> I've queued all four of these, thanks.
>>
>> Some before and after stats on manifest.d size:
>>>
>>> internal large repo:
>>> before: 1.2 GB
>>> after: 930 MB
>>>
>>> mozilla-central:
>>> before: 261 MB
>>> after: 92 MB
>>>
>> This is big enough that I'd probably consider just enabling it all the
>> time. Two considerations here:
>>
>> - bandwidth is typically scarcer than CPU
>> - it's a one-time cost at commit, rather than an on-going cost
>>
>> It's not actually a one-time cost though.  A user pulling the merge
> commit will build a delta against p1 (hopefully the downloaded delta
> provides this, so it's free), but then he will also compute the delta for
> p2, which involves decompressing the manifest, etc.  So pulling N merge
> commits will incur at least N extra manifest decompressions.
>
> We could probably make it smarter by saying 'if the server gave me a p1
> delta, don't even bother with p2 unless we're going to result in a
> fulltext'.
>

Could bundle2 advertise which delta algorithm is in play so clients can
fast path if the local store agrees with the remote stream?
Matt Mackall - Aug. 31, 2015, 11:11 p.m.
On Mon, 2015-08-31 at 15:53 -0700, Durham Goode wrote:
> 
> On 8/31/15 3:48 PM, Matt Mackall wrote:
> > On Sun, 2015-08-30 at 15:09 -0700, Durham Goode wrote:
> >> # HG changeset patch
> >> # User Durham Goode <durham@fb.com>
> >> # Date 1440968612 25200
> >> #      Sun Aug 30 14:03:32 2015 -0700
> >> # Node ID ffaccbf9565ec5f73ded2ae3f9f611c943baad17
> >> # Parent  03338124de4c33848974b5a512d4e6ef71b4424b
> >> revlog: add an aggressivemergedelta option
> >>
> >> This adds an option for delta'ing against both p1 and p2 when applying merge
> >> revisions and picking whichever is smallest.
> > I've queued all four of these, thanks.
> >
> >> Some before and after stats on manifest.d size:
> >>
> >> internal large repo:
> >> before: 1.2 GB
> >> after: 930 MB
> >>
> >> mozilla-central:
> >> before: 261 MB
> >> after: 92 MB
> > This is big enough that I'd probably consider just enabling it all the
> > time. Two considerations here:
> >
> > - bandwidth is typically scarcer than CPU
> > - it's a one-time cost at commit, rather than an on-going cost
> >
> It's not actually a one-time cost though.  A user pulling the merge 
> commit will build a delta against p1 (hopefully the downloaded delta 
> provides this, so it's free), but then he will also compute the delta 
> for p2, which involves decompressing the manifest, etc.  So pulling N 
> merge commits will incur at least N extra manifest decompressions.

Ahh, right.

> We could probably make it smarter by saying 'if the server gave me a p1 
> delta, don't even bother with p2 unless we're going to result in a 
> fulltext'.

Or else we could say "I've observed the server pick the modern choice
instead of the old choice correctly, going to trust it for the rest of
the pull". Might be too complex.
Gregory Szorc - Aug. 31, 2015, 11:12 p.m.
On Mon, Aug 31, 2015 at 6:48 PM, Matt Mackall <mpm@selenic.com> wrote:

> On Sun, 2015-08-30 at 15:09 -0700, Durham Goode wrote:
> > # HG changeset patch
> > # User Durham Goode <durham@fb.com>
> > # Date 1440968612 25200
> > #      Sun Aug 30 14:03:32 2015 -0700
> > # Node ID ffaccbf9565ec5f73ded2ae3f9f611c943baad17
> > # Parent  03338124de4c33848974b5a512d4e6ef71b4424b
> > revlog: add an aggressivemergedelta option
> >
> > This adds an option for delta'ing against both p1 and p2 when applying
> merge
> > revisions and picking whichever is smallest.
>
> I've queued all four of these, thanks.
>
> > Some before and after stats on manifest.d size:
> >
> > internal large repo:
> > before: 1.2 GB
> > after: 930 MB
> >
> > mozilla-central:
> > before: 261 MB
> > after: 92 MB
>
> This is big enough that I'd probably consider just enabling it all the
> time. Two considerations here:
>
> - bandwidth is typically scarcer than CPU
> - it's a one-time cost at commit, rather than an on-going cost
>

Great series!

I'm a bit surprised by the mozilla-central manifest size regression in part
3. But I understand why. It is mostly a side-effect of Mozilla's rather
arcane use of multiple heads to simultaneously accept incoming changesets
because having just one head results in too much instability (we make heads
read-only until bustage-causing changesets are backed out and N>1 heads
usually means at least 1 is accepting changesets). We end up doing merges
across ~4 heads to keep all the heads relatively in sync. I wouldn't sweat
too much over the size regression, especially since generaldelta still
comes out significantly ahead of non-gd.

I agree with mpm that making the aggressive merge the default would be
nice. It won't impact 95+% of repositories since manifests won't be large
enough for the CPU hit to be realized. That being said, a lot of people do
make aggressive use of merges these days. So I don't know.

It would be interesting to see if we can come up with a bundle2 extension
so clients can fast path apply deltas without having to compute fulltexts.
But since this option is applied at run time and not at revlog creation
time (I think), I think you would have to blindly trust that the server is
using the optimal parent. I'm sure this will be discussed as part of the
making GD default effort.

Again, great series!

Patch

diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
--- a/mercurial/localrepo.py
+++ b/mercurial/localrepo.py
@@ -354,6 +354,10 @@  class localrepository(object):
         manifestcachesize = self.ui.configint('format', 'manifestcachesize')
         if manifestcachesize is not None:
             self.svfs.options['manifestcachesize'] = manifestcachesize
+        # experimental config: format.aggressivemergedeltas
+        aggressivemergedeltas = self.ui.configbool('format',
+            'aggressivemergedeltas', False)
+        self.svfs.options['aggressivemergedeltas'] = aggressivemergedeltas
 
     def _writerequirements(self):
         scmutil.writerequires(self.vfs, self.requirements)
diff --git a/mercurial/revlog.py b/mercurial/revlog.py
--- a/mercurial/revlog.py
+++ b/mercurial/revlog.py
@@ -210,6 +210,7 @@  class revlog(object):
         self._chunkcache = (0, '')
         self._chunkcachesize = 65536
         self._maxchainlen = None
+        self._aggressivemergedeltas = False
         self.index = []
         self._pcache = {}
         self._nodecache = {nullid: nullrev}
@@ -227,6 +228,8 @@  class revlog(object):
                 self._chunkcachesize = opts['chunkcachesize']
             if 'maxchainlen' in opts:
                 self._maxchainlen = opts['maxchainlen']
+            if 'aggressivemergedeltas' in opts:
+                self._aggressivemergedeltas = opts['aggressivemergedeltas']
 
         if self._chunkcachesize <= 0:
             raise RevlogError(_('revlog chunk cache size %r is not greater '
@@ -1343,15 +1346,34 @@  class revlog(object):
         # should we try to build a delta?
         if prev != nullrev:
             if self._generaldelta:
-                # Pick whichever parent is closer to us (to minimize the
-                # chance of having to build a fulltext). Since
-                # nullrev == -1, any non-merge commit will always pick p1r.
-                drev = p2r if p2r > p1r else p1r
-                d = builddelta(drev)
-                # If the chosen delta will result in us making a full text,
-                # give it one last try against prev.
-                if drev != prev and not self._isgooddelta(d, textlen):
-                    d = builddelta(prev)
+                if p2r != nullrev and self._aggressivemergedeltas:
+                    d = builddelta(p1r)
+                    d2 = builddelta(p2r)
+                    p1good = self._isgooddelta(d, textlen)
+                    p2good = self._isgooddelta(d2, textlen)
+                    if p1good and p2good:
+                        # If both are good deltas, choose the smallest
+                        if d2[1] < d[1]:
+                            d = d2
+                    elif p2good:
+                        # If only p2 is good, use it
+                        d = d2
+                    elif p1good:
+                        pass
+                    else:
+                        # Neither is good, try against prev to hopefully save us
+                        # a fulltext.
+                        d = builddelta(prev)
+                else:
+                    # Pick whichever parent is closer to us (to minimize the
+                    # chance of having to build a fulltext). Since
+                    # nullrev == -1, any non-merge commit will always pick p1r.
+                    drev = p2r if p2r > p1r else p1r
+                    d = builddelta(drev)
+                    # If the chosen delta will result in us making a full text,
+                    # give it one last try against prev.
+                    if drev != prev and not self._isgooddelta(d, textlen):
+                        d = builddelta(prev)
             else:
                 d = builddelta(prev)
             dist, l, data, base, chainbase, chainlen, compresseddeltalen = d
diff --git a/tests/test-generaldelta.t b/tests/test-generaldelta.t
--- a/tests/test-generaldelta.t
+++ b/tests/test-generaldelta.t
@@ -69,3 +69,37 @@  commit.
      rev    offset  length   base linkrev nodeid       p1           p2
        0         0       3      0       1 1406e7411862 000000000000 000000000000
 
+  $ cd ..
+
+Test format.aggressivemergedeltas
+
+  $ hg init --config format.generaldelta=1 aggressive
+  $ cd aggressive
+  $ touch a b c d e
+  $ hg commit -Aqm side1
+  $ hg up -q null
+  $ touch x y
+  $ hg commit -Aqm side2
+
+- Verify non-aggressive merge uses p1 (commit 1) as delta parent
+  $ hg merge -q 0
+  $ hg commit -q -m merge
+  $ hg debugindex -m
+     rev    offset  length  delta linkrev nodeid       p1           p2
+       0         0      59     -1       0 8dde941edb6e 000000000000 000000000000
+       1        59      59     -1       1 315c023f341d 000000000000 000000000000
+       2       118      65      1       2 2ab389a983eb 315c023f341d 8dde941edb6e
+
+  $ hg strip -q -r . --config extensions.strip=
+
+- Verify aggressive merge uses p2 (commit 0) as delta parent
+  $ hg up -q -C 1
+  $ hg merge -q 0
+  $ hg commit -q -m merge --config format.aggressivemergedeltas=True
+  $ hg debugindex -m
+     rev    offset  length  delta linkrev nodeid       p1           p2
+       0         0      59     -1       0 8dde941edb6e 000000000000 000000000000
+       1        59      59     -1       1 315c023f341d 000000000000 000000000000
+       2       118      62      0       2 2ab389a983eb 315c023f341d 8dde941edb6e
+
+  $ cd ..