Patchwork [4,of,4] changegroup: compute seen files as changesets are added (issue4750)

login
register
mail settings
Submitter Gregory Szorc
Date July 10, 2015, 12:09 a.m.
Message ID <e86c10381256d2996583.1436486941@gps-mbp.local>
Download mbox | patch
Permalink /patch/9948/
State Superseded
Commit 2406e2baa93787ec2dfa12b0f820214a41b559a0
Headers show

Comments

Gregory Szorc - July 10, 2015, 12:09 a.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1436485900 25200
#      Thu Jul 09 16:51:40 2015 -0700
# Node ID e86c10381256d2996583f7978c9b1b7f636ec1d8
# Parent  c9abd93973708d02d7a945cbddaba9b420a18cfa
changegroup: compute seen files as changesets are added (issue4750)

Before this patch, addchangegroup() would walk the changelog and compute
the set of seen files between applying changesets and applying
manifests. When cloning large repositories such as mozilla-central,
this consumed a non-trivial amount of time. On my MBP, this walk takes
~10s. On a dainty EC2 instance, this was measured to take ~125s! On the
latter machine, this delay was enough for the Mercurial server to
disconnect the client, thinking it had timed out, thus causing a clone
to abort.

This patch enables the changelog to compute the set of changed files as
new revisions are added. By doing so, we:

* avoid a potentially heavy computation between changelog and manifest
  processing by spreading the computation across all changelog additions
* avoid extra reads from the changelog by operating on the data as it is
  added

On my MBP, the total CPU times for an `hg unbundle` with a local
mozilla-central gzip bundle containing 251,934 changesets and 211,065
files are as follows:

before: 360.1s
after:  359.0s

While the new time does appear to be a bit faster, I think this is
within the margin of error.

In addition, there is no longer a visible pause between applying
changeset and manifest data. Before, it sure felt like Mercurial was
lethargic making this transition. Now, the transition is nearly
instantaneous, giving the impression that Mercurial is faster. Of course,
eliminating this pause means that the potential for network disconnect due
to channel inactivity during the changelog walk is eliminated as well.
Matt Mackall - July 10, 2015, 10:17 p.m.
On Thu, 2015-07-09 at 17:09 -0700, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1436485900 25200
> #      Thu Jul 09 16:51:40 2015 -0700
> # Node ID e86c10381256d2996583f7978c9b1b7f636ec1d8
> # Parent  c9abd93973708d02d7a945cbddaba9b420a18cfa
> changegroup: compute seen files as changesets are added (issue4750)

This all looks pretty good, but it seems like it'd be much tidier as a
callback on addgroup:

seen = set()
def mycallback(revlog, rev, text):
   seen.update(changelogmod.parse(text)[3]) # instead of _newchangelog

srccontent = cl.addgroup(source, csmap, trp, mycallback)

Also, I'm not sure it's not just easier to fetch a revision right after
addrevision rather than have addrevision have a weird calling pattern.
After all, we cache the latest text when it's available, right?

So the above becomes:

def mycallback(cl, rev):
   seen.update(cl.read(rev)[3])

..and a lot of extra mechanics goes away?
Gregory Szorc - July 10, 2015, 11:33 p.m.
On Fri, Jul 10, 2015 at 3:17 PM, Matt Mackall <mpm@selenic.com> wrote:

> On Thu, 2015-07-09 at 17:09 -0700, Gregory Szorc wrote:
> > # HG changeset patch
> > # User Gregory Szorc <gregory.szorc@gmail.com>
> > # Date 1436485900 25200
> > #      Thu Jul 09 16:51:40 2015 -0700
> > # Node ID e86c10381256d2996583f7978c9b1b7f636ec1d8
> > # Parent  c9abd93973708d02d7a945cbddaba9b420a18cfa
> > changegroup: compute seen files as changesets are added (issue4750)
>
> This all looks pretty good, but it seems like it'd be much tidier as a
> callback on addgroup:
>
> seen = set()
> def mycallback(revlog, rev, text):
>    seen.update(changelogmod.parse(text)[3]) # instead of _newchangelog
>
> srccontent = cl.addgroup(source, csmap, trp, mycallback)
>

I like this idea and will incorporate it.


>
> Also, I'm not sure it's not just easier to fetch a revision right after
> addrevision rather than have addrevision have a weird calling pattern.
> After all, we cache the latest text when it's available, right?
>
> So the above becomes:
>
> def mycallback(cl, rev):
>    seen.update(cl.read(rev)[3])
>
> ..and a lot of extra mechanics goes away?
>
>
I initially tried this or something like this. However, I was getting
errors due to some weird I/O issue that kinda seemed like something was
trying to read from the revlog or index before data had been written or
flushed or something. But I can't reproduce it right now. So maybe I was
just doing things wrong. It took a few iterations of this to make it happy.
I'll try your approach and see what happens.

Patch

diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py
--- a/mercurial/changegroup.py
+++ b/mercurial/changegroup.py
@@ -719,9 +719,8 @@  def addchangegroup(repo, source, srctype
     if not source:
         return 0
 
     changesets = files = revisions = 0
-    efiles = set()
 
     tr = repo.transaction("\n".join([srctype, util.hidepassword(url)]))
     # The transaction could have been created before and already carries source
     # information. In this case we use the top level data. We overwrite the
@@ -753,16 +752,19 @@  def addchangegroup(repo, source, srctype
                 self._count += 1
         source.callback = prog(_('changesets'), expectedtotal)
 
         source.changelogheader()
-        srccontent = cl.addgroup(source, csmap, trp)
+        try:
+            cl.seenfiles = set()
+            srccontent = cl.addgroup(source, csmap, trp)
+            efiles = len(cl.seenfiles)
+        finally:
+            cl.seenfiles = None
+
         if not (srccontent or emptyok):
             raise util.Abort(_("received changelog group is empty"))
         clend = len(cl)
         changesets = clend - clstart
-        for c in xrange(clstart, clend):
-            efiles.update(repo[c].files())
-        efiles = len(efiles)
         repo.ui.progress(_('changesets'), None)
 
         # pull off the manifest group
         repo.ui.status(_("adding manifests\n"))