Patchwork D1224: merge: cache unknown dir checks

login
register
mail settings
Submitter phabricator
Date Oct. 24, 2017, 3:02 p.m.
Message ID <differential-rev-PHID-DREV-fkrzpd4h3tghjiwqeojx-req@phab.mercurial-scm.org>
Download mbox | patch
Permalink /patch/25276/
State Superseded
Headers show

Comments

phabricator - Oct. 24, 2017, 3:02 p.m.
mbthomas created this revision.
Herald added a subscriber: mercurial-devel.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  As mentioned in https://phab.mercurial-scm.org/D1222, the recent pathconflicts change regresses update
  performance in large repositories when many files are being updated.
  
  To mitigate this, we introduce two caches of directories that have
  already found to be either:
  
  - unknown directories, but which are not aliased by files and so don't need to be checked if they are files again; and
  - missing directores, which cannot cause path conflicts, and cannot contain a file that causes a path conflict.
  
  When checking the paths of a file, testing against this caches means we can
  skip tests that involve touching the filesystem.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1224

AFFECTED FILES
  mercurial/merge.py

CHANGE DETAILS




To: mbthomas, #hg-reviewers
Cc: mercurial-devel
phabricator - Oct. 24, 2017, 4:08 p.m.
krbullock added a comment.


  If this mitigates or fixes issue5716, please mention it in the summary line.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1224

To: mbthomas, #hg-reviewers
Cc: krbullock, mercurial-devel
phabricator - Oct. 24, 2017, 4:10 p.m.
krbullock added a comment.


  Code LGTM, but I haven't done any performance testing on this.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1224

To: mbthomas, #hg-reviewers
Cc: krbullock, mercurial-devel
phabricator - Oct. 24, 2017, 6 p.m.
mbthomas added a comment.


  There is still some cost to doing the check, so it's not a complete reversal of issue5716, but it helps a lot.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1224

To: mbthomas, #hg-reviewers
Cc: krbullock, mercurial-devel
phabricator - Oct. 24, 2017, 6:04 p.m.
sid0 added a comment.


  Could you make this a class with unknowndircache and missingdircache being private members?
  
  How much do you think writing some of this code in native code (e.g. Rust) would help?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1224

To: mbthomas, #hg-reviewers
Cc: sid0, krbullock, mercurial-devel
phabricator - Oct. 25, 2017, 9:35 a.m.
mbthomas added a comment.


  Updated as requested.
  
  The whole of `_checkunknownfiles` is probably a reasonable candidate for native code if we want to speed up `update`.  It's the second most expensive part, after actually updating the files.  The function is also pretty isolated - it accesses the dirstate (which may also be native code), and in some cases it checks the file contents on disk against the destination revision, but that's pretty rare.  It might even be possible to do some of it in parallel.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1224

To: mbthomas, #hg-reviewers
Cc: sid0, krbullock, mercurial-devel
phabricator - Oct. 30, 2017, 9:09 p.m.
durin42 added a comment.


  Is it okay to punt on this until after 4.4?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1224

To: mbthomas, #hg-reviewers
Cc: durin42, sid0, krbullock, mercurial-devel
phabricator - Nov. 2, 2017, 3:35 p.m.
mbthomas added a comment.


  Yes, fine to punt to after 4.4.  With https://phab.mercurial-scm.org/D1223 the checks are disabled, so we're back to the old behaviour for path conflicts, but the perf is ok.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1224

To: mbthomas, #hg-reviewers
Cc: durin42, sid0, krbullock, mercurial-devel
phabricator - Nov. 13, 2017, 10:53 p.m.
durin42 requested changes to this revision.
durin42 added a comment.
This revision now requires changes to proceed.


  needs rebased, otherwise looks fine

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1224

To: mbthomas, #hg-reviewers, durin42
Cc: durin42, sid0, krbullock, mercurial-devel

Patch

diff --git a/mercurial/merge.py b/mercurial/merge.py
--- a/mercurial/merge.py
+++ b/mercurial/merge.py
@@ -653,23 +653,40 @@ 
         and repo.dirstate.normalize(f) not in repo.dirstate
         and mctx[f2].cmp(wctx[f]))
 
-def _checkunknowndirs(repo, f):
+def _checkunknowndirs(repo, f, unknowndircache, missingdircache):
     """
     Look for any unknown files or directories that may have a path conflict
     with a file.  If any path prefix of the file exists as a file or link,
     then it conflicts.  If the file itself is a directory that contains any
     file that is not tracked, then it conflicts.
 
+    `unknowndircache` is a set of paths known to be good.  This prevents
+    repeated checking of dirs.  It will be updated with any new dirs that
+    are checked and found to be safe.
+
+    `missingdircache` is a set of paths that are known to be absent.  This
+    prevents repeated checking of subdirectories that are known not to exist.
+    It will be updated with any new dirs that are checked and found to be
+    absent.
+
     Returns the shortest path at which a conflict occurs, or None if there is
     no conflict.
     """
 
     # Check for path prefixes that exist as unknown files.
     for p in reversed(list(util.finddirs(f))):
-        if (repo.wvfs.audit.check(p)
-                and repo.wvfs.isfileorlink(p)
-                and repo.dirstate.normalize(p) not in repo.dirstate):
-            return p
+        if p in missingdircache:
+            return
+        if p in unknowndircache:
+            continue
+        if repo.wvfs.audit.check(p):
+            if (repo.wvfs.isfileorlink(p)
+                    and repo.dirstate.normalize(p) not in repo.dirstate):
+                return p
+            if not repo.wvfs.lexists(p):
+                missingdircache.add(p)
+                return
+            unknowndircache.add(p)
 
     # Check if the file conflicts with a directory containing unknown files.
     if repo.wvfs.audit.check(f) and repo.wvfs.isdir(f):
@@ -700,12 +717,15 @@ 
             elif config == 'warn':
                 warnconflicts.update(conflicts)
 
+        unknowndircache = set()
+        missingdircache = set()
         for f, (m, args, msg) in actions.iteritems():
             if m in ('c', 'dc'):
                 if _checkunknownfile(repo, wctx, mctx, f):
                     fileconflicts.add(f)
                 elif f not in wctx:
-                    path = _checkunknowndirs(repo, f)
+                    path = _checkunknowndirs(repo, f,
+                                             unknowndircache, missingdircache)
                     if path is not None:
                         pathconflicts.add(path)
             elif m == 'dg':