Patchwork D7665: dirstate: when calling rebuild(), avoid some N^2 codepaths

login
register
mail settings
Submitter phabricator
Date Dec. 13, 2019, 11:58 p.m.
Message ID <differential-rev-PHID-DREV-etc5myaxrltruk5lsb67-req@mercurial-scm.org>
Download mbox | patch
Permalink /patch/43830/
State Superseded
Headers show

Comments

phabricator - Dec. 13, 2019, 11:58 p.m.
spectral created this revision.
Herald added a subscriber: mercurial-devel.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  I had a user repo with 200k files in it. Calling `hg debugrebuilddirstate` took
  tens of minutes (I didn't wait for it). In that situation,
  changedfiles==allfiles, and both are lists. This meant that we had to run an
  average of 100k comparisons, for each of 200k files, just to check whether a
  file needed to have normallookup called (it always did), or drop.
  
  While it's probably not a huge issue, in my very awkward synthetic benchmark I
  wrote (not using a benchmark library or anything), I was seeing some slowdowns
  for small-changedfiles and very-large-allfiles invocations, with an inflection
  somewhere around 10 items in changedfiles (regardless of the size of allfiles);
  above 10 items in changedfiles, the new code appears to always be faster. For
  the case of 50k files in changedfiles and the same items in allfiles, I'm seeing
  differences of 15s of just running comparisons vs. 0.003793s. I haven't bothered
  to run a comparison of 200k items in changedfiles and allfiles. :)

REPOSITORY
  rHG Mercurial

BRANCH
  default

REVISION DETAIL
  https://phab.mercurial-scm.org/D7665

AFFECTED FILES
  mercurial/dirstate.py

CHANGE DETAILS




To: spectral, #hg-reviewers
Cc: mercurial-devel
phabricator - Dec. 14, 2019, 12:09 a.m.
martinvonz added inline comments.

INLINE COMMENTS

> dirstate.py:611-620
> +        elif len(changedfiles) < 10:
> +            # Avoid turning allfiles into a set, which can be expensive if it's
> +            # large.
> +            to_lookup = []
> +            to_drop = []
> +            for f in changedfiles:
> +                if f in allfiles:

How slow? Specifically, how much slower (in percent, or dB, or whatever) is it compared to not converting to a set? I wonder if it's worth the code. `hg debugrebuilddirstate` should be a very rare operation.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7665/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7665

To: spectral, #hg-reviewers
Cc: martinvonz, mercurial-devel
phabricator - Dec. 14, 2019, 1:27 a.m.
spectral added inline comments.

INLINE COMMENTS

> martinvonz wrote in dirstate.py:611-620
> How slow? Specifically, how much slower (in percent, or dB, or whatever) is it compared to not converting to a set? I wonder if it's worth the code. `hg debugrebuilddirstate` should be a very rare operation.

With 1 file in changedfiles and 200k files in allfiles, approximately 15x slower. That said, it's a little over 1.5ms in the old version, so in absolute terms this isn't a big difference until you get really huge repos.  I haven't really investigated when changedfiles is not None, if it's just for `hg debugrebuilddirstate --minimal`, I agree this isn't worth the complexity, but it looks like mq, strip, and absorb all call rebuild in some cases, and strip and absorb will pass in a list of files.  Unsure if that changes things materially, though.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7665/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7665

To: spectral, #hg-reviewers
Cc: martinvonz, mercurial-devel
phabricator - Dec. 17, 2019, 12:01 a.m.
martinvonz added inline comments.

INLINE COMMENTS

> spectral wrote in dirstate.py:611-620
> With 1 file in changedfiles and 200k files in allfiles, approximately 15x slower. That said, it's a little over 1.5ms in the old version, so in absolute terms this isn't a big difference until you get really huge repos.  I haven't really investigated when changedfiles is not None, if it's just for `hg debugrebuilddirstate --minimal`, I agree this isn't worth the complexity, but it looks like mq, strip, and absorb all call rebuild in some cases, and strip and absorb will pass in a list of files.  Unsure if that changes things materially, though.

Okay, that's at least a measurable difference, so I'm fine with the extra complexity.

> dirstate.py:623
> +            changedfilesset = set(changedfiles)
> +            to_lookup = changedfilesset.intersection(set(allfiles))
> +            to_drop = changedfilesset - to_lookup

We typically use the `&` operator instead of `.intersection()`. I'll change it in flight.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7665/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7665

To: spectral, #hg-reviewers
Cc: martinvonz, mercurial-devel

Patch

diff --git a/mercurial/dirstate.py b/mercurial/dirstate.py
--- a/mercurial/dirstate.py
+++ b/mercurial/dirstate.py
@@ -603,19 +603,34 @@ 
     def rebuild(self, parent, allfiles, changedfiles=None):
         if changedfiles is None:
             # Rebuild entire dirstate
-            changedfiles = allfiles
+            to_lookup = allfiles
+            to_drop = []
             lastnormaltime = self._lastnormaltime
             self.clear()
             self._lastnormaltime = lastnormaltime
+        elif len(changedfiles) < 10:
+            # Avoid turning allfiles into a set, which can be expensive if it's
+            # large.
+            to_lookup = []
+            to_drop = []
+            for f in changedfiles:
+                if f in allfiles:
+                    to_lookup.append(f)
+                else:
+                    to_drop.append(f)
+        else:
+            changedfilesset = set(changedfiles)
+            to_lookup = changedfilesset.intersection(set(allfiles))
+            to_drop = changedfilesset - to_lookup
 
         if self._origpl is None:
             self._origpl = self._pl
         self._map.setparents(parent, nullid)
-        for f in changedfiles:
-            if f in allfiles:
-                self.normallookup(f)
-            else:
-                self.drop(f)
+
+        for f in to_lookup:
+            self.normallookup(f)
+        for f in to_drop:
+            self.drop(f)
 
         self._dirty = True