Patchwork util.fspath: use a dict rather than a linear scan for lookups

login
register
mail settings
Submitter Siddharth Agarwal
Date Oct. 27, 2014, 10:04 p.m.
Message ID <4649d2f769c1404cc9a1.1414447453@devbig136.prn2.facebook.com>
Download mbox | patch
Permalink /patch/6477/
State Accepted
Headers show

Comments

Siddharth Agarwal - Oct. 27, 2014, 10:04 p.m.
# HG changeset patch
# User Siddharth Agarwal <sid0@fb.com>
# Date 1414175979 25200
#      Fri Oct 24 11:39:39 2014 -0700
# Branch stable
# Node ID 4649d2f769c1404cc9a1cda1f16dceef4e3d2451
# Parent  24600c9d7f4eb02f10b365384ab805bae16027b7
util.fspath: use a dict rather than a linear scan for lookups

Previously, we'd scan through the entire directory listing looking for a
normalized match.  This is O(N) in the number of files in the directory. If we
decide to call util.fspath on each file in it, the overall complexity works out
to O(N^2). This becomes a problem with directories a few thousand files or
larger.

Switch to using a dictionary instead. There is a slightly higher upfront cost
to pay, but for cases like the above this is amortized O(1). Plus there is a
lower constant factor because generator comprehensions are faster than for
loops, so overall it works out to be a very small loss in performance for 1
file, and a huge gain when there's more.

For a large repo with around 200k files in it on a case-insensitive file
system, for a large directory with over 30,000 files in it, the following
command was tested:

ls | shuf -n $COUNT | xargs hg status

This command leads to util.fspath being called on $COUNT files in the
directory.

COUNT  before  after
    1   0.77s  0.78s
  100   1.42s  0.80s
 1000    6.3s  0.96s

I also tested with COUNT=10000, but before took too long so I gave up.
Matt Mackall - Oct. 27, 2014, 10:11 p.m.
On Mon, 2014-10-27 at 15:04 -0700, Siddharth Agarwal wrote:
> # HG changeset patch
> # User Siddharth Agarwal <sid0@fb.com>
> # Date 1414175979 25200
> #      Fri Oct 24 11:39:39 2014 -0700
> # Branch stable
> # Node ID 4649d2f769c1404cc9a1cda1f16dceef4e3d2451
> # Parent  24600c9d7f4eb02f10b365384ab805bae16027b7
> util.fspath: use a dict rather than a linear scan for lookups

Queued for stable, thanks.

Patch

diff --git a/mercurial/util.py b/mercurial/util.py
--- a/mercurial/util.py
+++ b/mercurial/util.py
@@ -899,11 +899,8 @@ 
 
     The root should be normcase-ed, too.
     '''
-    def find(p, contents):
-        for n in contents:
-            if normcase(n) == p:
-                return n
-        return None
+    def _makefspathcacheentry(dir):
+        return dict((normcase(n), n) for n in os.listdir(dir))
 
     seps = os.sep
     if os.altsep:
@@ -919,16 +916,15 @@ 
             continue
 
         if dir not in _fspathcache:
-            _fspathcache[dir] = os.listdir(dir)
+            _fspathcache[dir] = _makefspathcacheentry(dir)
         contents = _fspathcache[dir]
 
-        found = find(part, contents)
+        found = contents.get(part)
         if not found:
             # retry "once per directory" per "dirstate.walk" which
             # may take place for each patches of "hg qpush", for example
-            contents = os.listdir(dir)
-            _fspathcache[dir] = contents
-            found = find(part, contents)
+            _fspathcache[dir] = contents = _makefspathcacheentry(dir)
+            found = contents.get(part)
 
         result.append(found or part)
         dir = os.path.join(dir, part)