Patchwork [4,of,4] dirstate: use a presized dict for the dirstate

Submitter Siddharth Agarwal
Date June 16, 2015, 7:55 a.m.
Message ID <ca98a16ff9d5b1ef9b54.1434441357@devbig136.prn2.facebook.com>
Permalink /patch/9661/
State Accepted

Comments

Siddharth Agarwal - June 16, 2015, 7:55 a.m.
# HG changeset patch
# User Siddharth Agarwal <sid0@fb.com>
# Date 1434440761 25200
#      Tue Jun 16 00:46:01 2015 -0700
# Node ID ca98a16ff9d5b1ef9b54503fde91edd928d3a88e
# Parent  7b8f6849ec9d2e5f20d7f452e448ad2720e505b7
dirstate: use a presized dict for the dirstate

This uses a simple heuristic to avoid expensive resizes.

On a real-world repo with around 400,000 files, perfdirstate:

before: ! wall 0.155562 comb 0.160000 user 0.150000 sys 0.010000 (best of 64)
after:  ! wall 0.132638 comb 0.130000 user 0.120000 sys 0.010000 (best of 75)

On another real-world repo with around 250,000 files:

before: ! wall 0.098459 comb 0.100000 user 0.090000 sys 0.010000 (best of 100)
after:  ! wall 0.089084 comb 0.090000 user 0.080000 sys 0.010000 (best of 100)
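
For readers following the heuristic: the code comment in the patch cites an average dirstate entry size of 85 bytes, while the code divides the file size by 71. What follows is a minimal sketch of that arithmetic in Python 3, not the Mercurial implementation. The helper name estimate_presize and the constant names are hypothetical, and the example assumes a file made up only of average-sized entries, ignoring the file header and per-path variation.

```python
# Sketch of the presizing arithmetic. The two constants come from the patch;
# the function and constant names are illustrative and are not Mercurial APIs.

AVG_ENTRY_BYTES = 85   # average on-disk entry size cited in the patch comment
PRESIZE_DIVISOR = 71   # divisor the patch applies to len(st)


def estimate_presize(dirstate_bytes):
    """Return how many dict slots to reserve for a dirstate of this size."""
    # The patch targets Python 2, where `/` on two ints truncates; `//` is
    # the Python 3 spelling of the same integer division.
    return dirstate_bytes // PRESIZE_DIVISOR


if __name__ == "__main__":
    files = 400000
    # Assumption: the file is exactly files * 85 bytes. Real dirstate files
    # also carry a header, and entry size varies with path length.
    size = files * AVG_ENTRY_BYTES
    reserved = estimate_presize(size)
    print(reserved, round(reserved / files, 3))  # 478873 1.197
```

Dividing by 71 rather than 85 reserves roughly 20% more slots than the expected number of entries, which is consistent with the comment's stated preference for filling an oversized dict over paying for a resize.
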
Pierre-Yves David - June 16, 2015, 7:36 p.m.
On 06/16/2015 12:55 AM, Siddharth Agarwal wrote:
> # HG changeset patch
> # User Siddharth Agarwal <sid0@fb.com>
> # Date 1434440761 25200
> #      Tue Jun 16 00:46:01 2015 -0700
> # Node ID ca98a16ff9d5b1ef9b54503fde91edd928d3a88e
> # Parent  7b8f6849ec9d2e5f20d7f452e448ad2720e505b7
> dirstate: use a presized dict for the dirstate

Nice, these are pushed to the clowncopter.

Patch

diff --git a/mercurial/dirstate.py b/mercurial/dirstate.py
--- a/mercurial/dirstate.py
+++ b/mercurial/dirstate.py
@@ -338,6 +338,19 @@  class dirstate(object):
         if not st:
             return
 
+        if util.safehasattr(parsers, 'dict_new_presized'):
+            # Make an estimate of the number of files in the dirstate based on
+            # its size. From a linear regression on a set of real-world repos,
+            # all over 10,000 files, the size of a dirstate entry is 85
+            # bytes. The cost of resizing is significantly higher than the cost
+            # of filling in a larger presized dict, so subtract 20% from the
+            # size.
+            #
+            # This heuristic is imperfect in many ways, so in a future dirstate
+            # format update it makes sense to just record the number of entries
+            # on write.
+            self._map = parsers.dict_new_presized(len(st) / 71)
+
         # Python's garbage collector triggers a GC each time a certain number
         # of container objects (the number being defined by
         # gc.get_threshold()) are allocated. parse_dirstate creates a tuple