Patchwork [2,of,2] match: making visitdir() deal with non-recursive entries

login
register
mail settings
Submitter via Mercurial-devel
Date Feb. 10, 2017, 11:52 p.m.
Message ID <10d1ea213c7dd22b0843.1486770770@rdamazio.mtv.corp.google.com>
Download mbox | patch
Permalink /patch/18420/
State Superseded
Headers show

Comments

via Mercurial-devel - Feb. 10, 2017, 11:52 p.m.
# HG changeset patch
# User Rodrigo Damazio Bovendorp <rdamazio@google.com>
# Date 1486767895 28800
#      Fri Feb 10 15:04:55 2017 -0800
# Node ID 10d1ea213c7dd22b0843970eb88220d69b7c84cb
# Parent  2d9523f80c5b5fbace1b0566fb47bed7468369b0
match: making visitdir() deal with non-recursive entries

Primarily as an optimization to avoid recursing into directories that will
never have a match inside, this classifies each matcher pattern's root as
recursive or non-recursive (erring on the side of keeping it recursive,
which may lead to wasteful directory or manifest walks that yield no matches).

I measured the performance of "rootfilesin" in two repos:
- The Firefox repo with tree manifests, with
  "hg files -r . -I rootfilesin:browser".
  The browser directory contains about 3K files across 249 subdirectories.
- A specific Google-internal directory which contains 75K files across 19K
  subdirectories, with "hg files -r . -I rootfilesin:REDACTED".

I tested with both cold and warm disk caches. Cold cache was produced by
running "sync; echo 3 > /proc/sys/vm/drop_caches". Warm cache was produced
by re-running the same command a few times.

These were the results:

               Cold cache           Warm cache
             Before   After      Before   After
firefox      0m5.1s   0m2.18s   0m0.22s  0m0.14s
google3 dir  2m3.9s   0m1.57s   0m8.12s  0m0.16s

Certain extensions, notably narrowhg, can depend on this for correctness
(not trying to recurse into directories for which it has no information).
Katsunori FUJIWARA - Feb. 12, 2017, 7:23 p.m.
At Fri, 10 Feb 2017 15:52:50 -0800,
Rodrigo Damazio Bovendorp via Mercurial-devel wrote:
> 
> # HG changeset patch
> # User Rodrigo Damazio Bovendorp <rdamazio@google.com>
> # Date 1486767895 28800
> #      Fri Feb 10 15:04:55 2017 -0800
> # Node ID 10d1ea213c7dd22b0843970eb88220d69b7c84cb
> # Parent  2d9523f80c5b5fbace1b0566fb47bed7468369b0
> match: making visitdir() deal with non-recursive entries
> 
> Primarily as an optimization to avoid recursing into directories that will
> never have a match inside, this classifies each matcher pattern's root as
> recursive or non-recursive (erring on the side of keeping it recursive,
> which may lead to wasteful directory or manifest walks that yield no matches).
> 
> I measured the performance of "rootfilesin" in two repos:
> - The Firefox repo with tree manifests, with
>   "hg files -r . -I rootfilesin:browser".
>   The browser directory contains about 3K files across 249 subdirectories.
> - A specific Google-internal directory which contains 75K files across 19K
>   subdirectories, with "hg files -r . -I rootfilesin:REDACTED".
> 
> I tested with both cold and warm disk caches. Cold cache was produced by
> running "sync; echo 3 > /proc/sys/vm/drop_caches". Warm cache was produced
> by re-running the same command a few times.
> 
> These were the results:
> 
>                Cold cache           Warm cache
>              Before   After      Before   After
> firefox      0m5.1s   0m2.18s   0m0.22s  0m0.14s
> google3 dir  2m3.9s   0m1.57s   0m8.12s  0m0.16s
> 
> Certain extensions, notably narrowhg, can depend on this for correctness
> (not trying to recurse into directories for which it has no information).
>
> diff -r 2d9523f80c5b -r 10d1ea213c7d mercurial/match.py
> --- a/mercurial/match.py	Fri Feb 10 15:12:00 2017 -0800
> +++ b/mercurial/match.py	Fri Feb 10 15:04:55 2017 -0800
> @@ -125,9 +125,12 @@
>          self._always = False
>          self._pathrestricted = bool(include or exclude or patterns)
>          self._warn = warn
> +
> +        # roots are directories which are recursively included/excluded.
>          self._includeroots = set()
> +        self._excluderoots = set()
> +        # dirs are directories which are non-recursively included.
>          self._includedirs = set(['.'])
> -        self._excluderoots = set()
>  
>          if badfn is not None:
>              self.bad = badfn
> @@ -137,15 +140,22 @@
>              kindpats = self._normalize(include, 'glob', root, cwd, auditor)
>              self.includepat, im = _buildmatch(ctx, kindpats, '(?:/|$)',
>                                                listsubrepos, root)
> -            self._includeroots.update(_roots(kindpats))
> -            self._includedirs.update(util.dirs(self._includeroots))
> +            roots, dirs = _roots(kindpats)
> +            self._includeroots.update(roots)
> +            self._includedirs.update(dirs)
>              matchfns.append(im)
>          if exclude:
>              kindpats = self._normalize(exclude, 'glob', root, cwd, auditor)
>              self.excludepat, em = _buildmatch(ctx, kindpats, '(?:/|$)',
>                                                listsubrepos, root)
>              if not _anypats(kindpats):
> -                self._excluderoots.update(_roots(kindpats))
> +                # Only consider recursive excludes as such - if a non-recursive
> +                # exclude is used, we must still recurse into the excluded
> +                # directory, at least to find subdirectories. In such a case,
> +                # the regex still won't match the non-recursively-excluded
> +                # files.
> +                roots, dirs = _roots(kindpats)
> +                self._excluderoots.update(roots)
>              matchfns.append(lambda f: not em(f))
>          if exact:
>              if isinstance(patterns, list):
> @@ -156,6 +166,7 @@
>          elif patterns:
>              kindpats = self._normalize(patterns, default, root, cwd, auditor)
>              if not _kindpatsalwaysmatch(kindpats):
> +                roots, dirs = _roots(kindpats)

I couldn't find out user of "roots" and "dirs" above. Is this
intentional ?

>                  self._files = _explicitfiles(kindpats)
>                  self._anypats = self._anypats or _anypats(kindpats)
>                  self.patternspat, pm = _buildmatch(ctx, kindpats, '$',
> @@ -241,7 +252,7 @@
>              return 'all'
>          if dir in self._excluderoots:
>              return False
> -        if (self._includeroots and
> +        if ((self._includeroots or self._includedirs != set(['.'])) and
>              '.' not in self._includeroots and
>              dir not in self._includeroots and
>              dir not in self._includedirs and
> @@ -422,7 +433,9 @@
>          # m.exact(file) must be based off of the actual user input, otherwise
>          # inexact case matches are treated as exact, and not noted without -v.
>          if self._files:
> -            self._fileroots = set(_roots(self._kp))
> +            roots, dirs = _roots(self._kp)
> +            self._fileroots = set(roots)
> +            self._fileroots.update(dirs)
>  
>      def _normalize(self, patterns, default, root, cwd, auditor):
>          self._kp = super(icasefsmatcher, self)._normalize(patterns, default,
> @@ -622,18 +635,22 @@
>          raise error.Abort(_("invalid pattern"))
>  
>  def _roots(kindpats):
> -    '''return roots and exact explicitly listed files from patterns
> -
> -    >>> _roots([('glob', 'g/*', ''), ('glob', 'g', ''), ('glob', 'g*', '')])
> -    ['g', 'g', '.']
> -    >>> _roots([('rootfilesin', 'g', ''), ('rootfilesin', '', '')])
> -    ['g', '.']
> +    '''return roots and exact directories from patterns.
> +
> +    roots are directories to match recursively, whereas exact directories should
> +    be matched non-recursively.
> +
> +    >>> _roots([('glob', 'g/h/*', ''), ('glob', 'g/h', ''), ('glob', 'g*', '')])
> +    (['g/h', 'g/h', '.'], ['g'])
> +    >>> _roots([('rootfilesin', 'g/h', ''), ('rootfilesin', '', '')])
> +    ([], ['g/h', '.', 'g'])
>      >>> _roots([('relpath', 'r', ''), ('path', 'p/p', ''), ('path', '', '')])
> -    ['r', 'p/p', '.']
> +    (['r', 'p/p', '.'], ['p'])
>      >>> _roots([('relglob', 'rg*', ''), ('re', 're/', ''), ('relre', 'rr', '')])
> -    ['.', '.', '.']
> +    (['.', '.', '.'], [])
>      '''
>      r = []
> +    d = []
>      for kind, pat, source in kindpats:
>          if kind == 'glob': # find the non-glob prefix
>              root = []
> @@ -642,11 +659,17 @@
>                      break
>                  root.append(p)
>              r.append('/'.join(root) or '.')
> -        elif kind in ('relpath', 'path', 'rootfilesin'):
> +        elif kind in ('relpath', 'path'):
>              r.append(pat or '.')
> +        elif kind in ('rootfilesin'):
> +            d.append(pat or '.')
>          else: # relglob, re, relre
>              r.append('.')
> -    return r
> +
> +    # Append the parents as exact directories.
> +    d.extend(util.dirs(d))
> +    d.extend(util.dirs(r))
> +    return r, d

"d" (or "dirs" on caller side) is ignored at some of _roots()
invocations (especially in _explicitfiles()). In such case, applying
util.dirs() on "d" and "r" above is meaningless.

Can we utilize this util.dirs() on "d" and "r", and apply it only if
it is needed ?

  
>  def _explicitfiles(kindpats):
>      '''Returns the potential explicit filenames from the patterns.
> @@ -659,7 +682,8 @@
>      # Keep only the pattern kinds where one can specify filenames (vs only
>      # directory names).
>      filable = [kp for kp in kindpats if kp[0] not in ('rootfilesin')]
> -    return _roots(filable)
> +    roots, dirs = _roots(filable)
> +    return roots
>  
>  def _anypats(kindpats):
>      for kind, pat, source in kindpats:
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel

Patch

diff -r 2d9523f80c5b -r 10d1ea213c7d mercurial/match.py
--- a/mercurial/match.py	Fri Feb 10 15:12:00 2017 -0800
+++ b/mercurial/match.py	Fri Feb 10 15:04:55 2017 -0800
@@ -125,9 +125,12 @@ 
         self._always = False
         self._pathrestricted = bool(include or exclude or patterns)
         self._warn = warn
+
+        # roots are directories which are recursively included/excluded.
         self._includeroots = set()
+        self._excluderoots = set()
+        # dirs are directories which are non-recursively included.
         self._includedirs = set(['.'])
-        self._excluderoots = set()
 
         if badfn is not None:
             self.bad = badfn
@@ -137,15 +140,22 @@ 
             kindpats = self._normalize(include, 'glob', root, cwd, auditor)
             self.includepat, im = _buildmatch(ctx, kindpats, '(?:/|$)',
                                               listsubrepos, root)
-            self._includeroots.update(_roots(kindpats))
-            self._includedirs.update(util.dirs(self._includeroots))
+            roots, dirs = _roots(kindpats)
+            self._includeroots.update(roots)
+            self._includedirs.update(dirs)
             matchfns.append(im)
         if exclude:
             kindpats = self._normalize(exclude, 'glob', root, cwd, auditor)
             self.excludepat, em = _buildmatch(ctx, kindpats, '(?:/|$)',
                                               listsubrepos, root)
             if not _anypats(kindpats):
-                self._excluderoots.update(_roots(kindpats))
+                # Only consider recursive excludes as such - if a non-recursive
+                # exclude is used, we must still recurse into the excluded
+                # directory, at least to find subdirectories. In such a case,
+                # the regex still won't match the non-recursively-excluded
+                # files.
+                roots, dirs = _roots(kindpats)
+                self._excluderoots.update(roots)
             matchfns.append(lambda f: not em(f))
         if exact:
             if isinstance(patterns, list):
@@ -156,6 +166,7 @@ 
         elif patterns:
             kindpats = self._normalize(patterns, default, root, cwd, auditor)
             if not _kindpatsalwaysmatch(kindpats):
+                roots, dirs = _roots(kindpats)
                 self._files = _explicitfiles(kindpats)
                 self._anypats = self._anypats or _anypats(kindpats)
                 self.patternspat, pm = _buildmatch(ctx, kindpats, '$',
@@ -241,7 +252,7 @@ 
             return 'all'
         if dir in self._excluderoots:
             return False
-        if (self._includeroots and
+        if ((self._includeroots or self._includedirs != set(['.'])) and
             '.' not in self._includeroots and
             dir not in self._includeroots and
             dir not in self._includedirs and
@@ -422,7 +433,9 @@ 
         # m.exact(file) must be based off of the actual user input, otherwise
         # inexact case matches are treated as exact, and not noted without -v.
         if self._files:
-            self._fileroots = set(_roots(self._kp))
+            roots, dirs = _roots(self._kp)
+            self._fileroots = set(roots)
+            self._fileroots.update(dirs)
 
     def _normalize(self, patterns, default, root, cwd, auditor):
         self._kp = super(icasefsmatcher, self)._normalize(patterns, default,
@@ -622,18 +635,22 @@ 
         raise error.Abort(_("invalid pattern"))
 
 def _roots(kindpats):
-    '''return roots and exact explicitly listed files from patterns
-
-    >>> _roots([('glob', 'g/*', ''), ('glob', 'g', ''), ('glob', 'g*', '')])
-    ['g', 'g', '.']
-    >>> _roots([('rootfilesin', 'g', ''), ('rootfilesin', '', '')])
-    ['g', '.']
+    '''return roots and exact directories from patterns.
+
+    roots are directories to match recursively, whereas exact directories should
+    be matched non-recursively.
+
+    >>> _roots([('glob', 'g/h/*', ''), ('glob', 'g/h', ''), ('glob', 'g*', '')])
+    (['g/h', 'g/h', '.'], ['g'])
+    >>> _roots([('rootfilesin', 'g/h', ''), ('rootfilesin', '', '')])
+    ([], ['g/h', '.', 'g'])
     >>> _roots([('relpath', 'r', ''), ('path', 'p/p', ''), ('path', '', '')])
-    ['r', 'p/p', '.']
+    (['r', 'p/p', '.'], ['p'])
     >>> _roots([('relglob', 'rg*', ''), ('re', 're/', ''), ('relre', 'rr', '')])
-    ['.', '.', '.']
+    (['.', '.', '.'], [])
     '''
     r = []
+    d = []
     for kind, pat, source in kindpats:
         if kind == 'glob': # find the non-glob prefix
             root = []
@@ -642,11 +659,17 @@ 
                     break
                 root.append(p)
             r.append('/'.join(root) or '.')
-        elif kind in ('relpath', 'path', 'rootfilesin'):
+        elif kind in ('relpath', 'path'):
             r.append(pat or '.')
+        elif kind in ('rootfilesin'):
+            d.append(pat or '.')
         else: # relglob, re, relre
             r.append('.')
-    return r
+
+    # Append the parents as exact directories.
+    d.extend(util.dirs(d))
+    d.extend(util.dirs(r))
+    return r, d
 
 def _explicitfiles(kindpats):
     '''Returns the potential explicit filenames from the patterns.
@@ -659,7 +682,8 @@ 
     # Keep only the pattern kinds where one can specify filenames (vs only
     # directory names).
     filable = [kp for kp in kindpats if kp[0] not in ('rootfilesin')]
-    return _roots(filable)
+    roots, dirs = _roots(filable)
+    return roots
 
 def _anypats(kindpats):
     for kind, pat, source in kindpats: