Patchwork manifest: cache parsed fulltext during find

login
register
mail settings
Submitter Bryan O'Sullivan
Date March 11, 2013, 9:17 p.m.
Message ID <58ab71dcf9fbc1326267.1363036678@australite.thefacebook.com>
Download mbox | patch
Permalink /patch/1110/
State Rejected
Delegated to: Augie Fackler
Headers show

Comments

Bryan O'Sullivan - March 11, 2013, 9:17 p.m.
# HG changeset patch
# User Bryan O'Sullivan <bryano@fb.com>
# Date 1363036671 25200
# Node ID 58ab71dcf9fbc1326267811deb13074e595e9667
# Parent  15e77967d1b3bb5eb9dfb6e7bfe94691d280af74
manifest: cache parsed fulltext during find

We were looking up the manifest cache during find, but if it had somehow
not been primed prior to a call, we were *not* adding the looked-up
entry to it.

This led to a huge slowdown in "hg grep", where we would reconstruct a
manifest entry from scratch for every single file in every revision we
were visiting.

For a manifest containing 150,000 entries with 6,000 changes in one
revision, this change improves "hg grep" performance from 650 seconds
to 2. It seems likely to me that other commands could be affected, too,
but I haven't really looked at which.
Augie Fackler - March 12, 2013, 12:25 a.m.
On Mar 11, 2013, at 4:17 PM, Bryan O'Sullivan <bos@serpentine.com> wrote:

> # HG changeset patch
> # User Bryan O'Sullivan <bryano@fb.com>
> # Date 1363036671 25200
> # Node ID 58ab71dcf9fbc1326267811deb13074e595e9667
> # Parent  15e77967d1b3bb5eb9dfb6e7bfe94691d280af74
> manifest: cache parsed fulltext during find
> 
> We were looking up the manifest cache during find, but if it had somehow
> not been primed prior to a call, we were *not* adding the looked-up
> entry to it.

Ouch. Fix looks obviously correct, I'd say ship it.

> 
> This led to a huge slowdown in "hg grep", where we would reconstruct a
> manifest entry from scratch for every single file in every revision we
> were visiting.
> 
> For a manifest containing 150,000 entries with 6,000 changes in one
> revision, this change improves "hg grep" performance from 650 seconds
> to 2. It seems likely to me that other commands could be affected, too,
> but I haven't really looked at which.
> 
> diff --git a/mercurial/manifest.py b/mercurial/manifest.py
> --- a/mercurial/manifest.py
> +++ b/mercurial/manifest.py
> @@ -107,6 +107,7 @@ class manifest(revlog.revlog):
>             mapping = self._mancache[node][0]
>             return mapping.get(f), mapping.flags(f)
>         text = self.revision(node)
> +        self._mancache[node] = self.parse(text), array.array('c', text)
>         start, end = self._search(text, f)
>         if start == end:
>             return None, None
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
Durham Goode - March 12, 2013, 12:25 a.m.
On 3/11/13 2:17 PM, "Bryan O'Sullivan" <bos@serpentine.com> wrote:
>
>diff --git a/mercurial/manifest.py b/mercurial/manifest.py
>--- a/mercurial/manifest.py
>+++ b/mercurial/manifest.py
>@@ -107,6 +107,7 @@ class manifest(revlog.revlog):
>             mapping = self._mancache[node][0]
>             return mapping.get(f), mapping.flags(f)
>         text = self.revision(node)
>+        self._mancache[node] = self.parse(text), array.array('c', text)
>         start, end = self._search(text, f)
>         if start == end:
>             return None, None

From looking at the find() function, it looks like it was written with the
purpose of not parsing the whole manifest.  So I worry that this fix will
have a negative affect somewhere else.  Looks like the find() function was
originally added as part of looking up tags
(http://selenic.com/hg/rev/dbdce3b99988).  'hg tags' reads the manifest
for each head and looks for the .hgtags file.  So 'hg tags' will now be
slower by (number of heads) * (speed of self.parse()).

If that perf hit is acceptable, we might as well just write find as:

def find(self, node, f):
    mapping = self.read(node)
    Return mapping.get(f), mapping.flags(f)

If that perf hit is not acceptable, we should just change 'hg grep' to use
manifest.read() and do the .get() and .flags() itself.

Durham
Bryan O'Sullivan - March 12, 2013, 4:57 a.m.
On Mon, Mar 11, 2013 at 5:25 PM, Durham Goode <durham@fb.com> wrote:

>
> From looking at the find() function, it looks like it was written with the
> purpose of not parsing the whole manifest.


Yep. It was written back when manifest parsing was performed entirely in
Python, and was consequently very slow. About two years later, I wrote the
C code for parsing the manifest.

At that point, the need to not parse the manifest was likely no longer an
issue, except for those hypothetical people using Mercurial without a C
compiler.

 So I worry that this fix will
> have a negative affect somewhere else.


I haven't been able to find such an effect. That said, I suspect that
manifest.find and manifest._search are entirely unnecessary now (except for
the --pure folks), and could be done away with.

If that perf hit is acceptable, we might as well just write find as:
>
> def find(self, node, f):
>     mapping = self.read(node)
>     return mapping.get(f), mapping.flags(f)
>

I don't think there is a perf hit, and the above ought to work fine.
Siddharth Agarwal - March 12, 2013, 10:46 a.m.
Yes, I'd drop find and _search altogether.

Bryan O'Sullivan <bos@serpentine.com> wrote:



On Mon, Mar 11, 2013 at 5:25 PM, Durham Goode <durham@fb.com<mailto:durham@fb.com>> wrote:

From looking at the find() function, it looks like it was written with the
purpose of not parsing the whole manifest.

Yep. It was written back when manifest parsing was performed entirely in Python, and was consequently very slow. About two years later, I wrote the C code for parsing the manifest.

At that point, the need to not parse the manifest was likely no longer an issue, except for those hypothetical people using Mercurial without a C compiler.

 So I worry that this fix will
have a negative affect somewhere else.

I haven't been able to find such an effect. That said, I suspect that manifest.find and manifest._search are entirely unnecessary now (except for the --pure folks), and could be done away with.

If that perf hit is acceptable, we might as well just write find as:

def find(self, node, f):
    mapping = self.read(node)
    return mapping.get(f), mapping.flags(f)

I don't think there is a perf hit, and the above ought to work fine.

Patch

diff --git a/mercurial/manifest.py b/mercurial/manifest.py
--- a/mercurial/manifest.py
+++ b/mercurial/manifest.py
@@ -107,6 +107,7 @@  class manifest(revlog.revlog):
             mapping = self._mancache[node][0]
             return mapping.get(f), mapping.flags(f)
         text = self.revision(node)
+        self._mancache[node] = self.parse(text), array.array('c', text)
         start, end = self._search(text, f)
         if start == end:
             return None, None