Patchwork [2,of,2] strip: make tree stripping O(changes) instead of O(repo)

login
register
mail settings
Submitter Durham Goode
Date May 8, 2017, 6:40 p.m.
Message ID <74881b9a39b2bab273d0.1494268803@dev111.prn1.facebook.com>
Download mbox | patch
Permalink /patch/20524/
State Accepted
Headers show

Comments

Durham Goode - May 8, 2017, 6:40 p.m.
# HG changeset patch
# User Durham Goode <durham@fb.com>
# Date 1494268523 25200
#      Mon May 08 11:35:23 2017 -0700
# Node ID 74881b9a39b2bab273d09009385e3c9ca717a13a
# Parent  5dec5907fe49a488d3ade272d4a5cf090914e59c
strip: make tree stripping O(changes) instead of O(repo)

The old tree stripping logic iterated over every tree revlog in the repo looking
for commits that had revs to be stripped. That's very inefficient in large
repos. Instead, let's look at what files are touched by the strip and only
inspect those revlogs.

I don't have actual perf numbers, since internally we don't use a true
treemanifest, but simply iterating over hundreds of thousands of revlogs takes
many, many seconds, so this should help tremendously when stripping only a few
commits.
Augie Fackler - May 8, 2017, 7:06 p.m.
On Mon, May 08, 2017 at 11:40:03AM -0700, Durham Goode wrote:
> # HG changeset patch
> # User Durham Goode <durham@fb.com>
> # Date 1494268523 25200
> #      Mon May 08 11:35:23 2017 -0700
> # Node ID 74881b9a39b2bab273d09009385e3c9ca717a13a
> # Parent  5dec5907fe49a488d3ade272d4a5cf090914e59c
> strip: make tree stripping O(changes) instead of O(repo)

queued, thanks
via Mercurial-devel - May 15, 2017, 8:32 p.m.
On Mon, May 8, 2017 at 11:40 AM, Durham Goode <durham@fb.com> wrote:
> # HG changeset patch
> # User Durham Goode <durham@fb.com>
> # Date 1494268523 25200
> #      Mon May 08 11:35:23 2017 -0700
> # Node ID 74881b9a39b2bab273d09009385e3c9ca717a13a
> # Parent  5dec5907fe49a488d3ade272d4a5cf090914e59c
> strip: make tree stripping O(changes) instead of O(repo)
>
> The old tree stripping logic iterated over every tree revlog in the repo looking
> for commits that had revs to be stripped. That's very inefficient in large
> repos. Instead, let's look at what files are touched by the strip and only
> inspect those revlogs.

Sorry, I didn't notice this patch until today (because bisection of a
test failure in narrowhg pointed me to it). This patch makes me a
little worried that it'll have the same bugs as the initial version of
changegroup generation for treemanifests had, i.e. that it forgets
that merge commits can affect no files, but still affect directories
(fixed in commit 1ac8ce13). I haven't tried to see if I can make it
fail with this patch applied (the narrowhg failures was unrelated).

>
> I don't have actual perf numbers, since internally we don't use a true
> treemanifest, but simply iterating over hundreds of thousands of revlogs takes
> many, many seconds, so this should help tremendously when stripping only a few
> commits.
>
> diff --git a/mercurial/repair.py b/mercurial/repair.py
> --- a/mercurial/repair.py
> +++ b/mercurial/repair.py
> @@ -238,11 +238,12 @@ def strip(ui, repo, nodelist, backup=Tru
>  def striptrees(repo, tr, striprev, files):
>      if 'treemanifest' in repo.requirements: # safe but unnecessary
>                                              # otherwise
> -        for unencoded, encoded, size in repo.store.datafiles():
> -            if (unencoded.startswith('meta/') and
> -                unencoded.endswith('00manifest.i')):
> -                dir = unencoded[5:-12]
> -                repo.manifestlog._revlog.dirlog(dir).strip(striprev, tr)
> +        treerevlog = repo.manifestlog._revlog
> +        for dir in util.dirs(files):

Note that util.dirs() does not return the root directory (I've often
wanted to change that, and I may send a patch soon), so do you need to
manually include it here?

> +            # If the revlog doesn't exist, this returns an empty revlog and is a
> +            # no-op.
> +            rl = treerevlog.dirlog(dir)
> +            rl.strip(striprev, tr)
>
>  def rebuildfncache(ui, repo):
>      """Rebuilds the fncache file from repo history.
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Durham Goode - May 15, 2017, 8:39 p.m.
On 5/15/17 1:32 PM, Martin von Zweigbergk wrote:
> On Mon, May 8, 2017 at 11:40 AM, Durham Goode <durham@fb.com> wrote:
>> # HG changeset patch
>> # User Durham Goode <durham@fb.com>
>> # Date 1494268523 25200
>> #      Mon May 08 11:35:23 2017 -0700
>> # Node ID 74881b9a39b2bab273d09009385e3c9ca717a13a
>> # Parent  5dec5907fe49a488d3ade272d4a5cf090914e59c
>> strip: make tree stripping O(changes) instead of O(repo)
>>
>> The old tree stripping logic iterated over every tree revlog in the repo looking
>> for commits that had revs to be stripped. That's very inefficient in large
>> repos. Instead, let's look at what files are touched by the strip and only
>> inspect those revlogs.
>
> Sorry, I didn't notice this patch until today (because bisection of a
> test failure in narrowhg pointed me to it). This patch makes me a
> little worried that it'll have the same bugs as the initial version of
> changegroup generation for treemanifests had, i.e. that it forgets
> that merge commits can affect no files, but still affect directories
> (fixed in commit 1ac8ce13). I haven't tried to see if I can make it
> fail with this patch applied (the narrowhg failures was unrelated).

Ug, good point. I think this patch will fail in that case. Iterating 
over all the revlogs really isn't a scalable option though, so it sounds 
like maybe we need to actually do a walk of the trees for manifests that 
are being stripped.  Like, for each tree being stripped, diff them with 
their parents and return all directories that are new.

We could probably add an argument to walksubtrees that only yields 
subtrees that are different from the argument tree. This type of logic 
is useful for determining what trees need to be bundled too. We do 
something like this in our native implementation.


>>
>> diff --git a/mercurial/repair.py b/mercurial/repair.py
>> --- a/mercurial/repair.py
>> +++ b/mercurial/repair.py
>> @@ -238,11 +238,12 @@ def strip(ui, repo, nodelist, backup=Tru
>>  def striptrees(repo, tr, striprev, files):
>>      if 'treemanifest' in repo.requirements: # safe but unnecessary
>>                                              # otherwise
>> -        for unencoded, encoded, size in repo.store.datafiles():
>> -            if (unencoded.startswith('meta/') and
>> -                unencoded.endswith('00manifest.i')):
>> -                dir = unencoded[5:-12]
>> -                repo.manifestlog._revlog.dirlog(dir).strip(striprev, tr)
>> +        treerevlog = repo.manifestlog._revlog
>> +        for dir in util.dirs(files):
>
> Note that util.dirs() does not return the root directory (I've often
> wanted to change that, and I may send a patch soon), so do you need to
> manually include it here?

The root manifest is already handled by the normal strip mechanism, so 
no need to handle it here.
via Mercurial-devel - May 15, 2017, 9:19 p.m.
On Mon, May 15, 2017 at 1:39 PM, Durham Goode <durham@fb.com> wrote:
>
>
> On 5/15/17 1:32 PM, Martin von Zweigbergk wrote:
>>
>> On Mon, May 8, 2017 at 11:40 AM, Durham Goode <durham@fb.com> wrote:
>>>
>>> # HG changeset patch
>>> # User Durham Goode <durham@fb.com>
>>> # Date 1494268523 25200
>>> #      Mon May 08 11:35:23 2017 -0700
>>> # Node ID 74881b9a39b2bab273d09009385e3c9ca717a13a
>>> # Parent  5dec5907fe49a488d3ade272d4a5cf090914e59c
>>> strip: make tree stripping O(changes) instead of O(repo)
>>>
>>> The old tree stripping logic iterated over every tree revlog in the repo
>>> looking
>>> for commits that had revs to be stripped. That's very inefficient in
>>> large
>>> repos. Instead, let's look at what files are touched by the strip and
>>> only
>>> inspect those revlogs.
>>
>>
>> Sorry, I didn't notice this patch until today (because bisection of a
>> test failure in narrowhg pointed me to it). This patch makes me a
>> little worried that it'll have the same bugs as the initial version of
>> changegroup generation for treemanifests had, i.e. that it forgets
>> that merge commits can affect no files, but still affect directories
>> (fixed in commit 1ac8ce13). I haven't tried to see if I can make it
>> fail with this patch applied (the narrowhg failures was unrelated).
>
>
> Ug, good point. I think this patch will fail in that case. Iterating over
> all the revlogs really isn't a scalable option though, so it sounds like
> maybe we need to actually do a walk of the trees for manifests that are
> being stripped.  Like, for each tree being stripped, diff them with their
> parents and return all directories that are new.

Yep, that's exactly what I did for changegroup generation in 1ac8ce13.

For now, could you send a patch to back this change out and I'll queue it?

>
> We could probably add an argument to walksubtrees that only yields subtrees
> that are different from the argument tree. This type of logic is useful for
> determining what trees need to be bundled too. We do something like this in
> our native implementation.
>
>
>>>
>>> diff --git a/mercurial/repair.py b/mercurial/repair.py
>>> --- a/mercurial/repair.py
>>> +++ b/mercurial/repair.py
>>> @@ -238,11 +238,12 @@ def strip(ui, repo, nodelist, backup=Tru
>>>  def striptrees(repo, tr, striprev, files):
>>>      if 'treemanifest' in repo.requirements: # safe but unnecessary
>>>                                              # otherwise
>>> -        for unencoded, encoded, size in repo.store.datafiles():
>>> -            if (unencoded.startswith('meta/') and
>>> -                unencoded.endswith('00manifest.i')):
>>> -                dir = unencoded[5:-12]
>>> -                repo.manifestlog._revlog.dirlog(dir).strip(striprev, tr)
>>> +        treerevlog = repo.manifestlog._revlog
>>> +        for dir in util.dirs(files):
>>
>>
>> Note that util.dirs() does not return the root directory (I've often
>> wanted to change that, and I may send a patch soon), so do you need to
>> manually include it here?
>
>
> The root manifest is already handled by the normal strip mechanism, so no
> need to handle it here.

Patch

diff --git a/mercurial/repair.py b/mercurial/repair.py
--- a/mercurial/repair.py
+++ b/mercurial/repair.py
@@ -238,11 +238,12 @@  def strip(ui, repo, nodelist, backup=Tru
 def striptrees(repo, tr, striprev, files):
     if 'treemanifest' in repo.requirements: # safe but unnecessary
                                             # otherwise
-        for unencoded, encoded, size in repo.store.datafiles():
-            if (unencoded.startswith('meta/') and
-                unencoded.endswith('00manifest.i')):
-                dir = unencoded[5:-12]
-                repo.manifestlog._revlog.dirlog(dir).strip(striprev, tr)
+        treerevlog = repo.manifestlog._revlog
+        for dir in util.dirs(files):
+            # If the revlog doesn't exist, this returns an empty revlog and is a
+            # no-op.
+            rl = treerevlog.dirlog(dir)
+            rl.strip(striprev, tr)
 
 def rebuildfncache(ui, repo):
     """Rebuilds the fncache file from repo history.