Patchwork revset: use "repo.changelog" instead of "list(repo)" for resource efficiency

login
register
mail settings
Submitter Katsunori FUJIWARA
Date Feb. 17, 2013, 4:08 p.m.
Message ID <06fb2582618f43400b38.1361117286@juju>
Download mbox | patch
Permalink /patch/1027/
State Changes Requested
Headers show

Comments

Katsunori FUJIWARA - Feb. 17, 2013, 4:08 p.m.
# HG changeset patch
# User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
# Date 1361115858 -32400
# Node ID 06fb2582618f43400b38ac4c791410d2fc748d2c
# Parent  f12804d3ff801b989cb2aab1aad93047a8db46f1
revset: use "repo.changelog" instead of "list(repo)" for resource efficiency

Before this patch, some revset predicates use "list(repo)" as "subset"
to indicate "all revisions in the repo", and it immediately creates
the list object containing such revisions, even though it may not be
used: for example, "stringset" omits checking whether the revision
number corresponding to the specified string is contained in "subset"
or not, if "len(subset) == len(repo)".

This patch uses "repo.changelog" instead of "list(repo)" for resource
efficiency.
Bryan O'Sullivan - Feb. 19, 2013, 6:27 p.m.
On Sun, Feb 17, 2013 at 8:08 AM, FUJIWARA Katsunori
<foozy@lares.dti.ne.jp>wrote:

> This patch uses "repo.changelog" instead of "list(repo)" for resource
> efficiency.
>

What effect does this have on performance?
Dirkjan Ochtman - Feb. 19, 2013, 7:08 p.m.
On Sun, Feb 17, 2013 at 5:08 PM, FUJIWARA Katsunori
<foozy@lares.dti.ne.jp> wrote:
> # HG changeset patch
> # User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
> # Date 1361115858 -32400
> # Node ID 06fb2582618f43400b38ac4c791410d2fc748d2c
> # Parent  f12804d3ff801b989cb2aab1aad93047a8db46f1
> revset: use "repo.changelog" instead of "list(repo)" for resource efficiency

Doesn't that kill the filtering? IIRC Pierre-Yves was moving in the
other direction on these kinds of things.

Cheers,

Dirkjan
Katsunori FUJIWARA - Feb. 20, 2013, 1:02 p.m.
At Tue, 19 Feb 2013 10:27:40 -0800,
Bryan O'Sullivan wrote:
> 
> [1  <text/plain; UTF-8 (7bit)>]
> On Sun, Feb 17, 2013 at 8:08 AM, FUJIWARA Katsunori
> <foozy@lares.dti.ne.jp>wrote:
> 
> > This patch uses "repo.changelog" instead of "list(repo)" for resource
> > efficiency.
> >
> 
> What effect does this have on performance?

I expected this patch to prevent from immediate creation of "list"
object containing integers corresponded to all revisions in
repository, because it may be very large and not be used.

For example, "revset.stringset()" omits "x in subset" check, if
"len(subset) == len(repo)": "len(list(repo)) == len(repo)" is always
True, if there is no hidden revision.

But, I overlooked that:

  - changelog has "__iter__()" but not "__contains__()". it decreases
    performance, because each "x in cl" causes "__iter__()" invocation

  - some codes outside revset module also invoke "revset.getset()"
    with "list(repo)"

IMHO, "repo" itself seems to be good to take place of "list(repo)":
"repo" should have all of "__iter__()", "__len__()" and
"__contains_()".
    
I'll try to resend refined one.

----------------------------------------------------------------------
[FUJIWARA Katsunori]                             foozy@lares.dti.ne.jp
Katsunori FUJIWARA - Feb. 20, 2013, 1:02 p.m.
At Tue, 19 Feb 2013 13:31:54 -0600,
Kevin Bullock wrote:
> 
> On Feb 19, 2013, at 1:08 PM, Dirkjan Ochtman wrote:
> 
> > On Sun, Feb 17, 2013 at 5:08 PM, FUJIWARA Katsunori
> > <foozy@lares.dti.ne.jp> wrote:
> >> # HG changeset patch
> >> # User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
> >> # Date 1361115858 -32400
> >> # Node ID 06fb2582618f43400b38ac4c791410d2fc748d2c
> >> # Parent  f12804d3ff801b989cb2aab1aad93047a8db46f1
> >> revset: use "repo.changelog" instead of "list(repo)" for resource efficiency
> > 
> > Doesn't that kill the filtering? IIRC Pierre-Yves was moving in the
> > other direction on these kinds of things.
> 
> Nope, repo.changelog is filtered on a filtered repo object.

Yes, repo.changelog is filtered.

But I found my own mistake around hidden revisions.

"list(repo)" uses "changelog.__iter__()" to create list object as
subset.

So, "len(subset) == len(repo)" is not True, when there are some hidden
revisions. This causes "x in subset" check in "stringset()" below:

    def stringset(repo, subset, x):
        x = repo[x].rev()
        if x == -1 and len(subset) == len(repo):
            return [-1]
        if len(subset) == len(repo) or x in subset:
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            return [x]
        return []

But "repo.changelog" as subset causes "len(subset) == len(repo)"
always, so ID of hidden revision specified directly in revset
expression is treated as valid one, even if --hidden is not specified
in command line.

So, this patch should also add "repo.filteredrevs" check into above
code.

I'll try to resend refined one.

----------------------------------------------------------------------
[FUJIWARA Katsunori]                             foozy@lares.dti.ne.jp
Bryan O'Sullivan - Feb. 20, 2013, 6:52 p.m.
On Wed, Feb 20, 2013 at 5:02 AM, FUJIWARA Katsunori
<foozy@lares.dti.ne.jp>wrote:

> I expected this patch to prevent from immediate creation of "list"
> object containing integers corresponded to all revisions in
> repository, because it may be very large and not be used.
>

The contrib/perf.py extension is useful for quantifying this sort of thing.
Revsets are very performance sensitive, and it's all too easy to regress
performance if you don't measure it. I'd appreciate it if you could post
some before/after numbers for this patch.

Patch

diff --git a/mercurial/revset.py b/mercurial/revset.py
--- a/mercurial/revset.py
+++ b/mercurial/revset.py
@@ -239,7 +239,7 @@ 
 
 def dagrange(repo, subset, x, y):
     if subset:
-        r = list(repo)
+        r = repo.changelog
         xs = _revsbetween(repo, getset(repo, r, x), getset(repo, r, y))
         s = set(subset)
         return [r for r in xs if r in s]
@@ -286,14 +286,14 @@ 
     """
     # i18n: "ancestor" is a keyword
     l = getlist(x)
-    rl = list(repo)
+    cl = repo.changelog
     anc = None
 
-    # (getset(repo, rl, i) for i in l) generates a list of lists
+    # (getset(repo, cl, i) for i in l) generates a list of lists
     rev = repo.changelog.rev
     ancestor = repo.changelog.ancestor
     node = repo.changelog.node
-    for revs in (getset(repo, rl, i) for i in l):
+    for revs in (getset(repo, cl, i) for i in l):
         for r in revs:
             if anc is None:
                 anc = r
@@ -305,7 +305,7 @@ 
     return []
 
 def _ancestors(repo, subset, x, followfirst=False):
-    args = getset(repo, list(repo), x)
+    args = getset(repo, repo.changelog, x)
     if not args:
         return []
     s = set(_revancestors(repo, args, followfirst)) | set(args)
@@ -432,7 +432,7 @@ 
         else:
             return [r for r in subset if matcher(repo[r].branch())]
 
-    s = getset(repo, list(repo), x)
+    s = getset(repo, repo.changelog, x)
     b = set()
     for r in s:
         b.add(repo[r].branch())
@@ -511,7 +511,7 @@ 
     """``children(set)``
     Child changesets of changesets in set.
     """
-    s = set(getset(repo, list(repo), x))
+    s = set(getset(repo, repo.changelog, x))
     cs = _children(repo, subset, s)
     return [r for r in subset if r in cs]
 
@@ -592,7 +592,7 @@ 
     return l
 
 def _descendants(repo, subset, x, followfirst=False):
-    args = getset(repo, list(repo), x)
+    args = getset(repo, repo.changelog, x)
     if not args:
         return []
     s = set(_revdescendants(repo, args, followfirst)) | set(args)
@@ -616,9 +616,9 @@ 
     is the same as passing all().
     """
     if x is not None:
-        args = set(getset(repo, list(repo), x))
+        args = set(getset(repo, repo.changelog, x))
     else:
-        args = set(getall(repo, list(repo), x))
+        args = set(getall(repo, repo.changelog, x))
 
     dests = set()
 
@@ -932,7 +932,7 @@ 
         # i18n: "limit" is a keyword
         raise error.ParseError(_("limit expects a number"))
     ss = set(subset)
-    os = getset(repo, list(repo), l[0])[:lim]
+    os = getset(repo, repo.changelog, l[0])[:lim]
     return [r for r in os if r in ss]
 
 def last(repo, subset, x):
@@ -950,14 +950,14 @@ 
         # i18n: "last" is a keyword
         raise error.ParseError(_("last expects a number"))
     ss = set(subset)
-    os = getset(repo, list(repo), l[0])[-lim:]
+    os = getset(repo, repo.changelog, l[0])[-lim:]
     return [r for r in os if r in ss]
 
 def maxrev(repo, subset, x):
     """``max(set)``
     Changeset with highest revision number in set.
     """
-    os = getset(repo, list(repo), x)
+    os = getset(repo, repo.changelog, x)
     if os:
         m = max(os)
         if m in subset:
@@ -994,7 +994,7 @@ 
     """``min(set)``
     Changeset with lowest revision number in set.
     """
-    os = getset(repo, list(repo), x)
+    os = getset(repo, repo.changelog, x)
     if os:
         m = min(os)
         if m in subset:
@@ -1044,9 +1044,9 @@ 
     for the first operation is selected.
     """
     if x is not None:
-        args = set(getset(repo, list(repo), x))
+        args = set(getset(repo, repo.changelog, x))
     else:
-        args = set(getall(repo, list(repo), x))
+        args = set(getall(repo, repo.changelog, x))
 
     def _firstsrc(rev):
         src = _getrevsource(repo, rev)
@@ -1096,7 +1096,7 @@ 
 
     ps = set()
     cl = repo.changelog
-    for r in getset(repo, list(repo), x):
+    for r in getset(repo, cl, x):
         ps.add(cl.parentrevs(r)[0])
     return [r for r in subset if r in ps]
 
@@ -1114,7 +1114,7 @@ 
 
     ps = set()
     cl = repo.changelog
-    for r in getset(repo, list(repo), x):
+    for r in getset(repo, cl, x):
         ps.add(cl.parentrevs(r)[1])
     return [r for r in subset if r in ps]
 
@@ -1128,7 +1128,7 @@ 
 
     ps = set()
     cl = repo.changelog
-    for r in getset(repo, list(repo), x):
+    for r in getset(repo, cl, x):
         ps.update(cl.parentrevs(r))
     return [r for r in subset if r in ps]