Patchwork [1,of,2] revlog: add __contains__ for fast membership test

login
register
mail settings
Submitter Yuya Nishihara
Date Feb. 4, 2015, 1:27 p.m.
Message ID <f65a44d4905a5b4767e4.1423056423@mimosa>
Download mbox | patch
Permalink /patch/7657/
State Accepted
Headers show

Comments

Yuya Nishihara - Feb. 4, 2015, 1:27 p.m.
# HG changeset patch
# User Yuya Nishihara <yuya@tcha.org>
# Date 1423052757 -32400
#      Wed Feb 04 21:25:57 2015 +0900
# Node ID f65a44d4905a5b4767e4b901de6ed1efe703611f
# Parent  e1dbe0b215ae137eec53ceb12440536d570a83d2
revlog: add __contains__ for fast membership test

Because revlog implements __iter__, "rev in revlog" works but does silly O(n)
lookup unexpectedly. So it seems good to add fast version of __contains__.

This allows "rev in repo.changelog" in the next patch.
Pierre-Yves David - Feb. 4, 2015, 4:21 p.m.
On 02/04/2015 01:27 PM, Yuya Nishihara wrote:
> # HG changeset patch
> # User Yuya Nishihara <yuya@tcha.org>
> # Date 1423052757 -32400
> #      Wed Feb 04 21:25:57 2015 +0900
> # Node ID f65a44d4905a5b4767e4b901de6ed1efe703611f
> # Parent  e1dbe0b215ae137eec53ceb12440536d570a83d2
> revlog: add __contains__ for fast membership test
>
> Because revlog implements __iter__, "rev in revlog" works but does silly O(n)
> lookup unexpectedly. So it seems good to add fast version of __contains__.
>
> This allows "rev in repo.changelog" in the next patch.

Pushed to the clowncopter with a couple of follow up idea.

>
> diff --git a/mercurial/changelog.py b/mercurial/changelog.py
> --- a/mercurial/changelog.py
> +++ b/mercurial/changelog.py
> @@ -143,6 +143,11 @@ class changelog(revlog.revlog):
>               if i not in self.filteredrevs:
>                   return self.node(i)
>
> +    def __contains__(self, rev):
> +        """filtered version of revlog.__contains__"""
> +        return (revlog.revlog.__contains__(self, rev)
> +                and rev not in self.filteredrevs)

I think there is some magical optimisation to do here, like hard coding 
the __contains__ content and maybe test for the existence of any 
filtered revision before checking for members
Gregory Szorc - Feb. 4, 2015, 4:29 p.m.
diff --git a/mercurial/changelog.py b/mercurial/changelog.py

> --- a/mercurial/changelog.py
> +++ b/mercurial/changelog.py
> @@ -143,6 +143,11 @@ class changelog(revlog.revlog):
>              if i not in self.filteredrevs:
>                  return self.node(i)
>
> +    def __contains__(self, rev):
> +        """filtered version of revlog.__contains__"""
> +        return (revlog.revlog.__contains__(self, rev)
> +                and rev not in self.filteredrevs)
> +
>

What you've implemented here is changelog.hasunfilteredrev(). I'm not
convinced __contains__ should be implemented:

a) to only accept numeric revisions (what about nodes)
b) to only operate on unfiltered revs

"a" applies to revlog.__contains__ as well.
Yuya Nishihara - Feb. 5, 2015, 12:03 p.m.
On Wed, 04 Feb 2015 16:21:05 +0000, Pierre-Yves David wrote:
> On 02/04/2015 01:27 PM, Yuya Nishihara wrote:
> > +    def __contains__(self, rev):
> > +        """filtered version of revlog.__contains__"""
> > +        return (revlog.revlog.__contains__(self, rev)
> > +                and rev not in self.filteredrevs)
> 
> I think there is some magical optimisation to do here, like hard coding 
> the __contains__ content and maybe test for the existence of any 
> filtered revision before checking for members

I thought about it, but I didn't.  There's no hot loop calling __contains__,
so I can't say it has benefit now.

Regards,
Yuya Nishihara - Feb. 5, 2015, 12:18 p.m.
On Wed, 4 Feb 2015 08:29:08 -0800, Gregory Szorc wrote:
> diff --git a/mercurial/changelog.py b/mercurial/changelog.py
> > --- a/mercurial/changelog.py
> > +++ b/mercurial/changelog.py
> > @@ -143,6 +143,11 @@ class changelog(revlog.revlog):
> >              if i not in self.filteredrevs:
> >                  return self.node(i)
> >
> > +    def __contains__(self, rev):
> > +        """filtered version of revlog.__contains__"""
> > +        return (revlog.revlog.__contains__(self, rev)
> > +                and rev not in self.filteredrevs)
> > +
> >
> 
> What you've implemented here is changelog.hasunfilteredrev(). I'm not
> convinced __contains__ should be implemented:
> 
> a) to only accept numeric revisions (what about nodes)
> b) to only operate on unfiltered revs
> 
> "a" applies to revlog.__contains__ as well.

What about hasrev(rev) because there's hasnode(node) ?
Anyway, I think __contains__ should be implemented to ban use of slow
"x in repo.changelog".

Regards,
Pierre-Yves David - Feb. 5, 2015, 12:53 p.m.
On 02/04/2015 04:29 PM, Gregory Szorc wrote:
> diff --git a/mercurial/changelog.py b/mercurial/changelog.py
>
>     --- a/mercurial/changelog.py
>     +++ b/mercurial/changelog.py
>     @@ -143,6 +143,11 @@ class changelog(revlog.revlog):
>                   if i not in self.filteredrevs:
>                       return self.node(i)
>
>     +    def __contains__(self, rev):
>     +        """filtered version of revlog.__contains__"""
>     +        return (revlog.revlog.__contains__(self, rev)
>     +                and rev not in self.filteredrevs)
>     +
>
>
> What you've implemented here is changelog.hasunfilteredrev(). I'm not
> convinced __contains__ should be implemented:

__contains__ should be implemented because otherwise, "rev in changelog" 
is still valid but ultra slow (uses __iter__).

> a) to only accept numeric revisions (what about nodes)

  The current semantic accepts revs only and I'm fine with that.

> b) to only operate on unfiltered revs

No it does not? It looks at `self.filteredrevs`.
Pierre-Yves David - Feb. 5, 2015, 12:55 p.m.
On 02/05/2015 12:03 PM, Yuya Nishihara wrote:
> On Wed, 04 Feb 2015 16:21:05 +0000, Pierre-Yves David wrote:
>> On 02/04/2015 01:27 PM, Yuya Nishihara wrote:
>>> +    def __contains__(self, rev):
>>> +        """filtered version of revlog.__contains__"""
>>> +        return (revlog.revlog.__contains__(self, rev)
>>> +                and rev not in self.filteredrevs)
>>
>> I think there is some magical optimisation to do here, like hard coding
>> the __contains__ content and maybe test for the existence of any
>> filtered revision before checking for members
>
> I thought about it, but I didn't.  There's no hot loop calling __contains__,
> so I can't say it has benefit now.

I think that hardlinking the revlog.__contains__ is simple and a pure 
win. The rest can probably wait for real usecase.

Patch

diff --git a/mercurial/changelog.py b/mercurial/changelog.py
--- a/mercurial/changelog.py
+++ b/mercurial/changelog.py
@@ -143,6 +143,11 @@  class changelog(revlog.revlog):
             if i not in self.filteredrevs:
                 return self.node(i)
 
+    def __contains__(self, rev):
+        """filtered version of revlog.__contains__"""
+        return (revlog.revlog.__contains__(self, rev)
+                and rev not in self.filteredrevs)
+
     def __iter__(self):
         """filtered version of revlog.__iter__"""
         if len(self.filteredrevs) == 0:
diff --git a/mercurial/revlog.py b/mercurial/revlog.py
--- a/mercurial/revlog.py
+++ b/mercurial/revlog.py
@@ -277,6 +277,8 @@  class revlog(object):
 
     def tip(self):
         return self.node(len(self.index) - 2)
+    def __contains__(self, rev):
+        return 0 <= rev < len(self)
     def __len__(self):
         return len(self.index) - 1
     def __iter__(self):