Patchwork [4,of,4] revset: do not nest addset by "addset + other" (issue4565)

login
register
mail settings
Submitter Yuya Nishihara
Date May 19, 2015, 12:18 p.m.
Message ID <72e8a0b846616a92781b.1432037901@mimosa>
Download mbox | patch
Permalink /patch/9169/
State Changes Requested
Delegated to: Pierre-Yves David
Headers show

Comments

Yuya Nishihara - May 19, 2015, 12:18 p.m.
# HG changeset patch
# User Yuya Nishihara <yuya@tcha.org>
# Date 1426409247 -32400
#      Sun Mar 15 17:47:27 2015 +0900
# Node ID 72e8a0b846616a92781beda44eab8d6c39c91933
# Parent  9996864f9e14a826ebe20df3a48905a662fe8eba
revset: do not nest addset by "addset + other" (issue4565)

This fixes the crash caused by repeated -rREV options. Still chained OR
operations in single revset expression will exceed the Python stack limit
at the parsing phase.
Pierre-Yves David - May 21, 2015, 4:26 a.m.
On 05/19/2015 07:18 AM, Yuya Nishihara wrote:
> # HG changeset patch
> # User Yuya Nishihara <yuya@tcha.org>
> # Date 1426409247 -32400
> #      Sun Mar 15 17:47:27 2015 +0900
> # Node ID 72e8a0b846616a92781beda44eab8d6c39c91933
> # Parent  9996864f9e14a826ebe20df3a48905a662fe8eba
> revset: do not nest addset by "addset + other" (issue4565)

There is good idea in this series and some interesting win.

However, this is adding significant more complexity and special casing 
to the code. I'm feeling very uncomfortable about that since the 
smartset class are already "not simple". I feel like we'll need to start 
converting the code object to C class soon and every special case here 
(eg: new class). There is also some more generic optimisation that could 
be worthwhile. We should meet on IRC/video call to discuss the situation 
in details

Here is the nice thing I see in your series:
- The logarithmic combination used for the sorting,
- The creation of a wider object when something is added to a addset 
(wider instead of deeper)

There is couple of generic stuff that are worth looking at too:
- The optimizer could directly build a balanced tree for addition 
instead of making it pure depth.
- We could directly performs add operation for smartset that are already 
all known (baseset + baseset, the case that actually interest use for 
both currently open issue)

Also
- We should also determine if it is safe to alter smartset in place. 
This could simplify multiple things.

When and how could we discuss this?

(This is not a series, rejection, some of it may be taken as is after 
such discussion)
Yuya Nishihara - May 21, 2015, 12:28 p.m.
On Wed, 20 May 2015 23:26:35 -0500, Pierre-Yves David wrote:
> On 05/19/2015 07:18 AM, Yuya Nishihara wrote:
> > # HG changeset patch
> > # User Yuya Nishihara <yuya@tcha.org>
> > # Date 1426409247 -32400
> > #      Sun Mar 15 17:47:27 2015 +0900
> > # Node ID 72e8a0b846616a92781beda44eab8d6c39c91933
> > # Parent  9996864f9e14a826ebe20df3a48905a662fe8eba
> > revset: do not nest addset by "addset + other" (issue4565)
> 
> There is good idea in this series and some interesting win.
> 
> However, this is adding significant more complexity and special casing 
> to the code. I'm feeling very uncomfortable about that since the 
> smartset class are already "not simple". I feel like we'll need to start
> converting the code object to C class soon and every special case here
> (eg: new class).

Perhaps we can merge unionset and addset in C implementation. It just exists
to avoid the use of "any()".

> There is also some more generic optimisation that could
> be worthwhile. We should meet on IRC/video call to discuss the situation
> in details
>
> Here is the nice thing I see in your series:
> - The logarithmic combination used for the sorting,
> - The creation of a wider object when something is added to a addset 
> (wider instead of deeper)
> 
> There is couple of generic stuff that are worth looking at too:
> - The optimizer could directly build a balanced tree for addition
> instead of making it pure depth.

Yes for "a + b + c", but probably no for "-r a -r b -r c".

> - We could directly performs add operation for smartset that are already
> all known (baseset + baseset, the case that actually interest use for
> both currently open issue)

Isn't it too optimistic?

FWIW, I have 10 more patches that will do

 - fix stack overflow on alias expansion and optimize() (issue4624)
 - build unionset([A, B, C]) directly from "a() + b() + c()"
   (perhaps, this can be balanced addsets)
 - build baseset([a, b, c]) directly from "a + b + c"

> Also
> - We should also determine if it is safe to alter smartset in place.
> This could simplify multiple things.
> 
> When and how could we discuss this?

My IRC time is around 12:00-14:00 UTC (21:00-23:00 JST) on weekdays,
or weekend?

Regards,
Yuya Nishihara - May 22, 2015, 12:29 p.m.
On Thu, 21 May 2015 23:22:03 -0500, Pierre-Yves David wrote:
> On 05/21/2015 07:28 AM, Yuya Nishihara wrote:
> > On Wed, 20 May 2015 23:26:35 -0500, Pierre-Yves David wrote:
> >> On 05/19/2015 07:18 AM, Yuya Nishihara wrote:
> >>> # HG changeset patch
> >>> # User Yuya Nishihara <yuya@tcha.org>
> >>> # Date 1426409247 -32400
> >>> #      Sun Mar 15 17:47:27 2015 +0900
> >>> # Node ID 72e8a0b846616a92781beda44eab8d6c39c91933
> >>> # Parent  9996864f9e14a826ebe20df3a48905a662fe8eba
> >>> revset: do not nest addset by "addset + other" (issue4565)
> >>
> >> There is good idea in this series and some interesting win.
> >>
> >> However, this is adding significant more complexity and special casing
> >> to the code. I'm feeling very uncomfortable about that since the
> >> smartset class are already "not simple". I feel like we'll need to start
> >> converting the code object to C class soon and every special case here
> >> (eg: new class).
> >
> > Perhaps we can merge unionset and addset in C implementation. It just exists
> > to avoid the use of "any()".
> 
> Since the only different is for the __contains__ and __nonzero__ case,
> we could keep a single class: 'addset(*subsets)' and monkey patch this
> them in the len(subsets) == 2 case. We are already using method monkey
> patching for optimisation purpose in other place of the code.
>
> (and yes, I kindof insist on the *subsets) syntax.

Okay, I prefer sub-classing than monkey-patching, and an explicit list than
*expansion, but they're just a matter of taste. And we won't need them if
we choose the balanced addsets plan.

> >> There is also some more generic optimisation that could
> >> be worthwhile. We should meet on IRC/video call to discuss the situation
> >> in details
> >>
> >> Here is the nice thing I see in your series:
> >> - The logarithmic combination used for the sorting,
> >> - The creation of a wider object when something is added to a addset
> >> (wider instead of deeper)
> >>
> >> There is couple of generic stuff that are worth looking at too:
> >> - The optimizer could directly build a balanced tree for addition
> >> instead of making it pure depth.
> >
> > Yes for "a + b + c", but probably no for "-r a -r b -r c".
> 
> the -r case seems even simpler as we control the creation of the things
> from start, the revrange logic should probably (1) first get a smartset
> from all elements in the list (2) then combine them with either a
> balanced tree addset or a addset(*results) if we get to that.

Good point. It seems I just didn't want to touch the big loop in revrange().
I'll take a look.

> Since your current union set implementation is already relying on
> "balanced tree" for its recursive execution of __iter__, we can probably
> rely on balanced tree first.
[...]
> 1) build baseset([a, b, c]) directly from "a + b + c"
>    (bonus point if it simple enough to go on stable)
> 
> 2) leave all of 'addset' implementation but for __init__,
>     We allow 'addset(*subsets)' and this automatically build a balanced 
> tree of addset with 2 children in each node.
> 
> 3) We use a single addset() call to unify all elements in a revrange() call
> 
> 4) The optimizer transforms nested addset into a single addset() call.
> 
> 5.1) We look at the benefit of replacing the tree build by 'addset' with 
> a single class.
> 
> 5.2) We play with the idea of building baseset from operation involving 
> baseset more directly.

(1) depends on (4).

My plan (that will use balanced addsets):

1) transform syntax tree: (+ (+ a b) c) -> (+ a b c)
2) build balanced addsets from it (general case)
3) optimize for trivial revisions: "a + b + c" -> baseset([a, b, c])
4) fix scmutil.revrange() not to chain addsets

> > My IRC time is around 12:00-14:00 UTC (21:00-23:00 JST) on weekdays,
> > or weekend?
> 
> Urg this is 03:00 05:00 in my timezone. I understand why we very rarely 
> cross each other.

I live in the other side of the earth. ;)

Patch

diff --git a/mercurial/revset.py b/mercurial/revset.py
--- a/mercurial/revset.py
+++ b/mercurial/revset.py
@@ -3074,6 +3074,15 @@  class unionset(abstractsmartset):
     >>> assert rs.fastdesc is None
     >>> [x for x in rs]
     [5, 4, 3, 2, 0, -1]
+
+    add set to existing union:
+    >>> rs = addset(xs, ys) + zs
+    >>> assert isinstance(rs, unionset)
+    >>> [x for x in rs]
+    [0, 3, 2, 5, 4, -1]
+    >>> rs = addset(xs, ys, ascending=True) + zs
+    >>> [x for x in rs]
+    [0, 2, 3, 4, 5, -1]
     """
     def __init__(self, subsets, ascending=None):
         self._subsets = subsets
@@ -3199,6 +3208,11 @@  class unionset(abstractsmartset):
         self.reverse()
         return val
 
+    def __add__(self, other):
+        if self._ascending is not None:
+            return super(unionset, self).__add__(other)
+        return unionset(self._subsets + [other])
+
     def __repr__(self):
         d = {None: '', False: '-', True: '+'}[self._ascending]
         return '<%s%s %s>' % (type(self).__name__, d,
diff --git a/tests/test-revset.t b/tests/test-revset.t
--- a/tests/test-revset.t
+++ b/tests/test-revset.t
@@ -142,10 +142,9 @@  trivial
       ('symbol', '1'))
     ('symbol', '2'))
   * set:
-  <addset
-    <addset
-      <baseset [0]>,
-      <baseset [1]>>,
+  <unionset
+    <baseset [0]>,
+    <baseset [1]>,
     <baseset [2]>>
   0
   1
@@ -919,6 +918,12 @@  test that `or` operation skips duplicate
   4
   5
 
+test that repeated `-r` options never eat up stack (issue4565)
+(uses `-r (0)` instead of `-r 0` to bypass old-style parser)
+
+  $ hg log -T '{rev}\n' `python -c "for i in xrange(200): print '-r (0) ',"`
+  0
+
 check that conversion to only works
   $ try --optimize '::3 - ::1'
   (minus