Patchwork [3,of,6] revset: do not try to sort revisions unnecessarily by addset._iterator()

login
register
mail settings
Submitter Pierre-Yves David
Date May 14, 2015, 5:38 a.m.
Message ID <555434D8.2010402@ens-lyon.org>
Download mbox | patch
Permalink /patch/9067/
State Not Applicable
Headers show

Comments

Pierre-Yves David - May 14, 2015, 5:38 a.m.
On 05/13/2015 07:39 AM, Yuya Nishihara wrote:
> On Tue, 12 May 2015 17:52:30 -0700, Pierre-Yves David wrote:
>> On 05/12/2015 03:59 PM, Yuya Nishihara wrote:
>>> # HG changeset patch
>>> # User Yuya Nishihara <yuya@tcha.org>
>>> # Date 1427716479 -32400
>>> #      Mon Mar 30 20:54:39 2015 +0900
>>> # Node ID cc7253d43150ef937aca205f2cb66a569378aeea
>>> # Parent  e487cb80cc333a4d92376daa29c08fa703c5b249
>>> revset: do not try to sort revisions unnecessarily by addset._iterator()
>>>
>>> This fixes addset to remove duplicated revisions from right-hand-side set
>>> even if one of the subset does not support fastasc/fastdesc.
>>>
>>> addset._iterator() did not do the right thing if _ascending is set, because
>>> addset._iterordered() expects that the given sets are sorted. This patch
>>> removes the broken implementation. There's no need for addset._iterator()
>>> to sort revisions because the result will be sorted by _trysetasclist()
>>> as necessary.
>>
>> I agree the current behavior is buggy. But I'm dubious about your the
>> way to solve it.
>>
>> The special casing of 'self._ascending' is a good correctness and
>> optimisation purpose.
>>
>> I see two potential issues here:
>>
>> 1) we use iterator from self._r1 and self._r2 without ensuring they are
>> sorted in the order we need them to be sorted (or using appropriate
>> directed iterator).
>>
>> 2) _iterordered is apparently not doing deduplication.
>>
>> Fixing these two should solve the bug without performance and
>> correctness impact?
>>
>> As far as I understand your patch,  sorting the addset would result into
>> a sorted output most case.
>
> _iterordered() can remove duplicates if both r1 and r2 are sorted. So we have
> to fix (1).

This small patch seems to fix the duplication issue and fix the problem.


> But, when _iterator() is called with _ascending is not None (*1), we know at
> least one of r1 or r2 does not support fastasc/fastdesc. So we'll have to
> consume r1 or r2 anyway to get it sorted. I think the use of _iterordered() had
> little benefit in this case.
>
>   (*1) except when called from __len__(), which can be assumed to be slow

Is _ascending is not None, the iteration is requested to be done in a 
sorted order. I would assume than using _iterator here could give us 
data in the wrong order, but your test seems to disagree. I'll give a 
deeper look at why the code still behave right tomorrow.

In all cases, relying on sorting in the subset seems wise since each 
subset may have its own efficient way to get data in the right order.

>> I also would like you to check performance impact when touching revset
>> code. This means looking at the output from
>> 'contrib/revsetbenchmarks.py' with at least content of
>> 'contrib/revsetbenchmarks.t' (if the help of the script is too bad,
>> point it out and we'll improve that).
>
> Hmm, I got notable difference by "roots((0:tip)::)". This seems to be caused
> by the additional for-loop introduced by "revset: make addset class accepts
> more than two subsets". Perhaps, I'll have to optimize addset.__contains__ and
> _iterordered() for two subsets.

This is fairly hot code path we have to be careful.

Thanks for looking into this.
Yuya Nishihara - May 14, 2015, 2:27 p.m.
On Wed, 13 May 2015 22:38:32 -0700, Pierre-Yves David wrote:
> On 05/13/2015 07:39 AM, Yuya Nishihara wrote:
> > On Tue, 12 May 2015 17:52:30 -0700, Pierre-Yves David wrote:
> >> On 05/12/2015 03:59 PM, Yuya Nishihara wrote:
> >>> # HG changeset patch
> >>> # User Yuya Nishihara <yuya@tcha.org>
> >>> # Date 1427716479 -32400
> >>> #      Mon Mar 30 20:54:39 2015 +0900
> >>> # Node ID cc7253d43150ef937aca205f2cb66a569378aeea
> >>> # Parent  e487cb80cc333a4d92376daa29c08fa703c5b249
> >>> revset: do not try to sort revisions unnecessarily by addset._iterator()
> >>>
> >>> This fixes addset to remove duplicated revisions from right-hand-side set
> >>> even if one of the subset does not support fastasc/fastdesc.
> >>>
> >>> addset._iterator() did not do the right thing if _ascending is set, because
> >>> addset._iterordered() expects that the given sets are sorted. This patch
> >>> removes the broken implementation. There's no need for addset._iterator()
> >>> to sort revisions because the result will be sorted by _trysetasclist()
> >>> as necessary.
> >>
> >> I agree the current behavior is buggy. But I'm dubious about your the
> >> way to solve it.
> >>
> >> The special casing of 'self._ascending' is a good correctness and
> >> optimisation purpose.
> >>
> >> I see two potential issues here:
> >>
> >> 1) we use iterator from self._r1 and self._r2 without ensuring they are
> >> sorted in the order we need them to be sorted (or using appropriate
> >> directed iterator).
> >>
> >> 2) _iterordered is apparently not doing deduplication.
> >>
> >> Fixing these two should solve the bug without performance and
> >> correctness impact?
> >>
> >> As far as I understand your patch,  sorting the addset would result into
> >> a sorted output most case.
> >
> > _iterordered() can remove duplicates if both r1 and r2 are sorted. So we have
> > to fix (1).
> 
> This small patch seems to fix the duplication issue and fix the problem.
> 
> diff --git a/mercurial/revset.py b/mercurial/revset.py
> --- a/mercurial/revset.py
> +++ b/mercurial/revset.py
> @@ -3044,4 +3044,6 @@ class addset(abstractsmartset):
>               gen = gen()
>           else:
> +            self._r1.sort(not self._ascending)
> +            self._r2.sort(not self._ascending)
>               iter1 = iter(self._r1)
>               iter2 = iter(self._r2)

Yes. I was afraid of mutating r1 and r2 here because input sets might be
shared.

> > But, when _iterator() is called with _ascending is not None (*1), we know at
> > least one of r1 or r2 does not support fastasc/fastdesc. So we'll have to
> > consume r1 or r2 anyway to get it sorted. I think the use of _iterordered() had
> > little benefit in this case.
> >
> >   (*1) except when called from __len__(), which can be assumed to be slow
> 
> Is _ascending is not None, the iteration is requested to be done in a
> sorted order. I would assume than using _iterator here could give us
> data in the wrong order, but your test seems to disagree. I'll give a
> deeper look at why the code still behave right tomorrow.

That is because the result of _iterator() is sorted by _trysetasclist().

  __iter__()
    _list  # consume the gen and try again
      _genlist = _iterator()
    __iter__()  # reenter
      _trysetasclist()
        _asclist = sorted(_genlist)
      fastasc() or fastdesc()
        return _asclist.__iter__ or _asclist.__reversed__

Regards,
Pierre-Yves David - May 15, 2015, 8:32 a.m.
On 05/14/2015 07:27 AM, Yuya Nishihara wrote:
> On Wed, 13 May 2015 22:38:32 -0700, Pierre-Yves David wrote:
>> On 05/13/2015 07:39 AM, Yuya Nishihara wrote:
>>> On Tue, 12 May 2015 17:52:30 -0700, Pierre-Yves David wrote:
>>>> On 05/12/2015 03:59 PM, Yuya Nishihara wrote:
>>>>> # HG changeset patch
>>>>> # User Yuya Nishihara <yuya@tcha.org>
>>>>> # Date 1427716479 -32400
>>>>> #      Mon Mar 30 20:54:39 2015 +0900
>>>>> # Node ID cc7253d43150ef937aca205f2cb66a569378aeea
>>>>> # Parent  e487cb80cc333a4d92376daa29c08fa703c5b249
>>>>> revset: do not try to sort revisions unnecessarily by addset._iterator()
>>>>>
>>>>> This fixes addset to remove duplicated revisions from right-hand-side set
>>>>> even if one of the subset does not support fastasc/fastdesc.
>>>>>
>>>>> addset._iterator() did not do the right thing if _ascending is set, because
>>>>> addset._iterordered() expects that the given sets are sorted. This patch
>>>>> removes the broken implementation. There's no need for addset._iterator()
>>>>> to sort revisions because the result will be sorted by _trysetasclist()
>>>>> as necessary.
>>>>
>>>> I agree the current behavior is buggy. But I'm dubious about your the
>>>> way to solve it.
>>>>
>>>> The special casing of 'self._ascending' is a good correctness and
>>>> optimisation purpose.
>>>>
>>>> I see two potential issues here:
>>>>
>>>> 1) we use iterator from self._r1 and self._r2 without ensuring they are
>>>> sorted in the order we need them to be sorted (or using appropriate
>>>> directed iterator).
>>>>
>>>> 2) _iterordered is apparently not doing deduplication.
>>>>
>>>> Fixing these two should solve the bug without performance and
>>>> correctness impact?
>>>>
>>>> As far as I understand your patch,  sorting the addset would result into
>>>> a sorted output most case.
>>>
>>> _iterordered() can remove duplicates if both r1 and r2 are sorted. So we have
>>> to fix (1).
>>
>> This small patch seems to fix the duplication issue and fix the problem.
>>
>> diff --git a/mercurial/revset.py b/mercurial/revset.py
>> --- a/mercurial/revset.py
>> +++ b/mercurial/revset.py
>> @@ -3044,4 +3044,6 @@ class addset(abstractsmartset):
>>                gen = gen()
>>            else:
>> +            self._r1.sort(not self._ascending)
>> +            self._r2.sort(not self._ascending)
>>                iter1 = iter(self._r1)
>>                iter2 = iter(self._r2)
>
> Yes. I was afraid of mutating r1 and r2 here because input sets might be
> shared.

I cannot remember if mutating you sub-smartset is a sin or not. I've 
patchbombed a version that stay quite conservative on this approach

>>> But, when _iterator() is called with _ascending is not None (*1), we know at
>>> least one of r1 or r2 does not support fastasc/fastdesc. So we'll have to
>>> consume r1 or r2 anyway to get it sorted. I think the use of _iterordered() had
>>> little benefit in this case.
>>>
>>>    (*1) except when called from __len__(), which can be assumed to be slow
>>
>> Is _ascending is not None, the iteration is requested to be done in a
>> sorted order. I would assume than using _iterator here could give us
>> data in the wrong order, but your test seems to disagree. I'll give a
>> deeper look at why the code still behave right tomorrow.
>
> That is because the result of _iterator() is sorted by _trysetasclist().
>
>    __iter__()
>      _list  # consume the gen and try again
>        _genlist = _iterator()
>      __iter__()  # reenter
>        _trysetasclist()
>          _asclist = sorted(_genlist)
>        fastasc() or fastdesc()
>          return _asclist.__iter__ or _asclist.__reversed__

This _iterator() thing is actually a big silly. I think we could be 
lazyer (cf my pre-cited patch)

Patch

diff --git a/mercurial/revset.py b/mercurial/revset.py
--- a/mercurial/revset.py
+++ b/mercurial/revset.py
@@ -3044,4 +3044,6 @@  class addset(abstractsmartset):
              gen = gen()
          else:
+            self._r1.sort(not self._ascending)
+            self._r2.sort(not self._ascending)
              iter1 = iter(self._r1)
              iter2 = iter(self._r2)