Submitter | Gregory Szorc |
---|---|
Date | March 25, 2014, 3:01 a.m. |
Message ID | <6f64e244c57706cfb123.1395716469@77.1.168.192.in-addr.arpa> |
Download | mbox | patch |
Permalink | /patch/4053/ |
State | Accepted |
Commit | 3210b79308994a464f8d0876e0c0da66cec79677 |
Headers | show |
Comments
On 3/24/14 8:01 PM, "Gregory Szorc" <gregory.szorc@gmail.com> wrote: ># HG changeset patch ># User Gregory Szorc <gregory.szorc@gmail.com> ># Date 1395716418 25200 ># Mon Mar 24 20:00:18 2014 -0700 ># Node ID 6f64e244c57706cfb123c32c4fadef63690eed40 ># Parent 3879ac3858ffd9bb46e19fcc3a2b31d7bb2b54c5 >revset: improve performance of _generatorset.__contains__ (issue 4201) Looks good to me. I tried it in our repo using this revset: (children(ancestor(X, Y)) and ::(X)):: , which is what rebase does A recent version of @ takes 40 seconds With this patch it takes 20 seconds With 2.9 it takes 10 seconds So there's still room for improvement. I'm wondering why this didn't show up more brightly on our internal charts :/
On Tue, Mar 25, 2014 at 05:42:43PM +0000, Durham Goode wrote: > On 3/24/14 8:01 PM, "Gregory Szorc" <gregory.szorc@gmail.com> wrote: > > ># HG changeset patch > ># User Gregory Szorc <gregory.szorc@gmail.com> > ># Date 1395716418 25200 > ># Mon Mar 24 20:00:18 2014 -0700 > ># Node ID 6f64e244c57706cfb123c32c4fadef63690eed40 > ># Parent 3879ac3858ffd9bb46e19fcc3a2b31d7bb2b54c5 > >revset: improve performance of _generatorset.__contains__ (issue 4201) > > > Looks good to me. I tried it in our repo using this revset: Agreed. Queued. > (children(ancestor(X, Y)) and ::(X)):: , which is what rebase does > > A recent version of @ takes 40 seconds > With this patch it takes 20 seconds > With 2.9 it takes 10 seconds > > So there's still room for improvement. > > I'm wondering why this didn't show up more brightly on our internal charts > :/ > > _______________________________________________ > Mercurial-devel mailing list > Mercurial-devel@selenic.com > http://selenic.com/mailman/listinfo/mercurial-devel
On 3/25/14, 12:31 PM, Augie Fackler wrote: > On Tue, Mar 25, 2014 at 05:42:43PM +0000, Durham Goode wrote: >> On 3/24/14 8:01 PM, "Gregory Szorc" <gregory.szorc@gmail.com> wrote: >> >>> # HG changeset patch >>> # User Gregory Szorc <gregory.szorc@gmail.com> >>> # Date 1395716418 25200 >>> # Mon Mar 24 20:00:18 2014 -0700 >>> # Node ID 6f64e244c57706cfb123c32c4fadef63690eed40 >>> # Parent 3879ac3858ffd9bb46e19fcc3a2b31d7bb2b54c5 >>> revset: improve performance of _generatorset.__contains__ (issue 4201) >> >> >> Looks good to me. I tried it in our repo using this revset: > > Agreed. Queued. FWIW, I've been running with this patch all day today and I've noticed performance improvements all across the board. Rebases are obviously faster. Evolve is much faster as well. Mercurial no longer feels sluggish when operating on my 200,000+ changeset repo.
On 03/25/2014 02:26 PM, Gregory Szorc wrote: > On 3/25/14, 12:31 PM, Augie Fackler wrote: >> On Tue, Mar 25, 2014 at 05:42:43PM +0000, Durham Goode wrote: >>> On 3/24/14 8:01 PM, "Gregory Szorc" <gregory.szorc@gmail.com> wrote: >>> >>>> # HG changeset patch >>>> # User Gregory Szorc <gregory.szorc@gmail.com> >>>> # Date 1395716418 25200 >>>> # Mon Mar 24 20:00:18 2014 -0700 >>>> # Node ID 6f64e244c57706cfb123c32c4fadef63690eed40 >>>> # Parent 3879ac3858ffd9bb46e19fcc3a2b31d7bb2b54c5 >>>> revset: improve performance of _generatorset.__contains__ (issue 4201) >>> >>> >>> Looks good to me. I tried it in our repo using this revset: >> >> Agreed. Queued. > > FWIW, I've been running with this patch all day today and I've noticed > performance improvements all across the board. Rebases are obviously > faster. Evolve is much faster as well. Mercurial no longer feels > sluggish when operating on my 200,000+ changeset repo. Can you add one (or few) revset showing the issue to contrib/revsetbenchmarks.txt So we can eventually catch regression?
On 3/25/14 2:26 PM, "Gregory Szorc" <gregory.szorc@gmail.com> wrote: >On 3/25/14, 12:31 PM, Augie Fackler wrote: >> On Tue, Mar 25, 2014 at 05:42:43PM +0000, Durham Goode wrote: >>> On 3/24/14 8:01 PM, "Gregory Szorc" <gregory.szorc@gmail.com> wrote: >>> >>>> # HG changeset patch >>>> # User Gregory Szorc <gregory.szorc@gmail.com> >>>> # Date 1395716418 25200 >>>> # Mon Mar 24 20:00:18 2014 -0700 >>>> # Node ID 6f64e244c57706cfb123c32c4fadef63690eed40 >>>> # Parent 3879ac3858ffd9bb46e19fcc3a2b31d7bb2b54c5 >>>> revset: improve performance of _generatorset.__contains__ (issue 4201) >>> >>> >>> Looks good to me. I tried it in our repo using this revset: >> >> Agreed. Queued. > >FWIW, I've been running with this patch all day today and I've noticed >performance improvements all across the board. Rebases are obviously >faster. Evolve is much faster as well. Mercurial no longer feels >sluggish when operating on my 200,000+ changeset repo. I just found a bug in this patch. In: def __iter__(self): if self._iterated: # At least a part of the list should be cached if iteration has # started over the generatorset. for l in self._genlist: yield l for item in self._consumegen(): yield item def _consumegen(self): self._iterated = True for item in self._gen: self._cache[item] = True self._genlist.append(item) yield item If two revsets are running _consumegen on the same revset at the same time, they could both be stepping through self._gen inside of their respective _consumegen calls. Causing them to see different, mutually exclusive outputs of _consumegen. I noticed it due to changes I'm making to improve root() and _descendants(), so I don't have a repro for @ at the moment. I'll try to think up a fix.
On 3/25/14, 3:38 PM, Durham Goode wrote: > On 3/25/14 2:26 PM, "Gregory Szorc" <gregory.szorc@gmail.com> wrote: > >> On 3/25/14, 12:31 PM, Augie Fackler wrote: >>> On Tue, Mar 25, 2014 at 05:42:43PM +0000, Durham Goode wrote: >>>> On 3/24/14 8:01 PM, "Gregory Szorc" <gregory.szorc@gmail.com> wrote: >>>> >>>>> # HG changeset patch >>>>> # User Gregory Szorc <gregory.szorc@gmail.com> >>>>> # Date 1395716418 25200 >>>>> # Mon Mar 24 20:00:18 2014 -0700 >>>>> # Node ID 6f64e244c57706cfb123c32c4fadef63690eed40 >>>>> # Parent 3879ac3858ffd9bb46e19fcc3a2b31d7bb2b54c5 >>>>> revset: improve performance of _generatorset.__contains__ (issue 4201) >>>> >>>> >>>> Looks good to me. I tried it in our repo using this revset: >>> >>> Agreed. Queued. >> >> FWIW, I've been running with this patch all day today and I've noticed >> performance improvements all across the board. Rebases are obviously >> faster. Evolve is much faster as well. Mercurial no longer feels >> sluggish when operating on my 200,000+ changeset repo. > > I just found a bug in this patch. In: > > def __iter__(self): > if self._iterated: > # At least a part of the list should be cached if iteration has > # started over the generatorset. > for l in self._genlist: > yield l > > for item in self._consumegen(): > yield item > > def _consumegen(self): > self._iterated = True > > for item in self._gen: > self._cache[item] = True > self._genlist.append(item) > yield item > > > If two revsets are running _consumegen on the same revset at the same > time, they could both be stepping through self._gen inside of their > respective _consumegen calls. Causing them to see different, mutually > exclusive outputs of _consumegen. > > I noticed it due to changes I'm making to improve root() and > _descendants(), so I don't have a repro for @ at the moment. I'll try to > think up a fix. Oh, hmmm. I think you meant to say "respective __iter__ calls" right? I don't see the problem of having multiple consumers of _consumegen, as each _consumegen frame will be consuming from the same underlying generator and the results of consuming from that generator should be idempotent, regardless of which consumer was doing it. I see a few solutions: 1) Check the expected length of self._genlist after each item from _consumegen is obtained in __iter__. If we didn't resume where we left off (someone else consumed via _consumegen), yield the missing elements first. 2) Implement the iterator protocol manually. See http://docs.python.org/2/library/stdtypes.html#generator-types. Essentially, "yield" does magical things. We could implement our own iterator type and/or perform manual traversal of the underlying generator. I think #2 - while more complicated - might be more readable. Although it involves light Python magic.
On 3/25/14 4:12 PM, "Gregory Szorc" <gregory.szorc@gmail.com> wrote: >On 3/25/14, 3:38 PM, Durham Goode wrote: >> On 3/25/14 2:26 PM, "Gregory Szorc" <gregory.szorc@gmail.com> wrote: >> >>> On 3/25/14, 12:31 PM, Augie Fackler wrote: >>>> On Tue, Mar 25, 2014 at 05:42:43PM +0000, Durham Goode wrote: >>>>> On 3/24/14 8:01 PM, "Gregory Szorc" <gregory.szorc@gmail.com> wrote: >>>>> >>>>>> # HG changeset patch >>>>>> # User Gregory Szorc <gregory.szorc@gmail.com> >>>>>> # Date 1395716418 25200 >>>>>> # Mon Mar 24 20:00:18 2014 -0700 >>>>>> # Node ID 6f64e244c57706cfb123c32c4fadef63690eed40 >>>>>> # Parent 3879ac3858ffd9bb46e19fcc3a2b31d7bb2b54c5 >>>>>> revset: improve performance of _generatorset.__contains__ (issue >>>>>>4201) >>>>> >>>>> >>>>> Looks good to me. I tried it in our repo using this revset: >>>> >>>> Agreed. Queued. >>> >>> FWIW, I've been running with this patch all day today and I've noticed >>> performance improvements all across the board. Rebases are obviously >>> faster. Evolve is much faster as well. Mercurial no longer feels >>> sluggish when operating on my 200,000+ changeset repo. >> >> I just found a bug in this patch. In: >> >> def __iter__(self): >> if self._iterated: >> # At least a part of the list should be cached if >>iteration has >> # started over the generatorset. >> for l in self._genlist: >> yield l >> >> for item in self._consumegen(): >> yield item >> >> def _consumegen(self): >> self._iterated = True >> >> for item in self._gen: >> self._cache[item] = True >> self._genlist.append(item) >> yield item >> >> >> If two revsets are running _consumegen on the same revset at the same >> time, they could both be stepping through self._gen inside of their >> respective _consumegen calls. Causing them to see different, mutually >> exclusive outputs of _consumegen. >> >> I noticed it due to changes I'm making to improve root() and >> _descendants(), so I don't have a repro for @ at the moment. I'll try >>to >> think up a fix. > >Oh, hmmm. > >I think you meant to say "respective __iter__ calls" right? I don't see >the problem of having multiple consumers of _consumegen, as each >_consumegen frame will be consuming from the same underlying generator >and the results of consuming from that generator should be idempotent, >regardless of which consumer was doing it. > >I see a few solutions: > >1) Check the expected length of self._genlist after each item from >_consumegen is obtained in __iter__. If we didn't resume where we left >off (someone else consumed via _consumegen), yield the missing elements >first. > >2) Implement the iterator protocol manually. See >https://urldefense.proofpoint.com/v1/url?u=http://docs.python.org/2/librar >y/stdtypes.html%23generator-types&k=ZVNjlDMF0FElm4dQtryO4A%3D%3D%0A&r=pHOG >6Hz51SkYmYr%2FxoTFzw%3D%3D%0A&m=vZ%2FOy4UTZKbqEw13a2p0h7jGROCW40ZPTQADNyt1 >0ok%3D%0A&s=171f576db00836698a9903475016ff0a991c99b2db8604948fde0b6f12e27c >c6. >Essentially, "yield" does magical things. We could implement our own >iterator type and/or perform manual traversal of the underlying generator. > >I think #2 - while more complicated - might be more readable. Although >it involves light Python magic. > I've already created a patch that does #1. Running tests on it now. I'll either send it out by itself, or with my other perf patches.
Patch
diff --git a/mercurial/revset.py b/mercurial/revset.py --- a/mercurial/revset.py +++ b/mercurial/revset.py @@ -2625,42 +2625,43 @@ class _generatorset(object): self._genlist = baseset([]) self._iterated = False self._finished = False def __contains__(self, x): if x in self._cache: return self._cache[x] - # Use __iter__ which caches values and stores them into self._genlist - for l in self: + # Use new values only, as existing values would be cached. + for l in self._consumegen(): if l == x: return True self._finished = True self._cache[x] = False return False def __iter__(self): if self._iterated: # At least a part of the list should be cached if iteration has # started over the generatorset. for l in self._genlist: yield l - else: - # Starting iteration over the generatorset. - self._iterated = True + + for item in self._consumegen(): + yield item + + def _consumegen(self): + self._iterated = True for item in self._gen: self._cache[item] = True self._genlist.append(item) yield item - # Iteration over the generator has finished. Whole value list should be - # cached in self._genlist self._finished = True def set(self): return self def sort(self, reverse=False): if not self._finished: for i in self: @@ -2675,17 +2676,18 @@ class _ascgeneratorset(_generatorset): This class does not duck-type baseset and it's only supposed to be used internally """ def __contains__(self, x): if x in self._cache: return self._cache[x] - for l in self: + # Use new values only, as existing values would be cached. + for l in self._consumegen(): if l == x: return True if l > x: break self._cache[x] = False return False @@ -2697,17 +2699,18 @@ class _descgeneratorset(_generatorset): This class does not duck-type baseset and it's only supposed to be used internally """ def __contains__(self, x): if x in self._cache: return self._cache[x] - for l in self: + # Use new values only, as existing values would be cached. + for l in self._consumegen(): if l == x: return True if l < x: break self._cache[x] = False return False