Patchwork [STABLE?] tags: create new sortdict for performance reasons

Submitter Gregory Szorc
Date Nov. 12, 2015, 9:17 p.m.
Message ID <c541bc4fc379f76d0cf3.1447363070@gps-mbp.local>
Permalink /patch/11383/
State Accepted

Comments

Gregory Szorc - Nov. 12, 2015, 9:17 p.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1447362964 28800
#      Thu Nov 12 13:16:04 2015 -0800
# Node ID c541bc4fc379f76d0cf30354942f78aaddd26f04
# Parent  150fb782ef80919e1c8cc7d4b047d78f889c7846
tags: create new sortdict for performance reasons

sortdict internally maintains a list of keys in insertion order. When a
key is replaced via __setitem__, we .remove() from this list. This
involves a linear scan and array adjustment. This is an expensive
operation.

The tags reading code was calling into sortdict.__setitem__ for each tag
in a read .hgtags revision. For repositories with thousands of tags or
thousands of .hgtags revisions, the overhead from list.remove() is
noticeable.

This patch creates a new sortdict() so __setitem__ calls don't incur a
list.remove.

This doesn't appear to have any performance impact on my Firefox
repository. But that's only because tags reading doesn't show up in
profiles to begin with. I'm still waiting to hear from a user with over
10,000 tags and hundreds of heads on the impact of this patch.
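
The cost described in the commit message can be sketched outside Mercurial. The class below is a simplified stand-in written only from the description above (a dict plus a list of keys in insertion order); it is not Mercurial's actual util.sortdict. The names SketchSortDict, make_filetags, collapse_in_place and collapse_into_new are hypothetical, and the tag data is made up. The two collapse functions mirror the loop before and after the patch; timings will vary by machine, so none are quoted here.

```python
import timeit


class SketchSortDict(dict):
    """Insertion-ordered dict sketch; NOT Mercurial's actual util.sortdict."""

    def __init__(self):
        dict.__init__(self)
        self._list = []  # keys in insertion order

    def __setitem__(self, key, value):
        if key in self:
            # Replacing a key: list.remove() scans for it and shifts
            # every element stored after it.
            self._list.remove(key)
        self._list.append(key)
        dict.__setitem__(self, key, value)

    def items(self):
        # Return a list so callers may assign while looping over it.
        return [(k, self[k]) for k in self._list]


def make_filetags(count):
    """Build hypothetical tag histories: tag name -> list of nodes."""
    d = SketchSortDict()
    for i in range(count):
        d["tag%d" % i] = ["old%d" % i, "new%d" % i]
    return d


def collapse_in_place(filetags):
    """Pre-patch shape: every assignment replaces an existing key."""
    for tag, taghist in filetags.items():
        filetags[tag] = (taghist[-1], taghist[:-1])
    return filetags


def collapse_into_new(filetags):
    """Post-patch shape: every assignment inserts into a fresh dict."""
    newtags = SketchSortDict()
    for tag, taghist in filetags.items():
        newtags[tag] = (taghist[-1], taghist[:-1])
    return newtags


if __name__ == "__main__":
    n = 5000  # arbitrary size chosen for illustration
    for fn in (collapse_in_place, collapse_into_new):
        elapsed = timeit.timeit(lambda: fn(make_filetags(n)), number=1)
        print("%-18s %.4fs" % (fn.__name__, elapsed))
```

Both functions produce the same items in the same order; only the number of list.remove() calls differs (one per tag in the first, none in the second).
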
Matt Mackall - Nov. 12, 2015, 11:11 p.m.
On Thu, 2015-11-12 at 13:17 -0800, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1447362964 28800
> #      Thu Nov 12 13:16:04 2015 -0800
> # Node ID c541bc4fc379f76d0cf30354942f78aaddd26f04
> # Parent  150fb782ef80919e1c8cc7d4b047d78f889c7846
> tags: create new sortdict for performance reasons

Queued for stable, thanks.

-- 
Mathematics is the supreme nostalgia of our time.
Augie Fackler - Nov. 21, 2015, 10:06 p.m.
On Thu, Nov 12, 2015 at 01:17:50PM -0800, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1447362964 28800
> #      Thu Nov 12 13:16:04 2015 -0800
> # Node ID c541bc4fc379f76d0cf30354942f78aaddd26f04
> # Parent  150fb782ef80919e1c8cc7d4b047d78f889c7846
> tags: create new sortdict for performance reasons

squinting at this lightly, I suspect sortdict could go away in favor
of collections.OrderedDict now...

Gregory Szorc - Nov. 22, 2015, 12:18 a.m.
On Sat, Nov 21, 2015 at 2:06 PM, Augie Fackler <raf@durin42.com> wrote:

> On Thu, Nov 12, 2015 at 01:17:50PM -0800, Gregory Szorc wrote:
> > # HG changeset patch
> > # User Gregory Szorc <gregory.szorc@gmail.com>
> > # Date 1447362964 28800
> > #      Thu Nov 12 13:16:04 2015 -0800
> > # Node ID c541bc4fc379f76d0cf30354942f78aaddd26f04
> > # Parent  150fb782ef80919e1c8cc7d4b047d78f889c7846
> > tags: create new sortdict for performance reasons
>
> squinting at this lightly, I suspect sortdict could go away in favor
> of collections.OrderedDict now...
>

Requires 2.7 :/


Augie Fackler - Nov. 22, 2015, 12:19 a.m.
On Nov 21, 2015 19:18, "Gregory Szorc" <gregory.szorc@gmail.com> wrote:
>
> On Sat, Nov 21, 2015 at 2:06 PM, Augie Fackler <raf@durin42.com> wrote:
>>
>> On Thu, Nov 12, 2015 at 01:17:50PM -0800, Gregory Szorc wrote:
>> > # HG changeset patch
>> > # User Gregory Szorc <gregory.szorc@gmail.com>
>> > # Date 1447362964 28800
>> > #      Thu Nov 12 13:16:04 2015 -0800
>> > # Node ID c541bc4fc379f76d0cf30354942f78aaddd26f04
>> > # Parent  150fb782ef80919e1c8cc7d4b047d78f889c7846
>> > tags: create new sortdict for performance reasons
>>
>> squinting at this lightly, I suspect sortdict could go away in favor
>> of collections.OrderedDict now...
>
>
> Requires 2.7 :/

Welp.

Yuya Nishihara - Nov. 22, 2015, 3:30 a.m.
On Sat, 21 Nov 2015 19:19:16 -0500, Augie Fackler wrote:
> On Nov 21, 2015 19:18, "Gregory Szorc" <gregory.szorc@gmail.com> wrote:
> >
> > On Sat, Nov 21, 2015 at 2:06 PM, Augie Fackler <raf@durin42.com> wrote:
> >>
> >> On Thu, Nov 12, 2015 at 01:17:50PM -0800, Gregory Szorc wrote:
> >> > # HG changeset patch
> >> > # User Gregory Szorc <gregory.szorc@gmail.com>
> >> > # Date 1447362964 28800
> >> > #      Thu Nov 12 13:16:04 2015 -0800
> >> > # Node ID c541bc4fc379f76d0cf30354942f78aaddd26f04
> >> > # Parent  150fb782ef80919e1c8cc7d4b047d78f889c7846
> >> > tags: create new sortdict for performance reasons
> >>
> >> squinting at this lightly, I suspect sortdict could go away in favor
> >> of collections.OrderedDict now...
> >
> >
> > Requires 2.7 :/
> 
> Welp.

And OrderedDict doesn't change the order of existing items when a key is reassigned:

>>> from collections import OrderedDict
>>> d = OrderedDict([('a', 0), ('b', 1)])
>>> list(d.items())
[('a', 0), ('b', 1)]
>>> d['a'] = 2
>>> list(d.items())
[('a', 2), ('b', 1)]
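
For contrast, here is a minimal sketch of the ordering difference being pointed out. SketchSortDict is a hypothetical, simplified stand-in built from the commit message's description (replacing a key removes it from the key list and appends it again); it is not Mercurial's actual util.sortdict, and order_after_replace is a helper invented for this illustration.

```python
from collections import OrderedDict


class SketchSortDict(dict):
    """Simplified stand-in for the sortdict described in this thread."""

    def __init__(self):
        dict.__init__(self)
        self._list = []  # keys in insertion order

    def __setitem__(self, key, value):
        if key in self:
            self._list.remove(key)  # drop the old position ...
        self._list.append(key)      # ... and re-add at the end
        dict.__setitem__(self, key, value)

    def keys(self):
        return list(self._list)


def order_after_replace(mapping):
    """Insert 'a' then 'b', replace 'a', and report the key order."""
    mapping["a"] = 0
    mapping["b"] = 1
    mapping["a"] = 2
    return list(mapping.keys())


print(order_after_replace(SketchSortDict()))  # ['b', 'a']
print(order_after_replace(OrderedDict()))     # ['a', 'b']
```

Under these assumptions the two types are not drop-in replacements for each other: code that relies on a replaced key moving to the end would see a different order with OrderedDict.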

Patch

diff --git a/mercurial/tags.py b/mercurial/tags.py
--- a/mercurial/tags.py
+++ b/mercurial/tags.py
@@ -220,11 +220,15 @@  def _readtags(ui, repo, lines, fn, recod
     All node ids are binary, not hex.
     '''
     filetags, nodelines = _readtaghist(ui, repo, lines, fn, recode=recode,
                                        calcnodelines=calcnodelines)
+    # util.sortdict().__setitem__ is much slower at replacing than inserting
+    # new entries. The difference can matter if there are thousands of tags.
+    # Create a new sortdict to avoid the performance penalty.
+    newtags = util.sortdict()
     for tag, taghist in filetags.items():
-        filetags[tag] = (taghist[-1], taghist[:-1])
-    return filetags
+        newtags[tag] = (taghist[-1], taghist[:-1])
+    return newtags
 
 def _updatetags(filetags, tagtype, alltags, tagtypes):
     '''Incorporate the tag info read from one file into the two
     dictionaries, alltags and tagtypes, that contain all tag