Patchwork diff performance: re-establish linear runtime performance

login
register
mail settings
Submitter Elmar Bartel
Date April 30, 2020, 1:23 p.m.
Message ID <fd68848f8d5515f9a1af.1588252991@nuc15.leo.org>
Download mbox | patch
Permalink /patch/46252/
State Accepted
Headers show

Comments

Elmar Bartel - April 30, 2020, 1:23 p.m.
# HG changeset patch
# User Elmar Bartel <elb_hg@leo.org>
# Date 1588252205 -7200
#      Thu Apr 30 15:10:05 2020 +0200
# Node ID fd68848f8d5515f9a1afe288787f27fcdb130d42
# Parent  5c1f356b108e07782e07b824c206a87d3a9abcff
Pierre-Yves David - April 30, 2020, 2:25 p.m.
Interresting, did you run performance benchmark on some specific diff 
for comparison ? I would like to integrat this in our benchmark suite.

On  4/30/20 3:23 PM, Elmar Bartel wrote:
> # HG changeset patch
> # User Elmar Bartel <elb_hg@leo.org>
> # Date 1588252205 -7200
> #      Thu Apr 30 15:10:05 2020 +0200
> # Node ID fd68848f8d5515f9a1afe288787f27fcdb130d42
> # Parent  5c1f356b108e07782e07b824c206a87d3a9abcff
> diff performance: re-establish linear runtime performance
> 
> The previous method with sum() and list() creates a new list object
> for every hunk. Then sum() is used to flatten out this sequence of
> lists.  The sum() function is not "lazy", but creates a new list object
> for every "+" operation and so this code had quadratic runtime behaviour.
> 
> diff -r 5c1f356b108e -r fd68848f8d55 mercurial/patch.py
> --- a/mercurial/patch.py	Fri Apr 24 12:37:43 2020 -0700
> +++ b/mercurial/patch.py	Thu Apr 30 15:10:05 2020 +0200
> @@ -2558,7 +2558,7 @@ def diff(
>                   fctx2 is not None
>               ), b'fctx2 unexpectly None in diff hunks filtering'
>               hunks = hunksfilterfn(fctx2, hunks)
> -        text = b''.join(sum((list(hlines) for hrange, hlines in hunks), []))
> +        text = b''.join((b''.join(hlines) for hrange, hlines in hunks))
>           if hdr and (text or len(hdr) > 1):
>               yield b'\n'.join(hdr) + b'\n'
>           if text:
> 
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
Yuya Nishihara - April 30, 2020, 2:41 p.m.
On Thu, 30 Apr 2020 16:25:09 +0200, Pierre-Yves David wrote:
> > The previous method with sum() and list() creates a new list object
> > for every hunk. Then sum() is used to flatten out this sequence of
> > lists.  The sum() function is not "lazy", but creates a new list object
> > for every "+" operation and so this code had quadratic runtime behaviour.
> > 
> > diff -r 5c1f356b108e -r fd68848f8d55 mercurial/patch.py
> > --- a/mercurial/patch.py	Fri Apr 24 12:37:43 2020 -0700
> > +++ b/mercurial/patch.py	Thu Apr 30 15:10:05 2020 +0200
> > @@ -2558,7 +2558,7 @@ def diff(
> >                   fctx2 is not None
> >               ), b'fctx2 unexpectly None in diff hunks filtering'
> >               hunks = hunksfilterfn(fctx2, hunks)
> > -        text = b''.join(sum((list(hlines) for hrange, hlines in hunks), []))
> > +        text = b''.join((b''.join(hlines) for hrange, hlines in hunks))

If the number of intermediate objects matters, creating a list of bytes might
be slightly faster.

  lines = []
  for hr, hl in hunks:
      lines.extend(hl)
  b''.join(lines)
Elmar Bartel - April 30, 2020, 6:55 p.m.
Hello Pierre,

> Interresting, did you run performance benchmark on some specific
> diff for comparison ?
Yes. ;-)

A little bit of history why this line got my attention:
I have to maintain a repo with a very large (800.000 lines) file where
a specific changeset touched one third of all lines. So this created a
long list if hunks and the diff on this specific changeset did not
return in a reasonable time (it took several minutes), so I
inverstigated what the problem was.
The funny thing is: this behaviour did not show up in a very old
version of hg which is used on the machine with the central repository. 
But it did show up on cloned repos where newer versions of hg where
used. This specific line was introduced with the change 92714858dd3e.

I've written a small shell/python script (attached) to reproduce this
kind of performance problem. 

> I would like to integrat this in our benchmark
> suite.
When the attached script helps to do so, I'd be glad.

Yours,
Elmar.
Elmar Bartel - April 30, 2020, 7:10 p.m.
Hello Yuya,

> If the number of intermediate objects matters, creating a list of
> bytes might be slightly faster.
> 
>   lines = []
>   for hr, hl in hunks:
>       lines.extend(hl)
>   b''.join(lines)
Might be!
I've not investigated this variant.
But I've taken a look at this variant:

  text = b''.join(line for hrange, hlines in hunks for line in hlines)

The runtime did not show any significant change.
And I'd assume the same for the variant with the explicit loop.
So its more a question of taste - and  then I'd prefer one of the
two one-liners.

Yours,
Elmar.
Yuya Nishihara - May 1, 2020, 11:32 a.m.
On Thu, 30 Apr 2020 21:10:44 +0200, Elmar Bartel wrote:
> > If the number of intermediate objects matters, creating a list of
> > bytes might be slightly faster.
> > 
> >   lines = []
> >   for hr, hl in hunks:
> >       lines.extend(hl)
> >   b''.join(lines)
> Might be!
> I've not investigated this variant.
> But I've taken a look at this variant:
> 
>   text = b''.join(line for hrange, hlines in hunks for line in hlines)
> 
> The runtime did not show any significant change.
> And I'd assume the same for the variant with the explicit loop.
> So its more a question of taste - and  then I'd prefer one of the
> two one-liners.

Okay. Since it's close to the release and this patch looks strictly better
than the original implementation, I've queued it for stable. Thanks!

Next time, please include some performance number in commit message.

Patch

diff performance: re-establish linear runtime performance

The previous method with sum() and list() creates a new list object
for every hunk. Then sum() is used to flatten out this sequence of
lists.  The sum() function is not "lazy", but creates a new list object
for every "+" operation and so this code had quadratic runtime behaviour.

diff -r 5c1f356b108e -r fd68848f8d55 mercurial/patch.py
--- a/mercurial/patch.py	Fri Apr 24 12:37:43 2020 -0700
+++ b/mercurial/patch.py	Thu Apr 30 15:10:05 2020 +0200
@@ -2558,7 +2558,7 @@  def diff(
                 fctx2 is not None
             ), b'fctx2 unexpectly None in diff hunks filtering'
             hunks = hunksfilterfn(fctx2, hunks)
-        text = b''.join(sum((list(hlines) for hrange, hlines in hunks), []))
+        text = b''.join((b''.join(hlines) for hrange, hlines in hunks))
         if hdr and (text or len(hdr) > 1):
             yield b'\n'.join(hdr) + b'\n'
         if text: