Patchwork [7,of,7,v2] bdiff: give slight preference to removing trailing lines

login
register
mail settings
Submitter Mads Kiilerich
Date Nov. 15, 2016, 8:57 p.m.
Message ID <efa30442953a6e1f096f.1479243428@madski>
Download mbox | patch
Permalink /patch/17594/
State Accepted
Headers show

Comments

Mads Kiilerich - Nov. 15, 2016, 8:57 p.m.
# HG changeset patch
# User Mads Kiilerich <madski@unity3d.com>
# Date 1479243409 -3600
#      Tue Nov 15 21:56:49 2016 +0100
# Node ID efa30442953a6e1f096f8833c4ee8047375600d6
# Parent  e022e995415bd48ffe6d4d03dc97b99dd547cde1
bdiff: give slight preference to removing trailing lines

[This change could be folded into the previous changeset to minimize the repo
churn ...]

Similar to the previous change, introduce an exception to the general
preference for matches in the middle of bdiff ranges: If the best match on the
B side starts at the beginning of the bdiff range, don't aim for the
middle-most A side match but for the earliest.

New (later) matches on the A side will only be considered better if the
corresponding match on the B side *not* is at the beginning of the range.
Thus, if the best (middle-most) match on the B side turns out to be at the
beginning of the range, the earliest match on the A side will be used.

The bundle size for 4.0 (hg bundle --base null -r 4.0 x.hg) happens to go from
22807275 to 22808120 bytes - a 0.004% increase.
Augie Fackler - Nov. 17, 2016, 5:42 p.m.
On Tue, Nov 15, 2016 at 09:57:08PM +0100, Mads Kiilerich wrote:
> # HG changeset patch
> # User Mads Kiilerich <madski@unity3d.com>
> # Date 1479243409 -3600
> #      Tue Nov 15 21:56:49 2016 +0100
> # Node ID efa30442953a6e1f096f8833c4ee8047375600d6
> # Parent  e022e995415bd48ffe6d4d03dc97b99dd547cde1
> bdiff: give slight preference to removing trailing lines

My own cursory perfbdiff runs suggest this is a perf wash (using
`perfbdiff -m 3041e4d59df2` in the mozilla repo). Queued. Thanks!

>
> [This change could be folded into the previous changeset to minimize the repo
> churn ...]
>
> Similar to the previous change, introduce an exception to the general
> preference for matches in the middle of bdiff ranges: If the best match on the
> B side starts at the beginning of the bdiff range, don't aim for the
> middle-most A side match but for the earliest.
>
> New (later) matches on the A side will only be considered better if the
> corresponding match on the B side *not* is at the beginning of the range.
> Thus, if the best (middle-most) match on the B side turns out to be at the
> beginning of the range, the earliest match on the A side will be used.
>
> The bundle size for 4.0 (hg bundle --base null -r 4.0 x.hg) happens to go from
> 22807275 to 22808120 bytes - a 0.004% increase.
>
> diff --git a/mercurial/bdiff.c b/mercurial/bdiff.c
> --- a/mercurial/bdiff.c
> +++ b/mercurial/bdiff.c
> @@ -184,7 +184,7 @@ static int longest_match(struct bdiff_li
>                               mj = j;
>                               mk = k;
>                       } else if (k == mk) {
> -				if (i > mi && i <= half) {
> +				if (i > mi && i <= half && j > b1) {
>                                       /* same match but closer to half */
>                                       mi = i;
>                                       mj = j;
> diff --git a/tests/test-bdiff.py b/tests/test-bdiff.py
> --- a/tests/test-bdiff.py
> +++ b/tests/test-bdiff.py
> @@ -88,7 +88,7 @@ print("Diff 1 to 3 lines - preference fo
>  showdiff('a\n', 'a\n' * 3)
>  print("Diff 1 to 5 lines - preference for appending:")
>  showdiff('a\n', 'a\n' * 5)
> -print("Diff 3 to 1 lines - preference for balanced recursion:")
> +print("Diff 3 to 1 lines - preference for removing trailing lines:")
>  showdiff('a\n' * 3, 'a\n')
> -print("Diff 5 to 1 lines - preference for balanced recursion:")
> +print("Diff 5 to 1 lines - preference for removing trailing lines:")
>  showdiff('a\n' * 5, 'a\n')
> diff --git a/tests/test-bdiff.py.out b/tests/test-bdiff.py.out
> --- a/tests/test-bdiff.py.out
> +++ b/tests/test-bdiff.py.out
> @@ -68,17 +68,15 @@ showdiff(
>    'a\na\na\na\na\n'):
>   'a\n'
>   2 2 '' -> 'a\na\na\na\n'
> -Diff 3 to 1 lines - preference for balanced recursion:
> +Diff 3 to 1 lines - preference for removing trailing lines:
>  showdiff(
>    'a\na\na\n',
>    'a\n'):
> - 0 2 'a\n' -> ''
>   'a\n'
> - 4 6 'a\n' -> ''
> -Diff 5 to 1 lines - preference for balanced recursion:
> + 2 6 'a\na\n' -> ''
> +Diff 5 to 1 lines - preference for removing trailing lines:
>  showdiff(
>    'a\na\na\na\na\n',
>    'a\n'):
> - 0 4 'a\na\n' -> ''
>   'a\n'
> - 6 10 'a\na\n' -> ''
> + 2 10 'a\na\na\na\n' -> ''
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Jun Wu - Nov. 24, 2016, 5:52 p.m.
Excerpts from Augie Fackler's message of 2016-11-17 12:42:26 -0500:
> My own cursory perfbdiff runs suggest this is a perf wash (using
> `perfbdiff -m 3041e4d59df2` in the mozilla repo). Queued. Thanks!

I'd mention this series changes the behavior of the diff output. The
difference was caught by fastannotate test.

See the below table (old: e1d6aa0e4c3a, new: 8836f13e3c5b):

   a | b | old | new
  --------------------
   a | a |  a  | -a
   a | z | +z  |  a
   a | a |  a  | +z
     |   | -a  |  a
  --------------------
   a | a |     a
   a | a |     a
   a |   |    -a

I think we would always prefer putting deletions at the end, to be consistent.
Gregory Szorc - Nov. 24, 2016, 7:22 p.m.
On Thu, Nov 24, 2016 at 9:52 AM, Jun Wu <quark@fb.com> wrote:

> Excerpts from Augie Fackler's message of 2016-11-17 12:42:26 -0500:
> > My own cursory perfbdiff runs suggest this is a perf wash (using
> > `perfbdiff -m 3041e4d59df2` in the mozilla repo). Queued. Thanks!
>
> I'd mention this series changes the behavior of the diff output. The
> difference was caught by fastannotate test.
>
> See the below table (old: e1d6aa0e4c3a, new: 8836f13e3c5b):
>
>    a | b | old | new
>   --------------------
>    a | a |  a  | -a
>    a | z | +z  |  a
>    a | a |  a  | +z
>      |   | -a  |  a
>   --------------------
>    a | a |     a
>    a | a |     a
>    a |   |    -a
>
> I think we would always prefer putting deletions at the end, to be
> consistent.
>

And herein lies an important distinction: we currently rely on bdiff for
both internal revlog and changegroup delta storage and user presentation
(we iterate bdiff blocks to drive the unified diff printing in
mdiff._unidiff). These are obviously vastly different use cases, with
internal used wanting to optimize for minimal resource consumption in many
cases and user-facing wanting to optimize for human readability.

We can preserve the format of the user-facing diffs by making the bdiff
behavior conditional on various flags or by post-processing the bdiff
output.
Mike Hommey - Nov. 24, 2016, 9:24 p.m.
On Thu, Nov 24, 2016 at 05:52:29PM +0000, Jun Wu wrote:
> Excerpts from Augie Fackler's message of 2016-11-17 12:42:26 -0500:
> > My own cursory perfbdiff runs suggest this is a perf wash (using
> > `perfbdiff -m 3041e4d59df2` in the mozilla repo). Queued. Thanks!
> 
> I'd mention this series changes the behavior of the diff output. The
> difference was caught by fastannotate test.
> 
> See the below table (old: e1d6aa0e4c3a, new: 8836f13e3c5b):
> 
>    a | b | old | new
>   --------------------
>    a | a |  a  | -a
>    a | z | +z  |  a
>    a | a |  a  | +z
>      |   | -a  |  a
>   --------------------
>    a | a |     a
>    a | a |     a
>    a |   |    -a
> 
> I think we would always prefer putting deletions at the end, to be consistent.

Wouldn't
 a
-a
+z
 a

Be preferable to both old and new? That's what plain diff does, by the
way.

Mike
Mads Kiilerich - Nov. 25, 2016, 2:58 p.m.
On 11/24/2016 06:52 PM, Jun Wu wrote:
> Excerpts from Augie Fackler's message of 2016-11-17 12:42:26 -0500:
>> My own cursory perfbdiff runs suggest this is a perf wash (using
>> `perfbdiff -m 3041e4d59df2` in the mozilla repo). Queued. Thanks!
> I'd mention this series changes the behavior of the diff output. The
> difference was caught by fastannotate test.
>
> See the below table (old: e1d6aa0e4c3a, new: 8836f13e3c5b):
>
>     a | b | old | new
>    --------------------
>     a | a |  a  | -a
>     a | z | +z  |  a
>     a | a |  a  | +z
>       |   | -a  |  a
>    --------------------
>     a | a |     a
>     a | a |     a
>     a |   |    -a
>
> I think we would always prefer putting deletions at the end, to be consistent.


I agree that this end result would be nice.

(My patches "bdiff: early pruning of common suffix before doing 
expensive computations" and "bdiff: early pruning of common prefix 
before doing expensive computations" happens to give the obvious "good" 
result in this case ... but is no general solution to the problem.)

Stating the obvious context:
The algorithms and tweaks for making "good" diffs are just heuristics 
and can never be perfect. A counter example doesn't necessarily prove 
anything. But also, it might be worth it to add a tweak...

An idea, described informally: While the current recursive algorithm 
works "best" by selecting middle-most longest common substrings, I guess 
it could make sense to have a post processing step that shifts common 
substrings to the first occurrence in the previous diff chunk. Also, if 
a previous or later non-common range is empty (because it is an 
add/remove), the matching range can be "shifted", perhaps with 
preference of not starting with an "empty" line but prefering the lowest 
amount of leading spaces.

/Mads
Jun Wu - Nov. 25, 2016, 3:07 p.m.
Excerpts from Mads Kiilerich's message of 2016-11-25 15:58:28 +0100:
> On 11/24/2016 06:52 PM, Jun Wu wrote:
> > Excerpts from Augie Fackler's message of 2016-11-17 12:42:26 -0500:
> >> My own cursory perfbdiff runs suggest this is a perf wash (using
> >> `perfbdiff -m 3041e4d59df2` in the mozilla repo). Queued. Thanks!
> > I'd mention this series changes the behavior of the diff output. The
> > difference was caught by fastannotate test.
> >
> > See the below table (old: e1d6aa0e4c3a, new: 8836f13e3c5b):
> >
> >     a | b | old | new
> >    --------------------
> >     a | a |  a  | -a
> >     a | z | +z  |  a
> >     a | a |  a  | +z
> >       |   | -a  |  a
> >    --------------------
> >     a | a |     a
> >     a | a |     a
> >     a |   |    -a
> >
> > I think we would always prefer putting deletions at the end, to be consistent.
> 
> 
> I agree that this end result would be nice.
> 
> (My patches "bdiff: early pruning of common suffix before doing 
> expensive computations" and "bdiff: early pruning of common prefix 
> before doing expensive computations" happens to give the obvious "good" 
> result in this case ... but is no general solution to the problem.)

If you have no easy solution for this. I'd suggest ignore it for now and
probably postpone other ideas about bdiff. I also have some "directional"
ideas on diff algrotihm. I'll try to start the discussion in a few weeks.

> Stating the obvious context:
> The algorithms and tweaks for making "good" diffs are just heuristics 
> and can never be perfect. A counter example doesn't necessarily prove 
> anything. But also, it might be worth it to add a tweak...
> 
> An idea, described informally: While the current recursive algorithm 
> works "best" by selecting middle-most longest common substrings, I guess 
> it could make sense to have a post processing step that shifts common 
> substrings to the first occurrence in the previous diff chunk. Also, if 
> a previous or later non-common range is empty (because it is an 
> add/remove), the matching range can be "shifted", perhaps with 
> preference of not starting with an "empty" line but prefering the lowest 
> amount of leading spaces.
> 
> /Mads
Jun Wu - Nov. 28, 2016, 12:59 a.m.
I think that's "shift_boundaries" in GNU diffutils' "src/analyze.c":

  /* Adjust inserts/deletes of identical lines to join changes
     as much as possible.
     ....
  static void
  shift_boundaries (struct file_data filevec[])

Git's xdiffi.c does something similar.

Excerpts from Mike Hommey's message of 2016-11-25 06:24:37 +0900:
> On Thu, Nov 24, 2016 at 05:52:29PM +0000, Jun Wu wrote:
> > Excerpts from Augie Fackler's message of 2016-11-17 12:42:26 -0500:
> > > My own cursory perfbdiff runs suggest this is a perf wash (using
> > > `perfbdiff -m 3041e4d59df2` in the mozilla repo). Queued. Thanks!
> > 
> > I'd mention this series changes the behavior of the diff output. The
> > difference was caught by fastannotate test.
> > 
> > See the below table (old: e1d6aa0e4c3a, new: 8836f13e3c5b):
> > 
> >    a | b | old | new
> >   --------------------
> >    a | a |  a  | -a
> >    a | z | +z  |  a
> >    a | a |  a  | +z
> >      |   | -a  |  a
> >   --------------------
> >    a | a |     a
> >    a | a |     a
> >    a |   |    -a
> > 
> > I think we would always prefer putting deletions at the end, to be consistent.
> 
> Wouldn't
>  a
> -a
> +z
>  a
> 
> Be preferable to both old and new? That's what plain diff does, by the
> way.
> 
> Mike

Patch

diff --git a/mercurial/bdiff.c b/mercurial/bdiff.c
--- a/mercurial/bdiff.c
+++ b/mercurial/bdiff.c
@@ -184,7 +184,7 @@  static int longest_match(struct bdiff_li
 				mj = j;
 				mk = k;
 			} else if (k == mk) {
-				if (i > mi && i <= half) {
+				if (i > mi && i <= half && j > b1) {
 					/* same match but closer to half */
 					mi = i;
 					mj = j;
diff --git a/tests/test-bdiff.py b/tests/test-bdiff.py
--- a/tests/test-bdiff.py
+++ b/tests/test-bdiff.py
@@ -88,7 +88,7 @@  print("Diff 1 to 3 lines - preference fo
 showdiff('a\n', 'a\n' * 3)
 print("Diff 1 to 5 lines - preference for appending:")
 showdiff('a\n', 'a\n' * 5)
-print("Diff 3 to 1 lines - preference for balanced recursion:")
+print("Diff 3 to 1 lines - preference for removing trailing lines:")
 showdiff('a\n' * 3, 'a\n')
-print("Diff 5 to 1 lines - preference for balanced recursion:")
+print("Diff 5 to 1 lines - preference for removing trailing lines:")
 showdiff('a\n' * 5, 'a\n')
diff --git a/tests/test-bdiff.py.out b/tests/test-bdiff.py.out
--- a/tests/test-bdiff.py.out
+++ b/tests/test-bdiff.py.out
@@ -68,17 +68,15 @@  showdiff(
   'a\na\na\na\na\n'):
  'a\n'
  2 2 '' -> 'a\na\na\na\n'
-Diff 3 to 1 lines - preference for balanced recursion:
+Diff 3 to 1 lines - preference for removing trailing lines:
 showdiff(
   'a\na\na\n',
   'a\n'):
- 0 2 'a\n' -> ''
  'a\n'
- 4 6 'a\n' -> ''
-Diff 5 to 1 lines - preference for balanced recursion:
+ 2 6 'a\na\n' -> ''
+Diff 5 to 1 lines - preference for removing trailing lines:
 showdiff(
   'a\na\na\na\na\n',
   'a\n'):
- 0 4 'a\na\n' -> ''
  'a\n'
- 6 10 'a\na\n' -> ''
+ 2 10 'a\na\na\na\n' -> ''