Patchwork diff: use a threshold on similarity index before using word-diff (issue5965)

login
register
mail settings
Submitter Denis Laxalde
Date Aug. 21, 2018, 12:10 p.m.
Message ID <c43df6ff42d26163d19e.1534853433@steppe.logilab.fr>
Download mbox | patch
Permalink /patch/33935/
State New
Headers show

Comments

Denis Laxalde - Aug. 21, 2018, 12:10 p.m.
# HG changeset patch
# User Denis Laxalde <denis.laxalde@logilab.fr>
# Date 1534853203 -7200
#      Tue Aug 21 14:06:43 2018 +0200
# Node ID c43df6ff42d26163d19e99e15a3cf3094020d822
# Parent  c62184c6299c09d2e8e7be340f9aee138229cb86
# Available At http://hg.logilab.org/users/dlaxalde/hg
#              hg pull http://hg.logilab.org/users/dlaxalde/hg -r c43df6ff42d2
# EXP-Topic issue5965
diff: use a threshold on similarity index before using word-diff (issue5965)

The threshold is chosen quite arbitrarily with a value of 0.5. It does
not change the results of test-diff-color.t whereas higher values (e.g.
0.6) would. Looking at what this produces on some changesets in recent
history (e.g. 037debbf869c or 7acec9408e1c), this significantly improves
Yuya Nishihara - Aug. 21, 2018, 1:20 p.m.
On Tue, 21 Aug 2018 14:10:33 +0200, Denis Laxalde wrote:
> # HG changeset patch
> # User Denis Laxalde <denis.laxalde@logilab.fr>
> # Date 1534853203 -7200
> #      Tue Aug 21 14:06:43 2018 +0200
> # Node ID c43df6ff42d26163d19e99e15a3cf3094020d822
> # Parent  c62184c6299c09d2e8e7be340f9aee138229cb86
> # Available At http://hg.logilab.org/users/dlaxalde/hg
> #              hg pull http://hg.logilab.org/users/dlaxalde/hg -r c43df6ff42d2
> # EXP-Topic issue5965
> diff: use a threshold on similarity index before using word-diff (issue5965)
> 
> The threshold is chosen quite arbitrarily with a value of 0.5. It does
> not change the results of test-diff-color.t whereas higher values (e.g.
> 0.6) would. Looking at what this produces on some changesets in recent
> history (e.g. 037debbf869c or 7acec9408e1c), this significantly improves
> diff readability.
> 
> Similarity index is computed using difflib.SequenceMatcher's ratio()
> method; this is documented as being "expensive", but other faster methods
> (that compute an upper bound value) do not give good results.
> Nevertheless, since we compute this ratio on each hunk which are usually
> small, this might not be problematic in most cases. Also, as we'd
> short-circuit computation of inline colors for those hunks that are not
> similar enough, this "expensive" ratio computation might also be
> compensated.

Can you test this against a large BLOB-ish diff (such as machine-generated
10k-line JSON, a binary in Intel HEX format, etc.)? Last time I faced that,
the original difflib-based algorithm was painfully slow (~100s-ish to yield
one hunk), which made me think the word-diff should never be turned on by
default.

Also, for me, it was a bad UX that the word-diff was disabled at an arbitrary
threshold. I needed to guess reason why a hunk wasn't highlighted.
Denis Laxalde - Aug. 21, 2018, 3:11 p.m.
Yuya Nishihara a écrit :
> On Tue, 21 Aug 2018 14:10:33 +0200, Denis Laxalde wrote:
>> # HG changeset patch
>> # User Denis Laxalde <denis.laxalde@logilab.fr>
>> # Date 1534853203 -7200
>> #      Tue Aug 21 14:06:43 2018 +0200
>> # Node ID c43df6ff42d26163d19e99e15a3cf3094020d822
>> # Parent  c62184c6299c09d2e8e7be340f9aee138229cb86
>> # Available At http://hg.logilab.org/users/dlaxalde/hg
>> #              hg pull http://hg.logilab.org/users/dlaxalde/hg -r c43df6ff42d2
>> # EXP-Topic issue5965
>> diff: use a threshold on similarity index before using word-diff (issue5965)
>>
>> The threshold is chosen quite arbitrarily with a value of 0.5. It does
>> not change the results of test-diff-color.t whereas higher values (e.g.
>> 0.6) would. Looking at what this produces on some changesets in recent
>> history (e.g. 037debbf869c or 7acec9408e1c), this significantly improves
>> diff readability.
>>
>> Similarity index is computed using difflib.SequenceMatcher's ratio()
>> method; this is documented as being "expensive", but other faster methods
>> (that compute an upper bound value) do not give good results.
>> Nevertheless, since we compute this ratio on each hunk which are usually
>> small, this might not be problematic in most cases. Also, as we'd
>> short-circuit computation of inline colors for those hunks that are not
>> similar enough, this "expensive" ratio computation might also be
>> compensated.
> 
> Can you test this against a large BLOB-ish diff (such as machine-generated
> 10k-line JSON, a binary in Intel HEX format, etc.)? Last time I faced that,
> the original difflib-based algorithm was painfully slow (~100s-ish to yield
> one hunk), which made me think the word-diff should never be turned on by
> default.

I've set up a test repo with some JSON at
https://bitbucket.org/dlax/hg-worddiff-tests. As far as I can tell,
there's no significant difference when diffing the last changeset; it's
even a bit faster with the similarity ratio patch. Is this what you had
in mind?


> Also, for me, it was a bad UX that the word-diff was disabled at an arbitrary
> threshold. I needed to guess reason why a hunk wasn't highlighted.

I don't get this. Before this patch, word-diff is also on side of the
hunk is empty IIUC. Do you mean we should make the threshold configurable?
Yuya Nishihara - Aug. 22, 2018, 12:35 p.m.
On Tue, 21 Aug 2018 17:11:51 +0200, Denis Laxalde wrote:
> Yuya Nishihara a écrit :
> > On Tue, 21 Aug 2018 14:10:33 +0200, Denis Laxalde wrote:
> >> # HG changeset patch
> >> # User Denis Laxalde <denis.laxalde@logilab.fr>
> >> # Date 1534853203 -7200
> >> #      Tue Aug 21 14:06:43 2018 +0200
> >> # Node ID c43df6ff42d26163d19e99e15a3cf3094020d822
> >> # Parent  c62184c6299c09d2e8e7be340f9aee138229cb86
> >> # Available At http://hg.logilab.org/users/dlaxalde/hg
> >> #              hg pull http://hg.logilab.org/users/dlaxalde/hg -r c43df6ff42d2
> >> # EXP-Topic issue5965
> >> diff: use a threshold on similarity index before using word-diff (issue5965)
> >>
> >> The threshold is chosen quite arbitrarily with a value of 0.5. It does
> >> not change the results of test-diff-color.t whereas higher values (e.g.
> >> 0.6) would. Looking at what this produces on some changesets in recent
> >> history (e.g. 037debbf869c or 7acec9408e1c), this significantly improves
> >> diff readability.
> >>
> >> Similarity index is computed using difflib.SequenceMatcher's ratio()
> >> method; this is documented as being "expensive", but other faster methods
> >> (that compute an upper bound value) do not give good results.
> >> Nevertheless, since we compute this ratio on each hunk which are usually
> >> small, this might not be problematic in most cases. Also, as we'd
> >> short-circuit computation of inline colors for those hunks that are not
> >> similar enough, this "expensive" ratio computation might also be
> >> compensated.
> > 
> > Can you test this against a large BLOB-ish diff (such as machine-generated
> > 10k-line JSON, a binary in Intel HEX format, etc.)? Last time I faced that,
> > the original difflib-based algorithm was painfully slow (~100s-ish to yield
> > one hunk), which made me think the word-diff should never be turned on by
> > default.
> 
> I've set up a test repo with some JSON at
> https://bitbucket.org/dlax/hg-worddiff-tests. As far as I can tell,
> there's no significant difference when diffing the last changeset;

Thanks, but it looks cheaper to compute than the stuff I had at work. I'll
try to collect some number if I get a chance.

> it's even a bit faster with the similarity ratio patch.

The overall result may be faster, but my concern is a few sec stall with
no output.

> > Also, for me, it was a bad UX that the word-diff was disabled at an arbitrary
> > threshold. I needed to guess reason why a hunk wasn't highlighted.
> 
> I don't get this. Before this patch, word-diff is also on side of the
> hunk is empty IIUC.

Yup. I don't like it, but that's easy to spot.

> Do you mean we should make the threshold configurable?

Not really, but if I can set the threshold to 0, I'm fine with that.
Yuya Nishihara - Aug. 23, 2018, 12:48 p.m.
On Wed, 22 Aug 2018 21:35:31 +0900, Yuya Nishihara wrote:
> On Tue, 21 Aug 2018 17:11:51 +0200, Denis Laxalde wrote:
> > Yuya Nishihara a écrit :
> > > On Tue, 21 Aug 2018 14:10:33 +0200, Denis Laxalde wrote:
> > >> # HG changeset patch
> > >> # User Denis Laxalde <denis.laxalde@logilab.fr>
> > >> # Date 1534853203 -7200
> > >> #      Tue Aug 21 14:06:43 2018 +0200
> > >> # Node ID c43df6ff42d26163d19e99e15a3cf3094020d822
> > >> # Parent  c62184c6299c09d2e8e7be340f9aee138229cb86
> > >> # Available At http://hg.logilab.org/users/dlaxalde/hg
> > >> #              hg pull http://hg.logilab.org/users/dlaxalde/hg -r c43df6ff42d2
> > >> # EXP-Topic issue5965
> > >> diff: use a threshold on similarity index before using word-diff (issue5965)
> > >>
> > >> The threshold is chosen quite arbitrarily with a value of 0.5. It does
> > >> not change the results of test-diff-color.t whereas higher values (e.g.
> > >> 0.6) would. Looking at what this produces on some changesets in recent
> > >> history (e.g. 037debbf869c or 7acec9408e1c), this significantly improves
> > >> diff readability.
> > >>
> > >> Similarity index is computed using difflib.SequenceMatcher's ratio()
> > >> method; this is documented as being "expensive", but other faster methods
> > >> (that compute an upper bound value) do not give good results.
> > >> Nevertheless, since we compute this ratio on each hunk which are usually
> > >> small, this might not be problematic in most cases. Also, as we'd
> > >> short-circuit computation of inline colors for those hunks that are not
> > >> similar enough, this "expensive" ratio computation might also be
> > >> compensated.
> > > 
> > > Can you test this against a large BLOB-ish diff (such as machine-generated
> > > 10k-line JSON, a binary in Intel HEX format, etc.)? Last time I faced that,
> > > the original difflib-based algorithm was painfully slow (~100s-ish to yield
> > > one hunk), which made me think the word-diff should never be turned on by
> > > default.
> > 
> > I've set up a test repo with some JSON at
> > https://bitbucket.org/dlax/hg-worddiff-tests. As far as I can tell,
> > there's no significant difference when diffing the last changeset;
> 
> Thanks, but it looks cheaper to compute than the stuff I had at work. I'll
> try to collect some number if I get a chance.

  $ hg diff -c REV --color=always --config diff.word-diff=true --time > /dev/null
  (orig) 1.250sec
  (new)  1259.490sec

It's an ASCII-fied FPGA image (called tabular text file), containing ~320k
decimal numbers plus commas (so ~1000k words in our word-diff.) And there
are some large hunks as it is a diff of two similar BLOBs split into chunks
per N bytes.
Yuya Nishihara - Aug. 24, 2018, 2:54 p.m.
On Fri, 24 Aug 2018 10:05:14 +0200, Boris FELD wrote:
> On 23/08/2018 14:48, Yuya Nishihara wrote:
> > On Wed, 22 Aug 2018 21:35:31 +0900, Yuya Nishihara wrote:
> >> On Tue, 21 Aug 2018 17:11:51 +0200, Denis Laxalde wrote:
> >>> Yuya Nishihara a écrit :
> >>>> On Tue, 21 Aug 2018 14:10:33 +0200, Denis Laxalde wrote:
> >>>>> # HG changeset patch
> >>>>> # User Denis Laxalde <denis.laxalde@logilab.fr>
> >>>>> # Date 1534853203 -7200
> >>>>> #      Tue Aug 21 14:06:43 2018 +0200
> >>>>> # Node ID c43df6ff42d26163d19e99e15a3cf3094020d822
> >>>>> # Parent  c62184c6299c09d2e8e7be340f9aee138229cb86
> >>>>> # Available At http://hg.logilab.org/users/dlaxalde/hg
> >>>>> #              hg pull http://hg.logilab.org/users/dlaxalde/hg -r c43df6ff42d2
> >>>>> # EXP-Topic issue5965
> >>>>> diff: use a threshold on similarity index before using word-diff (issue5965)
> >>>>>
> >>>>> The threshold is chosen quite arbitrarily with a value of 0.5. It does
> >>>>> not change the results of test-diff-color.t whereas higher values (e.g.
> >>>>> 0.6) would. Looking at what this produces on some changesets in recent
> >>>>> history (e.g. 037debbf869c or 7acec9408e1c), this significantly improves
> >>>>> diff readability.
> >>>>>
> >>>>> Similarity index is computed using difflib.SequenceMatcher's ratio()
> >>>>> method; this is documented as being "expensive", but other faster methods
> >>>>> (that compute an upper bound value) do not give good results.
> >>>>> Nevertheless, since we compute this ratio on each hunk which are usually
> >>>>> small, this might not be problematic in most cases. Also, as we'd
> >>>>> short-circuit computation of inline colors for those hunks that are not
> >>>>> similar enough, this "expensive" ratio computation might also be
> >>>>> compensated.
> >>>> Can you test this against a large BLOB-ish diff (such as machine-generated
> >>>> 10k-line JSON, a binary in Intel HEX format, etc.)? Last time I faced that,
> >>>> the original difflib-based algorithm was painfully slow (~100s-ish to yield
> >>>> one hunk), which made me think the word-diff should never be turned on by
> >>>> default.
> >>> I've set up a test repo with some JSON at
> >>> https://bitbucket.org/dlax/hg-worddiff-tests. As far as I can tell,
> >>> there's no significant difference when diffing the last changeset;
> >> Thanks, but it looks cheaper to compute than the stuff I had at work. I'll
> >> try to collect some number if I get a chance.
> >    $ hg diff -c REV --color=always --config diff.word-diff=true --time > /dev/null
> >    (orig) 1.250sec
> >    (new)  1259.490sec
> >
> > It's an ASCII-fied FPGA image (called tabular text file), containing ~320k
> > decimal numbers plus commas (so ~1000k words in our word-diff.) And there
> > are some large hunks as it is a diff of two similar BLOBs split into chunks
> > per N bytes.
> These files seems to have interesting characteristics that would be 
> useful for performance testing.
> 
> Would it be possible to get examples or redacted versions of such files?

(CC the list again)

It's a closed binary, but you can see how slow the difflib is by using random
data.

  $ dd if=/dev/urandom bs=1k count=100 | hexdump -v -e '16/1 "%3u," "\n"'

IIRC, the computation cost of difflib is more sensitive to input data than
Mercurial's bdiff.
Yuya Nishihara - Aug. 24, 2018, 3:10 p.m.
On Fri, 24 Aug 2018 23:54:56 +0900, Yuya Nishihara wrote:
> > > It's an ASCII-fied FPGA image (called tabular text file), containing ~320k
> > > decimal numbers plus commas (so ~1000k words in our word-diff.) And there
> > > are some large hunks as it is a diff of two similar BLOBs split into chunks
> > > per N bytes.
> > These files seems to have interesting characteristics that would be 
> > useful for performance testing.
> > 
> > Would it be possible to get examples or redacted versions of such files?
> 
> (CC the list again)
> 
> It's a closed binary, but you can see how slow the difflib is by using random
> data.
> 
>   $ dd if=/dev/urandom bs=1k count=100 | hexdump -v -e '16/1 "%3u," "\n"'
> 
> IIRC, the computation cost of difflib is more sensitive to input data than
                                                             ^^^^^^^^^^
To be clear, I meant the size of the input data. In this case, len(al) and
len(bl).

> Mercurial's bdiff.

Patch

diff readability.

Similarity index is computed using difflib.SequenceMatcher's ratio()
method; this is documented as being "expensive", but other faster methods
(that compute an upper bound value) do not give good results.
Nevertheless, since we compute this ratio on each hunk which are usually
small, this might not be problematic in most cases. Also, as we'd
short-circuit computation of inline colors for those hunks that are not
similar enough, this "expensive" ratio computation might also be
compensated.

diff --git a/mercurial/patch.py b/mercurial/patch.py
--- a/mercurial/patch.py
+++ b/mercurial/patch.py
@@ -11,6 +11,7 @@  from __future__ import absolute_import, 
 import collections
 import contextlib
 import copy
+import difflib
 import email
 import errno
 import hashlib
@@ -2445,6 +2446,12 @@  def diffsinglehunkinline(hunklines):
     # re-split the content into words
     al = wordsplitter.findall(a)
     bl = wordsplitter.findall(b)
+    # if similarity index between word lists is not high enough, fall back to
+    # diffsinglehunk since word-diff coloring will be useless
+    if difflib.SequenceMatcher(None, al, bl).ratio() < 0.5:
+        for t in diffsinglehunk(hunklines):
+            yield t
+        return
     # re-arrange the words to lines since the diff algorithm is line-based
     aln = [s if s == '\n' else s + '\n' for s in al]
     bln = [s if s == '\n' else s + '\n' for s in bl]
diff --git a/tests/test-diff-color.t b/tests/test-diff-color.t
--- a/tests/test-diff-color.t
+++ b/tests/test-diff-color.t
@@ -304,6 +304,8 @@  test inline color diff
   > three of those lines will
   > collapse onto one
   > (to see if it works)
+  > 
+  > this line will almost completely change
   > EOF
   $ hg add file1
   $ hg ci -m 'commit'
@@ -326,12 +328,14 @@  test inline color diff
   > 
   > three of those lines have
   > collapsed onto one
+  > 
+  > so many things have changed in this line
   > EOF
   $ hg diff --config diff.word-diff=False --color=debug
   [diff.diffline|diff --git a/file1 b/file1]
   [diff.file_a|--- a/file1]
   [diff.file_b|+++ b/file1]
-  [diff.hunk|@@ -1,16 +1,17 @@]
+  [diff.hunk|@@ -1,18 +1,19 @@]
   [diff.deleted|-this is the first line]
   [diff.deleted|-this is the second line]
   [diff.deleted|-    third line starts with space]
@@ -360,11 +364,14 @@  test inline color diff
   [diff.deleted|-(to see if it works)]
   [diff.inserted|+three of those lines have]
   [diff.inserted|+collapsed onto one]
+   
+  [diff.deleted|-this line will almost completely change]
+  [diff.inserted|+so many things have changed in this line]
   $ hg diff --config diff.word-diff=True --color=debug
   [diff.diffline|diff --git a/file1 b/file1]
   [diff.file_a|--- a/file1]
   [diff.file_b|+++ b/file1]
-  [diff.hunk|@@ -1,16 +1,17 @@]
+  [diff.hunk|@@ -1,18 +1,19 @@]
   [diff.deleted|-][diff.deleted.changed|this][diff.deleted.unchanged| is the first ][diff.deleted.changed|line]
   [diff.deleted|-][diff.deleted.unchanged|this is the second line]
   [diff.deleted|-][diff.deleted.changed|    ][diff.deleted.unchanged|third line starts with space]
@@ -393,6 +400,9 @@  test inline color diff
   [diff.deleted|-][diff.deleted.changed|(to see if it works)]
   [diff.inserted|+][diff.inserted.unchanged|three of those lines ][diff.inserted.changed|have]
   [diff.inserted|+][diff.inserted.changed|collapsed][diff.inserted.unchanged| onto one]
+   
+  [diff.deleted|-this line will almost completely change]
+  [diff.inserted|+so many things have changed in this line]
 
 multibyte character shouldn't be broken up in word diff: