Patchwork patch: do not break up multibyte character when highlighting word

login
register
mail settings
Submitter Yuya Nishihara
Date Dec. 11, 2017, 1:57 p.m.
Message ID <82c3762349acec00498f.1513000656@mimosa>
Download mbox | patch
Permalink /patch/26215/
State Accepted
Headers show

Comments

Yuya Nishihara - Dec. 11, 2017, 1:57 p.m.
# HG changeset patch
# User Yuya Nishihara <yuya@tcha.org>
# Date 1512999511 -32400
#      Mon Dec 11 22:38:31 2017 +0900
# Node ID 82c3762349acec00498f71cc463ff2751ba036f1
# Parent  dfae14354660d240517a37cdeb785dd587d343fa
patch: do not break up multibyte character when highlighting word

This changes {\W} to {\W - any 8bit characters} so that multibyte sequences
are taken as words. Since we don't know the encoding of user content, this
is the most sensible definition of a non-word.
Augie Fackler - Dec. 11, 2017, 2:41 p.m.
> On Dec 11, 2017, at 08:57, Yuya Nishihara <yuya@tcha.org> wrote:
> 
> # HG changeset patch
> # User Yuya Nishihara <yuya@tcha.org>
> # Date 1512999511 -32400
> #      Mon Dec 11 22:38:31 2017 +0900
> # Node ID 82c3762349acec00498f71cc463ff2751ba036f1
> # Parent  dfae14354660d240517a37cdeb785dd587d343fa
> patch: do not break up multibyte character when highlighting word

queued, thanks

Patch

diff --git a/mercurial/patch.py b/mercurial/patch.py
--- a/mercurial/patch.py
+++ b/mercurial/patch.py
@@ -46,6 +46,7 @@  stringio = util.stringio
 
 gitre = re.compile(br'diff --git a/(.*) b/(.*)')
 tabsplitter = re.compile(br'(\t+|[^\t]+)')
+_nonwordre = re.compile(br'([^a-zA-Z0-9_\x80-\xff])')
 
 PatchError = error.PatchError
 
@@ -2578,7 +2579,7 @@  def _inlinediff(s1, s2, operation):
         raise error.ProgrammingError("Case not expected, operation = %s" %
                                      operation)
 
-    s = difflib.ndiff(re.split(br'(\W)', s2), re.split(br'(\W)', s1))
+    s = difflib.ndiff(_nonwordre.split(s2), _nonwordre.split(s1))
     for part in s:
         if part[0] in operation_skip or len(part) == 2:
             continue
diff --git a/tests/test-diff-color.t b/tests/test-diff-color.t
--- a/tests/test-diff-color.t
+++ b/tests/test-diff-color.t
@@ -370,3 +370,23 @@  test inline color diff
   [diff.deleted|-(to see if it works)]
   [diff.inserted|+three of those lines ][diff.inserted.highlight|have]
   [diff.inserted|+][diff.inserted.highlight|collapsed][diff.inserted| onto one]
+
+multibyte character shouldn't be broken up in word diff:
+
+  $ $PYTHON <<'EOF'
+  > with open("utf8", "wb") as f:
+  >     f.write(b"blah \xe3\x82\xa2 blah\n")
+  > EOF
+  $ hg ci -Am 'add utf8 char' utf8
+  $ $PYTHON <<'EOF'
+  > with open("utf8", "wb") as f:
+  >     f.write(b"blah \xe3\x82\xa4 blah\n")
+  > EOF
+  $ hg ci -m 'slightly change utf8 char' utf8
+  $ hg diff --config experimental.worddiff=True --color=debug -c.
+  [diff.diffline|diff --git a/utf8 b/utf8]
+  [diff.file_a|--- a/utf8]
+  [diff.file_b|+++ b/utf8]
+  [diff.hunk|@@ -1,1 +1,1 @@]
+  [diff.deleted|-blah ][diff.deleted.highlight|\xe3\x82\xa2][diff.deleted| blah] (esc)
+  [diff.inserted|+blah ][diff.inserted.highlight|\xe3\x82\xa4][diff.inserted| blah] (esc)