Patchwork [5,of,5,STABLE,V2] check-code: detect incorrect format operation on the string not including "%"

login
register
mail settings
Submitter Katsunori FUJIWARA
Date April 15, 2014, 6:12 p.m.
Message ID <6b72aa5a88ead264ff6a.1397585553@juju>
Download mbox | patch
Permalink /patch/4375/
State Deferred
Headers show

Comments

Katsunori FUJIWARA - April 15, 2014, 6:12 p.m.
# HG changeset patch
# User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
# Date 1397585100 -32400
#      Wed Apr 16 03:05:00 2014 +0900
# Branch stable
# Node ID 6b72aa5a88ead264ff6a6b17e9b4f14270198903
# Parent  565620d1b409dcd0c46c462e1ad66c622b2a8b85
check-code: detect incorrect format operation on the string not including "%"

This patch introduces new patterns to detect incorrect format
operations on the string not including any "%" specifiers: it causes
TypeError at runtime.

Patterns newly introduced by this patch treat format operations in
the style below:

    "format-string" % value

Newly introduced patterns base on the ones introduced by the preceding
patch for the similar problems in '("format string") % value' style.

Main difference between them is that this patch uses "[ xnopq]"
instead of "[^%']" to match against 'characters other than "%" in
string' correctly.

For example, Python interpreter (and programmers) can recognize two
strings (1) 'preceding string' and (2) '%s format' in the code below:

    func('preceding string', '%s format')

But regexp matching may (mis-)recognize that a string ', ' (not
including any "%" specifiers) and a formatting operation "%s" (from
'%s format') on it in this code unexpectedly.

After normalizing by "repquote()" in check-code.py, strings in Python
code doesn't contain characters other than ' ', 'x', 'n', 'o', 'p',
'q' or '%'. For example, the example code above is normalized like as
below:

    func('xxxxxxxxx xxxxxx', '%x xxxxxx')

Then , "'[^%']+'" still can't avoid unexpected matching against ', ',
but "'[ xnopq]+'" can. This is because of using "[ xnopq]" instead of
"[^%']" in this patch.

"[^%']" is simple enough for '("format-string") % value' style treated
in preceding patch, because anchoring to "(" and ")" avoids unexpected
matching described above.

Newly introduced patterns mis-recognize that format operation '%x' is
applied on the string ' ' in the example code below (after normalizing
by "repquote()"), because "[ xnopq]" matches against it: in the
example above, comma (",") between strings prevents from
mis-recognizing.

    print 'xxxxxxxxx xxxxxx' '%x xxxxxx'

Changing normalization result for ' ' in "repquote()" from ' ' to 'x'
(or so) can resolve this problem, but it breaks examinations for rst
syntax.

This patch chooses to show hint information "(or use '+' for string
concatenation to avoid ambiguous cases)", because this is very rare
case and current Mercurial implementation doesn't have such code.
Matt Mackall - April 17, 2014, 9:52 p.m.
On Wed, 2014-04-16 at 03:12 +0900, FUJIWARA Katsunori wrote:
> # HG changeset patch
> # User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
> # Date 1397585100 -32400
> #      Wed Apr 16 03:05:00 2014 +0900
> # Branch stable
> # Node ID 6b72aa5a88ead264ff6a6b17e9b4f14270198903
> # Parent  565620d1b409dcd0c46c462e1ad66c622b2a8b85
> check-code: detect incorrect format operation on the string not including "%"

I'm going to hold off on 4 and 5 as they catch any current breakage.
Please revisit this post-3.0, thanks.

Patch

diff --git a/contrib/check-code.py b/contrib/check-code.py
--- a/contrib/check-code.py
+++ b/contrib/check-code.py
@@ -204,6 +204,12 @@ 
      "don't apply % on non-format string"),
     (r"\([ \t\n]*(?:'(?:[^%']|%%)+'[ \t\n+]*)+\)[ \t\n]*%",
      "don't apply % on non-format string"),
+    (r'[^"\t +][ \t\n]*(?:"(?:[ xnopq]|%%)+"[ \t\n+]*)+%',
+     "don't apply % on non-format string\n" +
+     " (or use '+' for string concatenation to avoid ambiguous cases)"),
+    (r"[^'\t +][ \t\n]*(?:'(?:[ xnopq]|%%)+'[ \t\n+]*)+%",
+     "don't apply % on non-format string\n" +
+     " (or use '+' for string concatenation to avoid ambiguous cases)"),
     (r'(\w|\)),\w', "missing whitespace after ,"),
     (r'(\w|\))[+/*\-<>]\w', "missing whitespace in expression"),
     (r'^\s+(\w|\.)+=\w[^,()\n]*$', "missing whitespace in assignment"),
diff --git a/tests/test-check-code.t b/tests/test-check-code.t
--- a/tests/test-check-code.t
+++ b/tests/test-check-code.t
@@ -313,4 +313,54 @@ 
   ./invalid-formatting.py:11:
    > print ("%% doesn't work as percent") % v
    don't apply % on non-format string
-  [1]
+  ./invalid-formatting.py:14:
+   > print ("ambiguous case (" " %Y-%M-%D")
+   don't apply % on non-format string
+   (or use '+' for string concatenation to avoid ambiguous cases)
+  [1]
+
+  $ cat > ./invalid-formatting2.py <<EOF
+  > print "no percent character" % v
+  > print "no " "percent " "character" % v
+  > print "no " + "percent " + "character" % v
+  > 
+  > print "%% doesn't work as percent" % v
+  > 
+  > print "%s " "ambiguous " "case" % v
+  > print "ambiguous case, %% and %s" % v
+  > print "ambiguous case (", " %Y-%M-%D"
+  > 
+  > print "    ambiguous case"   " %Y-%M-%D"
+  > print "non-ambiguous case" + " %Y-%M-%D"
+  > 
+  > func("preceding string", "%s "
+  >      " valid"
+  >      " case"
+  >      % v,
+  >      "%s " +
+  >      "valid" +
+  >      "case"
+  >      % v)
+  > EOF
+  $ "$check_code" ./invalid-formatting2.py
+  ./invalid-formatting2.py:1:
+   > print "no percent character" % v
+   don't apply % on non-format string
+   (or use '+' for string concatenation to avoid ambiguous cases)
+  ./invalid-formatting2.py:2:
+   > print "no " "percent " "character" % v
+   don't apply % on non-format string
+   (or use '+' for string concatenation to avoid ambiguous cases)
+  ./invalid-formatting2.py:3:
+   > print "no " + "percent " + "character" % v
+   don't apply % on non-format string
+   (or use '+' for string concatenation to avoid ambiguous cases)
+  ./invalid-formatting2.py:5:
+   > print "%% doesn't work as percent" % v
+   don't apply % on non-format string
+   (or use '+' for string concatenation to avoid ambiguous cases)
+  ./invalid-formatting2.py:11:
+   > print "    ambiguous case"   " %Y-%M-%D"
+   don't apply % on non-format string
+   (or use '+' for string concatenation to avoid ambiguous cases)
+  [1]