Patchwork [STABLE] tests: escape bytes setting MSB in input of grep for portability

Submitter Katsunori FUJIWARA
Date May 20, 2016, 5:53 p.m.
Message ID <8c5e880c7e25e94354d3.1463766831@juju>
Permalink /patch/15178/
State Accepted

Comments

Katsunori FUJIWARA - May 20, 2016, 5:53 p.m.
# HG changeset patch
# User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
# Date 1463766531 -32400
#      Sat May 21 02:48:51 2016 +0900
# Branch stable
# Node ID 8c5e880c7e25e94354d312d582d2ba19ca419423
# Parent  854556c5f3bf6493a99481a355c5112b2ea0ed37
tests: escape bytes setting MSB in input of grep for portability

GNU grep (2.21-2 or later) assumes that its input is encoded as
specified by LC_CTYPE, and treats the input as binary if it contains
a byte sequence that is not valid in that encoding.

For example, if the locale is C, a byte with the most significant
bit (MSB) set makes such a GNU grep print a "Binary file <FILENAME>
matches" message instead of the matched lines.

This behavior is recognized as a bug and is fixed in GNU grep 2.25-1
and later, but some distributions ship an affected version (e.g.
Ubuntu xenial, which is used by the launchpad buildbot).

    http://debbugs.gnu.org/cgi/bugreport.cgi?bug=19230
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800670
    http://packages.ubuntu.com/xenial/grep

This causes test-commit-interactive.t to fail, because it has applied
grep to a CP932 byte sequence since 1111e84de635.

Explicitly setting LC_CTYPE for CP932 might cause another problem,
however, because it can't be assumed that every environment running
Mercurial tests allows an arbitrary locale setting.

To resolve this issue, this patch escapes bytes with the MSB set in
the input of grep.

For this purpose:

  - str.encode('string-escape') isn't useful, because it also escapes
    control codes (less than 0x20), which complicates EOL handling

  - "f --hexdump" isn't useful, because it isn't line-oriented

  - "sed -n" seems reasonable, but "sed" itself sometimes causes
    portability issues, too (e.g. 900767dfa80d or afb86ee925bf)

This patch is posted with the "stable" flag, because 1111e84de635 is
on the stable branch.
Anton Shestakov - May 21, 2016, 2:07 a.m.
Passes on a Xenial box in Vagrant.
Augie Fackler - May 23, 2016, 7:13 p.m.
On Sat, May 21, 2016 at 02:53:51AM +0900, FUJIWARA Katsunori wrote:
> # HG changeset patch
> # User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
> # Date 1463766531 -32400
> #      Sat May 21 02:48:51 2016 +0900
> # Branch stable
> # Node ID 8c5e880c7e25e94354d312d582d2ba19ca419423
> # Parent  854556c5f3bf6493a99481a355c5112b2ea0ed37
> tests: escape bytes setting MSB in input of grep for portability

queued for stable, thanks


Patch

diff --git a/tests/test-commit-interactive.t b/tests/test-commit-interactive.t
--- a/tests/test-commit-interactive.t
+++ b/tests/test-commit-interactive.t
@@ -895,11 +895,24 @@  This tests that translated help message 
   $ LANGUAGE=ja
   $ export LANGUAGE
 
-  $ hg commit -i --encoding cp932 2>&1 <<EOF | grep '^y - '
+  $ cat > $TESTTMP/escape.py <<EOF
+  > from __future__ import absolute_import
+  > import sys
+  > def escape(c):
+  >     o = ord(c)
+  >     if o < 0x80:
+  >         return c
+  >     else:
+  >         return r'\x%02x' % o # escape char setting MSB
+  > for l in sys.stdin:
+  >     sys.stdout.write(''.join(escape(c) for c in l))
+  > EOF
+
+  $ hg commit -i --encoding cp932 2>&1 <<EOF | python $TESTTMP/escape.py | grep '^y - '
   > ?
   > q
   > EOF
-  y - \x82\xb1\x82\xcc\x95\xcf\x8dX\x82\xf0\x8bL\x98^(yes) (esc)
+  y - \x82\xb1\x82\xcc\x95\xcf\x8dX\x82\xf0\x8bL\x98^(yes)
 
   $ LANGUAGE=
 #endif
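
The escape.py helper in the patch iterates over str characters, which
matches Python 2 semantics. Under Python 3, iterating over bytes
yields integers, so the helper would likely need adjusting. The
following is a minimal sketch of the same idea operating on bytes. It
is not part of the patch: the names escape_msb and filter_stream are
hypothetical, and it assumes Python 3.5 or later for bytes
%-formatting.

```python
import sys


def escape_msb(data: bytes) -> bytes:
    """Replace every byte with the MSB set (>= 0x80) by a literal \\xNN.

    Bytes below 0x80, including control codes and the newline, are
    passed through unchanged, so the output stays line-oriented.
    """
    out = bytearray()
    for b in data:  # iterating over bytes yields ints in Python 3
        if b < 0x80:
            out.append(b)
        else:
            out += b"\\x%02x" % b  # four ASCII bytes: backslash, x, two hex digits
    return bytes(out)


def filter_stream(src, dst) -> None:
    """Copy binary stream src to dst line by line, escaping MSB bytes."""
    for line in src:
        dst.write(escape_msb(line))


# To use this as a pipeline filter (hypothetical usage), call:
#     filter_stream(sys.stdin.buffer, sys.stdout.buffer)
```

Because only bytes at or above 0x80 are rewritten, the output is pure
ASCII and grep in the C locale has no reason to classify it as binary,
while EOL handling is left untouched.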