Patchwork [1,of,2,STABLE,V3] util: add "shellsplit()" to centralize the logic to split command line

login
register
mail settings
Submitter Katsunori FUJIWARA
Date Dec. 5, 2014, 12:14 p.m.
Message ID <73b764f8fdf2c33ad567.1417781676@juju>
Download mbox | patch
Permalink /patch/7012/
State Superseded
Headers show

Comments

Katsunori FUJIWARA - Dec. 5, 2014, 12:14 p.m.
# HG changeset patch
# User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
# Date 1417781174 -32400
#      Fri Dec 05 21:06:14 2014 +0900
# Branch stable
# Node ID 73b764f8fdf2c33ad56795ef2513d9687bd14734
# Parent  57d35d3c1cf170513cc150c5021f5e3e0d7cdafb
util: add "shellsplit()" to centralize the logic to split command line

This is the preparation for fixing issue4463.

This patch uses StringIO object to know exactly where the first shell
delimiter character is placed in the case of "all=False", because
"shlex.next()" returns the de-quoted string and this makes eliminating
it from original string difficult in some complicated cases.

For example, "shlex.next()" for '"foo"/"foo" bar baz' returns 'foo/foo'.

On the other hand, "StringIO.tell()" allows us to know exactly what
characters are not yet read in by shlex.

This patch uses StringIO object also in the case of "all=True" but
this doesn't have any problematic impact, because the shlex
constructor itself creates StringIO object internally, if string
object is passed as the parse target.

Another patch series for "default" branch will replace existing
"shlex.split()" invocations by "util.shellsplit()" (and add new
"check-code.py" rule to prevent from using "shlex.split()").
Mads Kiilerich - Dec. 5, 2014, 2:54 p.m.
On 12/05/2014 01:14 PM, FUJIWARA Katsunori wrote:
> # HG changeset patch
> # User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
> # Date 1417781174 -32400
> #      Fri Dec 05 21:06:14 2014 +0900
> # Branch stable
> # Node ID 73b764f8fdf2c33ad56795ef2513d9687bd14734
> # Parent  57d35d3c1cf170513cc150c5021f5e3e0d7cdafb
> util: add "shellsplit()" to centralize the logic to split command line

It is not clear to me: What can this function do that shlex.split cannot?

Is it also relevant when on posix systems or could it (or most of the 
implementation) live in windows.py?

/Mads

> This is the preparation for fixing issue4463.
>
> This patch uses StringIO object to know exactly where the first shell
> delimiter character is placed in the case of "all=False", because
> "shlex.next()" returns the de-quoted string and this makes eliminating
> it from original string difficult in some complicated cases.
>
> For example, "shlex.next()" for '"foo"/"foo" bar baz' returns 'foo/foo'.
>
> On the other hand, "StringIO.tell()" allows us to know exactly what
> characters are not yet read in by shlex.
>
> This patch uses StringIO object also in the case of "all=True" but
> this doesn't have any problematic impact, because the shlex
> constructor itself creates StringIO object internally, if string
> object is passed as the parse target.
>
> Another patch series for "default" branch will replace existing
> "shlex.split()" invocations by "util.shellsplit()" (and add new
> "check-code.py" rule to prevent from using "shlex.split()").
>
Katsunori FUJIWARA - Dec. 6, 2014, 2:21 p.m.
At Fri, 05 Dec 2014 15:54:19 +0100,
Mads Kiilerich wrote:
> 
> On 12/05/2014 01:14 PM, FUJIWARA Katsunori wrote:
> > # HG changeset patch
> > # User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
> > # Date 1417781174 -32400
> > #      Fri Dec 05 21:06:14 2014 +0900
> > # Branch stable
> > # Node ID 73b764f8fdf2c33ad56795ef2513d9687bd14734
> > # Parent  57d35d3c1cf170513cc150c5021f5e3e0d7cdafb
> > util: add "shellsplit()" to centralize the logic to split command line
> 
> It is not clear to me: What can this function do that shlex.split cannot?

"shlex.split" can't split the specified string into "the 1st component
(as the command to be executed)" and the rest. It always split whole
string into components, and discards quoting.

It is important that "the rest" keeps quoting in the configuration
file, because "shlex.split + util.shellquote" combination may change
meaning of original configuration.

For example: 

  - stupid Windows application can't recognize quoted option correctly

    We know this problem as issue4463: 'winmerge /r ....' works
    correctly, but 'winmerge "/r" ...' doesn't.

    In this case, making "windows.shellquote" quote only components
    containing space characters is effective.

  - environment variable may contain space characters

    (this problem can occur also on POSIX platform)

    when FOOBAR='foo bar':

    $FOOBAR   => foo bar   (2 arguments)
    "$FOOBAR" => "foo bar" (1 argument)

    In this case, "quote only components which contain space
    characters" isn't effective, because there is no space character
    in the focused component before expanding $FOOBAR.

    The root cause of this problem is that "shlex.split +
    util.shellquote" combination loses original context in the
    configuration.

  - "shlex.split + util.shellquote" can't recognize separation
    characters (and redirections, too)

    (this problem can occur also on POSIX platform)

    For example, 'foo 1&&bar 2&&baz 3' is splitted into ['foo',
    '1&&bar', '2&&baz','3'] by "shlex.split".

    "quote only ..." may be effective in this case.

    For another example:

      foo "&&" bar "&&" baz => "foo" takes 4 arguments
      foo && bar && baz     => "foo", "bar" and "baz" are executed separately

    "shlex.split + util.shellquote" combination can't distinguish
    between them correctly.

    "quote only ..." isn't effective in this case.


> Is it also relevant when on posix systems or could it (or most of the 
> implementation) live in windows.py?

"shellsplit()" should be called intendedly (at least in this issue
case), because the caller should use "the rest" without any quoting
(and "shlex.split"-ing).

And as described above, some problems can occur also on POSIX platform.

> /Mads
> 
> > This is the preparation for fixing issue4463.
> >
> > This patch uses StringIO object to know exactly where the first shell
> > delimiter character is placed in the case of "all=False", because
> > "shlex.next()" returns the de-quoted string and this makes eliminating
> > it from original string difficult in some complicated cases.
> >
> > For example, "shlex.next()" for '"foo"/"foo" bar baz' returns 'foo/foo'.
> >
> > On the other hand, "StringIO.tell()" allows us to know exactly what
> > characters are not yet read in by shlex.
> >
> > This patch uses StringIO object also in the case of "all=True" but
> > this doesn't have any problematic impact, because the shlex
> > constructor itself creates StringIO object internally, if string
> > object is passed as the parse target.
> >
> > Another patch series for "default" branch will replace existing
> > "shlex.split()" invocations by "util.shellsplit()" (and add new
> > "check-code.py" rule to prevent from using "shlex.split()").
> >
> 
> 

----------------------------------------------------------------------
[FUJIWARA Katsunori]                             foozy@lares.dti.ne.jp
Mads Kiilerich - Dec. 7, 2014, 3:23 p.m.
Thanks for summarizing the background for this tricky issue. I can see 
how something like these changes are the only way to move forward with 
extdiff.

The changes are however also big and tricky and adds complexity. I get 
the feeling that it probably not will be the last time we find issues in 
this area.

Could we take a step back and do things differently and thus avoid the 
problem? Could we avoid parsing commandlines and options? Couldn't we 
just maintain 'cmdline' as a string where we can substitute or add 
parameters when necessary? (I guess that pretty much also is what Yuya 
suggested.)

As far as I can see, we only parse the command line and options in order 
to always have the command path separate from the options. And it seems 
like the only _real_ reason we do that is that if the configuration for 
CMD doesn't have any options, then we must append the value of 
diff-tools.CMD.diffargs / merge-tools.CMD.diffargs . shlex.split might 
be imperfect but should be capable of telling us whether there are any 
options or not.

/Mads
Pierre-Yves David - Dec. 14, 2014, 8:53 p.m.
As far as I understand, this series is/will be super seeded by something 
different and I'me dropping it from patchwork.

Patch

diff --git a/mercurial/util.py b/mercurial/util.py
--- a/mercurial/util.py
+++ b/mercurial/util.py
@@ -19,7 +19,7 @@  import error, osutil, encoding
 import errno, shutil, sys, tempfile, traceback
 import re as remod
 import os, time, datetime, calendar, textwrap, signal, collections
-import imp, socket, urllib
+import imp, socket, urllib, shlex, cStringIO
 
 if os.name == 'nt':
     import windows as platform
@@ -609,6 +609,49 @@  def _sethgexecutable(path):
     global _hgexecutable
     _hgexecutable = path
 
+def shellsplit(cmdline, all=True):
+    '''Split a command line into components
+
+    If ``all`` is true (default), this splits ``cmdline`` into
+    components, and returns list of them.
+
+    Otherwise, this splits ``cmdline`` only at the first shell
+    delimiter character, and returns ``(command, rest)`` tuple.
+
+    Code paths splitting the string from configuration files into each
+    components should use this instead of ``shlex.split``, because the
+    latter loses whether users really want to quote each components or
+    not (see issue4463 for detail).
+
+    >>> shellsplit('foo bar baz')
+    ['foo', 'bar', 'baz']
+    >>> shellsplit('foo', all=False)
+    ('foo', '')
+    >>> shellsplit('foo   bar baz', all=False)
+    ('foo', 'bar baz')
+    >>> shellsplit('"foo foo"', all=False)
+    ('foo foo', '')
+    >>> shellsplit('"foo foo" bar baz', all=False)
+    ('foo foo', 'bar baz')
+    >>> shellsplit('"foo foo"   "bar" baz', all=False)
+    ('foo foo', '"bar" baz')
+    >>> shellsplit('foo "bar" baz', all=False)
+    ('foo', '"bar" baz')
+    >>> shellsplit('"foo"/"foo" "bar" baz', all=False)
+    ('foo/foo', '"bar" baz')
+    '''
+    stream = cStringIO.StringIO(cmdline)
+    # According to "shlex.split" implementation, ``posix`` is True
+    # even on Windows
+    lex = shlex.shlex(stream, posix=True)
+    lex.whitespace_split = True
+    lex.commenters = ''
+
+    if all:
+        return list(lex)
+    else:
+        return (lex.next(), cmdline[stream.tell():].lstrip())
+
 def system(cmd, environ={}, cwd=None, onerr=None, errprefix=None, out=None):
     '''enhanced shell command execution.
     run with environment maybe modified, maybe in different dir.