Patchwork [2,of,2] worker: use os._exit for posix worker in all cases

Submitter Jun Wu
Date Nov. 24, 2016, 1:17 a.m.
Message ID <6c8735bfbf00b224e3f1.1479950270@x1c>
Permalink /patch/17741/
State Accepted

Comments

Jun Wu - Nov. 24, 2016, 1:17 a.m.
# HG changeset patch
# User Jun Wu <quark@fb.com>
# Date 1479950134 0
#      Thu Nov 24 01:15:34 2016 +0000
# Node ID 6c8735bfbf00b224e3f158242e1078d0fe667a42
# Parent  c76f0d4bdee6bfbd7bda771d5c05939d1d4cb132
# Available At https://bitbucket.org/quark-zju/hg-draft
#              hg pull https://bitbucket.org/quark-zju/hg-draft -r 6c8735bfbf00
worker: use os._exit for posix worker in all cases

Like commandserver, the worker should never run other resource cleanup logic.

Previously this was not true for workers that raised exceptions other than
KeyboardInterrupt.

This actually caused a real-world deadlock with remotefilelog:

1. remotefilelog/fileserverclient creates a sshpeer. pipei/o/e get created.
2. worker inherits that sshpeer's pipei/o/e.
3. worker runs sshpeer.cleanup (only happens without os._exit)
4. worker closes pipeo/i, which would normally make the ssh process read EOF
   from its stdin and exit. But the master process still has pipeo open, so
   no EOF is delivered.
5. worker reads pipee (stderr of the ssh process), which never completes
   because the ssh process does not exit and so never closes its stderr.
6. master waits for all workers, which never completes because they never
   complete sshpeer.cleanup.
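
Steps 4 and 5 hinge on one pipe property: a reader sees EOF only after every
descriptor for the write end has been closed. Below is a minimal Python 3
sketch of that behaviour. It uses os.dup to stand in for the copy that a
forked worker shares with the master, and the helper names are illustrative,
not part of Mercurial.

```python
import os
import select


def read_would_return(rfd, timeout=0.2):
    """Return True if a read on rfd would return now (data or EOF)."""
    ready, _, _ = select.select([rfd], [], [], timeout)
    return bool(ready)


def demo():
    rfd, wfd = os.pipe()
    # Assumption: os.dup models the second reference to the write end that
    # exists after fork (one held by the master, one by the worker).
    master_copy = os.dup(wfd)

    os.close(wfd)                      # the "worker" closes its copy
    before = read_would_return(rfd)    # no EOF: master_copy is still open

    os.close(master_copy)              # the "master" closes its copy
    after = read_would_return(rfd)     # now the reader can see EOF
    data = os.read(rfd, 1)             # an empty read signals EOF
    os.close(rfd)
    return before, after, data
```

In the deadlock above the master never reaches the equivalent of the second
close, because it is blocked waiting for the worker.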

This could also be addressed by closing these fds after fork, which is not
easy because Python 2.x does not have an official "afterfork" hook. Hacking
os.fork is also ugly. Besides, sshpeer is probably not the only troublemaker.
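
For comparison, Python 3.7 and later do provide such a hook,
os.register_at_fork. The following is only a sketch of the fd-closing
alternative under that assumption. It is not what this patch does, and the
registry and function names are hypothetical.

```python
import os

# Hypothetical registry of descriptors that a forked child should not keep.
_fds_to_close_in_child = set()


def _close_inherited_fds():
    # Runs in the child immediately after os.fork() returns there.
    for fd in list(_fds_to_close_in_child):
        try:
            os.close(fd)
        except OSError:
            pass  # already closed


# Python 3.7+ only; registered hooks cannot be unregistered.
os.register_at_fork(after_in_child=_close_inherited_fds)


def child_sees_fd_closed():
    """Fork once and report whether the child found the write end closed."""
    rfd, wfd = os.pipe()
    _fds_to_close_in_child.add(wfd)
    pid = os.fork()
    if pid == 0:
        try:
            os.fstat(wfd)
            code = 1  # descriptor is still open in the child
        except OSError:
            code = 0  # the hook closed it
        os._exit(code)
    _, wstatus = os.waitpid(pid, 0)
    _fds_to_close_in_child.discard(wfd)
    os.close(rfd)
    os.close(wfd)
    return os.WEXITSTATUS(wstatus) == 0
```

Even with such a hook, every module that opens pipes would have to register
its descriptors, which is the "not the only troublemaker" concern.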

The patch changes _posixworker so all its code paths will use os._exit to
avoid running unwanted resource clean-ups.
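
The shape of the change can be sketched outside Mercurial as follows. This is
a simplified Python 3 illustration, not the patched code: run_in_worker is a
hypothetical helper, and traceback.print_exc stands in for ui.traceback.

```python
import os
import traceback


def run_in_worker(func):
    """Fork, run func in the child, and return the child's exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: every path ends in os._exit, so atexit handlers and other
        # interpreter-level cleanup inherited from the parent never run.
        status = 255
        try:
            func()
            status = 0
        except BaseException:
            # Report the failure, but never let it propagate out of the child.
            traceback.print_exc()
        finally:
            os._exit(status)
    # Parent: wait for the child and decode its exit status.
    _, wstatus = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wstatus)
```

A func that returns normally yields status 0, and one that raises any
exception (including SystemExit) yields 255.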
Jun Wu - Nov. 24, 2016, 1:22 a.m.
The previous version is at https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-August/087058.html
That was 3 months ago, so I didn't mark the current one as V2. I changed the
title a bit so that any automation handles this correctly.

Bryan O'Sullivan - Nov. 29, 2016, 4:30 p.m.
On Wed, Nov 23, 2016 at 5:17 PM, Jun Wu <quark@fb.com> wrote:

> worker: use os._exit for posix worker in all cases
>

This looks good to me. Thanks!

Patch

diff --git a/mercurial/worker.py b/mercurial/worker.py
--- a/mercurial/worker.py
+++ b/mercurial/worker.py
@@ -16,4 +16,5 @@  from .i18n import _
 from . import (
     error,
+    scmutil,
     util,
 )
@@ -133,13 +134,24 @@  def _posixworker(ui, func, staticargs, a
             signal.signal(signal.SIGINT, oldhandler)
             signal.signal(signal.SIGCHLD, oldchldhandler)
-            try:
+
+            def workerfunc():
                 os.close(rfd)
                 for i, item in func(*(staticargs + (pargs,))):
                     os.write(wfd, '%d %s\n' % (i, item))
-                os._exit(0)
+
+            # make sure we use os._exit in all code paths. otherwise the worker
+            # may do some clean-ups which could cause surprises like deadlock.
+            # see sshpeer.cleanup for example.
+            try:
+                scmutil.callcatch(ui, workerfunc)
             except KeyboardInterrupt:
                 os._exit(255)
-                # other exceptions are allowed to propagate, we rely
-                # on lock.py's pid checks to avoid release callbacks
+            except: # never return, therefore no re-raises
+                try:
+                    ui.traceback()
+                finally:
+                    os._exit(255)
+            else:
+                os._exit(0)
         pids.add(pid)
     os.close(wfd)