Patchwork [1,of,2,STABLE] worker: be silent if killed by the main process

login
register
mail settings
Submitter Jun Wu
Date April 20, 2017, 4:18 p.m.
Message ID <f0bd43ca0d5659a1e49b.1492705081@x1c>
Download mbox | patch
Permalink /patch/20273/
State Changes Requested
Headers show

Comments

Jun Wu - April 20, 2017, 4:18 p.m.
# HG changeset patch
# User Jun Wu <quark@fb.com>
# Date 1492704007 25200
#      Thu Apr 20 09:00:07 2017 -0700
# Node ID f0bd43ca0d5659a1e49bf4948a9d3d0f30a69e34
# Parent  59afb0750aecaff6c2b2e4edaab04eb91eca8a88
# Available At https://bitbucket.org/quark-zju/hg-draft
#              hg pull https://bitbucket.org/quark-zju/hg-draft -r f0bd43ca0d56
worker: be silent if killed by the main process

Previously, worker.py will send SIGTERM to remaining workers once a failure
of another worker is noticed. Workers receiving SIGTERM will go through
scmutil.callcatch, and may have extra outputs like "killed!" and a
traceback, which breaks test-worker.t.

Besides, it's hard to predicate how many "killed!" or traceback should be
printed if we keep that behavior.

Since the main process will notice a worker failure and handle it
accordingly, it makes sense to kill the remaining workers silently.
Therefore this patch adds a backdoor signal - SIGUSR2 to allow a silent
SIGTERM kill.

This makes test-worker.t less flaky. The next patch will make test-worker.t
stronger (and easier to break without this patch).
Yuya Nishihara - April 21, 2017, 11:43 a.m.
On Thu, 20 Apr 2017 09:18:01 -0700, Jun Wu wrote:
> # HG changeset patch
> # User Jun Wu <quark@fb.com>
> # Date 1492704007 25200
> #      Thu Apr 20 09:00:07 2017 -0700
> # Node ID f0bd43ca0d5659a1e49bf4948a9d3d0f30a69e34
> # Parent  59afb0750aecaff6c2b2e4edaab04eb91eca8a88
> # Available At https://bitbucket.org/quark-zju/hg-draft
> #              hg pull https://bitbucket.org/quark-zju/hg-draft -r f0bd43ca0d56
> worker: be silent if killed by the main process
> 
> Previously, worker.py will send SIGTERM to remaining workers once a failure
> of another worker is noticed. Workers receiving SIGTERM will go through
> scmutil.callcatch, and may have extra outputs like "killed!" and a
> traceback, which breaks test-worker.t.

Can't we simply move the SignalInterrupt handling to dispatch? SignalInterrupt
is a sub-type of KeyboardInterrupt, so it doesn't make sense to handle only
SignalInterrupt in scmutil.callcatch().

I noticed "interrupted!" message isn't printed when worker is involved, but
that is a minor issue we don't have to fix in the stable release.

> Besides, it's hard to predicate how many "killed!" or traceback should be
> printed if we keep that behavior.

Well, too many "killed!" messages are bad, but traceback is for developers.
I don't think it should be suppressed just for test stability.

> Since the main process will notice a worker failure and handle it
> accordingly, it makes sense to kill the remaining workers silently.
> Therefore this patch adds a backdoor signal - SIGUSR2 to allow a silent
> SIGTERM kill.
> 
> This makes test-worker.t less flaky. The next patch will make test-worker.t
> stronger (and easier to break without this patch).
> 
> diff --git a/mercurial/worker.py b/mercurial/worker.py
> --- a/mercurial/worker.py
> +++ b/mercurial/worker.py
> @@ -100,4 +100,6 @@ def _posixworker(ui, func, staticargs, a
>          for p in pids:
>              try:
> +                # SIGUSR2 makes a worker silent for the upcoming SIGTERM
> +                os.kill(p, signal.SIGUSR2)
>                  os.kill(p, signal.SIGTERM)
>              except OSError as err:
> @@ -135,4 +137,11 @@ def _posixworker(ui, func, staticargs, a
>      oldchldhandler = signal.signal(signal.SIGCHLD, sigchldhandler)
>      ui.flush()
> +    # when the main process notices an error (non-0 exit code), remaining
> +    # workers will be killed without causing noisy output ("killed!" and a
> +    # traceback). this is done by a special SIGUSR2.
> +    def silentsigterm(signum, frame):
> +        # avoid scmutil.callcatch code path which could be noisy
> +        signal.signal(signal.SIGTERM, signal.SIG_DFL)

The default handler of SIGTERM kills the process, which shouldn't be used.
Jun Wu - April 21, 2017, 4:06 p.m.
Excerpts from Yuya Nishihara's message of 2017-04-21 20:43:48 +0900:
> Can't we simply move the SignalInterrupt handling to dispatch?
> SignalInterrupt is a sub-type of KeyboardInterrupt, so it doesn't make
> sense to handle only SignalInterrupt in scmutil.callcatch().

Signal could arrive before scmutil.callcatch, just after os.fork().

> I noticed "interrupted!" message isn't printed when worker is involved,
> but that is a minor issue we don't have to fix in the stable release.

The test fails constantly in our CI. But I can send V2 after freeze.
 
> > Besides, it's hard to predicate how many "killed!" or traceback should be
> > printed if we keep that behavior.
> 
> Well, too many "killed!" messages are bad, but traceback is for developers.
> I don't think it should be suppressed just for test stability.
> 
> The default handler of SIGTERM kills the process, which shouldn't be used.

The new plan is to use SIGUSR2 to "kill" workers, SIGUSR2 handler is to do
sys.exit. So "killed!" is removed and traceback is kept.
Augie Fackler - April 21, 2017, 7:41 p.m.
On Fri, Apr 21, 2017 at 09:06:44AM -0700, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2017-04-21 20:43:48 +0900:
> > Can't we simply move the SignalInterrupt handling to dispatch?
> > SignalInterrupt is a sub-type of KeyboardInterrupt, so it doesn't make
> > sense to handle only SignalInterrupt in scmutil.callcatch().
>
> Signal could arrive before scmutil.callcatch, just after os.fork().
>
> > I noticed "interrupted!" message isn't printed when worker is involved,
> > but that is a minor issue we don't have to fix in the stable release.
>
> The test fails constantly in our CI. But I can send V2 after freeze.
>
> > > Besides, it's hard to predicate how many "killed!" or traceback should be
> > > printed if we keep that behavior.
> >
> > Well, too many "killed!" messages are bad, but traceback is for developers.
> > I don't think it should be suppressed just for test stability.
> >
> > The default handler of SIGTERM kills the process, which shouldn't be used.
>
> The new plan is to use SIGUSR2 to "kill" workers, SIGUSR2 handler is to do
> sys.exit. So "killed!" is removed and traceback is kept.

Should we be planning to take these for 4.2, or wait and do them in early 4.3 cycle?

> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Jun Wu - April 21, 2017, 7:43 p.m.
Excerpts from Augie Fackler's message of 2017-04-21 15:41:39 -0400:
> > The new plan is to use SIGUSR2 to "kill" workers, SIGUSR2 handler is to do
> > sys.exit. So "killed!" is removed and traceback is kept.
> 
> Should we be planning to take these for 4.2, or wait and do them in early
> 4.3 cycle?

Probably after freeze. It mainly affects CI.
Augie Fackler - April 21, 2017, 7:44 p.m.
On Fri, Apr 21, 2017 at 12:43:36PM -0700, Jun Wu wrote:
> Excerpts from Augie Fackler's message of 2017-04-21 15:41:39 -0400:
> > > The new plan is to use SIGUSR2 to "kill" workers, SIGUSR2 handler is to do
> > > sys.exit. So "killed!" is removed and traceback is kept.
> >
> > Should we be planning to take these for 4.2, or wait and do them in early
> > 4.3 cycle?
>
> Probably after freeze. It mainly affects CI.

Works for me. I'll drop them for now then.
Yuya Nishihara - April 22, 2017, 3:18 a.m.
On Fri, 21 Apr 2017 09:06:44 -0700, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2017-04-21 20:43:48 +0900:
> > Can't we simply move the SignalInterrupt handling to dispatch?
> > SignalInterrupt is a sub-type of KeyboardInterrupt, so it doesn't make
> > sense to handle only SignalInterrupt in scmutil.callcatch().
> 
> Signal could arrive before scmutil.callcatch, just after os.fork().

Yeah, maybe that's why SIGINT is temporarily masked.

> > I noticed "interrupted!" message isn't printed when worker is involved,
> > but that is a minor issue we don't have to fix in the stable release.
> 
> The test fails constantly in our CI. But I can send V2 after freeze.
>  
> > > Besides, it's hard to predicate how many "killed!" or traceback should be
> > > printed if we keep that behavior.
> > 
> > Well, too many "killed!" messages are bad, but traceback is for developers.
> > I don't think it should be suppressed just for test stability.
> > 
> > The default handler of SIGTERM kills the process, which shouldn't be used.
> 
> The new plan is to use SIGUSR2 to "kill" workers, SIGUSR2 handler is to do
> sys.exit. So "killed!" is removed and traceback is kept.

What's the key difference from just making SIGTERM handler silent?
Jun Wu - April 22, 2017, 4:31 a.m.
Excerpts from Yuya Nishihara's message of 2017-04-22 12:18:52 +0900:
> > The new plan is to use SIGUSR2 to "kill" workers, SIGUSR2 handler is to do
> > sys.exit. So "killed!" is removed and traceback is kept.
> 
> What's the key difference from just making SIGTERM handler silent?

No "killed!" is printed. And users killing them manually will still get
"killed!".
Yuya Nishihara - April 22, 2017, 3:55 p.m.
On Fri, 21 Apr 2017 21:31:05 -0700, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2017-04-22 12:18:52 +0900:
> > > The new plan is to use SIGUSR2 to "kill" workers, SIGUSR2 handler is to do
> > > sys.exit. So "killed!" is removed and traceback is kept.
> > 
> > What's the key difference from just making SIGTERM handler silent?
> 
> No "killed!" is printed. And users killing them manually will still get
> "killed!".

I think that is the same sort of problem as SIGINT, which is no "interrupted!"
is printed right now but we can't make worker processes print it. I guess
both SIGINT and SIGTERM issues can be solved in the same way.

That said, there should be no problem using SIGUSR2 for more explicit control
of worker processes.
Jun Wu - April 22, 2017, 8:34 p.m.
Excerpts from Yuya Nishihara's message of 2017-04-23 00:55:45 +0900:
> On Fri, 21 Apr 2017 21:31:05 -0700, Jun Wu wrote:
> > Excerpts from Yuya Nishihara's message of 2017-04-22 12:18:52 +0900:
> > > > The new plan is to use SIGUSR2 to "kill" workers, SIGUSR2 handler is to do
> > > > sys.exit. So "killed!" is removed and traceback is kept.
> > > 
> > > What's the key difference from just making SIGTERM handler silent?
> > 
> > No "killed!" is printed. And users killing them manually will still get
> > "killed!".
> 
> I think that is the same sort of problem as SIGINT, which is no
> "interrupted!" is printed right now but we can't make worker processes
> print it. I guess both SIGINT and SIGTERM issues can be solved in the same
> way.

I finally get it. tl;dr I prefer moving SignalInterrupt to dispatch.py and
will do that.

The two approaches only differ in the case where a manual SIGTERM hits a
worker, the SIGUSR2 way will print "killed!" as it was before. But I think
not printing "killed!" in all cases is cleaner.

> That said, there should be no problem using SIGUSR2 for more explicit
> control of worker processes.
Jun Wu - April 22, 2017, 8:54 p.m.
Excerpts from Jun Wu's message of 2017-04-22 13:34:52 -0700:
> Excerpts from Yuya Nishihara's message of 2017-04-23 00:55:45 +0900:
> > On Fri, 21 Apr 2017 21:31:05 -0700, Jun Wu wrote:
> > > Excerpts from Yuya Nishihara's message of 2017-04-22 12:18:52 +0900:
> > > > > The new plan is to use SIGUSR2 to "kill" workers, SIGUSR2 handler is to do
> > > > > sys.exit. So "killed!" is removed and traceback is kept.
> > > > 
> > > > What's the key difference from just making SIGTERM handler silent?
> > > 
> > > No "killed!" is printed. And users killing them manually will still get
> > > "killed!".
> > 
> > I think that is the same sort of problem as SIGINT, which is no
> > "interrupted!" is printed right now but we can't make worker processes
> > print it. I guess both SIGINT and SIGTERM issues can be solved in the same
> > way.
> 
> I finally get it. tl;dr I prefer moving SignalInterrupt to dispatch.py and
> will do that.

But I suspect if that works. Since the signal could arrive before the main
"try" block in the worker process. "killed!" may still be printed because
it's caught by dispatch.py and the worker "try except" code is not executed
yet.

I also think the current worker code could escape the "always os._exit"
assumption if a signal hits before "try".

So it seems we want to set a global exception handler before calling
os.fork(). I'll do that instead.

> The two approaches only differ in the case where a manual SIGTERM hits a
> worker, the SIGUSR2 way will print "killed!" as it was before. But I think
> not printing "killed!" in all cases is cleaner.
> 
> > That said, there should be no problem using SIGUSR2 for more explicit
> > control of worker processes.
Jun Wu - April 23, 2017, 12:20 a.m.
Excerpts from Jun Wu's message of 2017-04-22 13:54:50 -0700:
> > I finally get it. tl;dr I prefer moving SignalInterrupt to dispatch.py
> > and will do that.
> 
> But I suspect if that works. Since the signal could arrive before the main
> "try" block in the worker process. "killed!" may still be printed because
> it's caught by dispatch.py and the worker "try except" code is not executed
> yet.
>
> I also think the current worker code could escape the "always os._exit"
> assumption if a signal hits before "try".

FWIW, I had a fix ready: https://bpaste.net/show/2dbbed3fa647 which I think 
is the most correct way to fix all issues discovered in this thread. It does
not use SIGUSR2. I'll send them after freeze.
 
> So it seems we want to set a global exception handler before calling
> os.fork(). I'll do that instead.
> 
> > The two approaches only differ in the case where a manual SIGTERM hits a
> > worker, the SIGUSR2 way will print "killed!" as it was before. But I think
> > not printing "killed!" in all cases is cleaner.
> > 
> > > That said, there should be no problem using SIGUSR2 for more explicit
> > > control of worker processes.
Yuya Nishihara - April 23, 2017, 10:56 a.m.
On Sat, 22 Apr 2017 17:20:14 -0700, Jun Wu wrote:
> Excerpts from Jun Wu's message of 2017-04-22 13:54:50 -0700:
> > > I finally get it. tl;dr I prefer moving SignalInterrupt to dispatch.py
> > > and will do that.
> > 
> > But I suspect if that works. Since the signal could arrive before the main
> > "try" block in the worker process. "killed!" may still be printed because
> > it's caught by dispatch.py and the worker "try except" code is not executed
> > yet.
> >
> > I also think the current worker code could escape the "always os._exit"
> > assumption if a signal hits before "try".
> 
> FWIW, I had a fix ready: https://bpaste.net/show/2dbbed3fa647 which I think 
> is the most correct way to fix all issues discovered in this thread. It does
> not use SIGUSR2. I'll send them after freeze.

These look good to me, thanks. I think the first patch is simple enough to
include in stable if that helps solving the CI issue.

A few nits:

 - pid could be initialized to a null value instead of comparing to parentpid

   pid = -1
   try:
       pid = os.fork()
       ...
   finally:
       if pid == 0:
           os._exit()

 - BaseException can be used to catch any exception object
Jun Wu - April 23, 2017, 12:08 p.m.
Excerpts from Yuya Nishihara's message of 2017-04-23 19:56:13 +0900:
> These look good to me, thanks. I think the first patch is simple enough to
> include in stable if that helps solving the CI issue.
> 
> A few nits:
> 
>  - pid could be initialized to a null value instead of comparing to parentpid
> 
>    pid = -1
>    try:
>        pid = os.fork()
>        ...
>    finally:
>        if pid == 0:

This won't pass the test. If fork() is completed, but "pid = os.fork()"
assignment is not, pid still have the old value and will have trouble. I
think it's safer to not assume whether Python implementation will make it
atomic or not.

>            os._exit()
> 
>  - BaseException can be used to catch any exception object

According to http://stackoverflow.com/a/7161517 , it will miss Python2
old-style classes.
Yuya Nishihara - April 23, 2017, 1:57 p.m.
On Sun, 23 Apr 2017 05:08:48 -0700, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2017-04-23 19:56:13 +0900:
> > These look good to me, thanks. I think the first patch is simple enough to
> > include in stable if that helps solving the CI issue.
> > 
> > A few nits:
> > 
> >  - pid could be initialized to a null value instead of comparing to parentpid
> > 
> >    pid = -1
> >    try:
> >        pid = os.fork()
> >        ...
> >    finally:
> >        if pid == 0:
> 
> This won't pass the test. If fork() is completed, but "pid = os.fork()"
> assignment is not, pid still have the old value and will have trouble. I
> think it's safer to not assume whether Python implementation will make it
> atomic or not.

Maybe you're right. "call" and "store" are separated at interpreter level.

> >            os._exit()
> > 
> >  - BaseException can be used to catch any exception object
> 
> According to http://stackoverflow.com/a/7161517 , it will miss Python2
> old-style classes.

Doh.

Patch

diff --git a/mercurial/worker.py b/mercurial/worker.py
--- a/mercurial/worker.py
+++ b/mercurial/worker.py
@@ -100,4 +100,6 @@  def _posixworker(ui, func, staticargs, a
         for p in pids:
             try:
+                # SIGUSR2 makes a worker silent for the upcoming SIGTERM
+                os.kill(p, signal.SIGUSR2)
                 os.kill(p, signal.SIGTERM)
             except OSError as err:
@@ -135,4 +137,11 @@  def _posixworker(ui, func, staticargs, a
     oldchldhandler = signal.signal(signal.SIGCHLD, sigchldhandler)
     ui.flush()
+    # when the main process notices an error (non-0 exit code), remaining
+    # workers will be killed without causing noisy output ("killed!" and a
+    # traceback). this is done by a special SIGUSR2.
+    def silentsigterm(signum, frame):
+        # avoid scmutil.callcatch code path which could be noisy
+        signal.signal(signal.SIGTERM, signal.SIG_DFL)
+    oldusr2handler = signal.signal(signal.SIGUSR2, silentsigterm)
     for pargs in partition(args, workers):
         pid = os.fork()
@@ -167,4 +176,5 @@  def _posixworker(ui, func, staticargs, a
                 os._exit(ret & 255)
         pids.add(pid)
+    signal.signal(signal.SIGUSR2, oldusr2handler)
     os.close(wfd)
     fp = os.fdopen(rfd, pycompat.sysstr('rb'), 0)