Patchwork [6,of,8,V5] worker: make sure killworkers runs at most once

login
register
mail settings
Submitter Jun Wu
Date Nov. 15, 2016, 2:39 a.m.
Message ID <8402c91c250a9dd36929.1479177549@x1c>
Download mbox | patch
Permalink /patch/17580/
State Rejected
Headers show

Comments

Jun Wu - Nov. 15, 2016, 2:39 a.m.
# HG changeset patch
# User Jun Wu <quark@fb.com>
# Date 1479176697 0
#      Tue Nov 15 02:24:57 2016 +0000
# Node ID 8402c91c250a9dd369296dcdf00f7b50110ff6ae
# Parent  9dbb3532b173a980f341e41d9e96338a386364e5
# Available At https://bitbucket.org/quark-zju/hg-draft
#              hg pull https://bitbucket.org/quark-zju/hg-draft -r 8402c91c250a
worker: make sure killworkers runs at most once

With the next patch adding SIGCHILD handler calling
waitforworkers(blocking=False), the current code could call "killworkers"
multiple times:

    if st and not problem[0]:
        # at this time, another SIGCHILD happens, killworkers may run twice
        problem[0] = st
        killworkers()

This patch uses the property that "next(itertools.count())" is an "atomic
test and update" operation at the Python code level (cannot be interrupted
by a Python signal handler). That property is true for both CPython
(confirmed from source code) and PyPy (confirmed by PyPy dev arigato on
freenode #pypy channel).
Yuya Nishihara - Nov. 15, 2016, 2:02 p.m.
On Tue, 15 Nov 2016 02:39:09 +0000, Jun Wu wrote:
> # HG changeset patch
> # User Jun Wu <quark@fb.com>
> # Date 1479176697 0
> #      Tue Nov 15 02:24:57 2016 +0000
> # Node ID 8402c91c250a9dd369296dcdf00f7b50110ff6ae
> # Parent  9dbb3532b173a980f341e41d9e96338a386364e5
> # Available At https://bitbucket.org/quark-zju/hg-draft
> #              hg pull https://bitbucket.org/quark-zju/hg-draft -r 8402c91c250a
> worker: make sure killworkers runs at most once

> @@ -90,4 +91,5 @@ def _posixworker(ui, func, staticargs, a
>      signal.signal(signal.SIGINT, signal.SIG_IGN)
>      pids, problem = set(), [0]
> +    problemcount = itertools.count()
>      def killworkers():
>          # if one worker bails, there's no good reason to wait for the rest
> @@ -114,5 +116,8 @@ def _posixworker(ui, func, staticargs, a
>                  pids.remove(p)
>                  st = _exitstatus(st)
> -            if st and not problem[0]:
> +            # next(itertools.count()) cannot be interrupted by a Python signal
> +            # handler - true for both CPython and PyPy. So killworkers() runs
> +            # at most once.
> +            if st and next(problemcount) == 0:

I'm not sure if this atomic counter buys. Since we'll remove threading and
unregister SIGCHLD handler before killworkers(), the only concern would be
whether pids are collected by waitpid() or not. If waitforworkers() is
interrupted between os.waitpid() and pids.remove(), the waited pid could be
killed again. But this patch doesn't prevent it.

  two workers; the latter fails:
  # pids = [2, 3]
  SIGCHLD-2
    waitpid(2) -> p = 2, st = ok
    SIGCHLD-3
      waitpid(2) -> ECHLD
      waitpid(3) -> p = 3, st = error
      pids.remove(3)
      # pids = [2]
      killworkers()
    pids.remove(2)

  two workers; both fail:
  # pids = [2, 3]
  SIGCHLD-2
    waitpid(2) -> p = 2, st = error
    SIGCHLD-3
      waitpid(2) -> ECHLD
      waitpid(3) -> p = 3, st = error
      pids.remove(3)
      # pids = [2]
      killworkers()
    pids.remove(2)
    # pids = []
    killworkers()

Maybe we can discard pid if waitpid() fails with ECHLD?

The series looks good. I'm going to queue these patches. If you agree this
atomic counter is somewhat redundant, I'll drop it.
Jun Wu - Nov. 15, 2016, 10:28 p.m.
Excerpts from Yuya Nishihara's message of 2016-11-15 23:02:35 +0900:
> On Tue, 15 Nov 2016 02:39:09 +0000, Jun Wu wrote:
> > # HG changeset patch
> > # User Jun Wu <quark@fb.com>
> > # Date 1479176697 0
> > #      Tue Nov 15 02:24:57 2016 +0000
> > # Node ID 8402c91c250a9dd369296dcdf00f7b50110ff6ae
> > # Parent  9dbb3532b173a980f341e41d9e96338a386364e5
> > # Available At https://bitbucket.org/quark-zju/hg-draft 
> > #              hg pull https://bitbucket.org/quark-zju/hg-draft  -r 8402c91c250a
> > worker: make sure killworkers runs at most once
> 
> > @@ -90,4 +91,5 @@ def _posixworker(ui, func, staticargs, a
> >      signal.signal(signal.SIGINT, signal.SIG_IGN)
> >      pids, problem = set(), [0]
> > +    problemcount = itertools.count()
> >      def killworkers():
> >          # if one worker bails, there's no good reason to wait for the rest
> > @@ -114,5 +116,8 @@ def _posixworker(ui, func, staticargs, a
> >                  pids.remove(p)
> >                  st = _exitstatus(st)
> > -            if st and not problem[0]:
> > +            # next(itertools.count()) cannot be interrupted by a Python signal
> > +            # handler - true for both CPython and PyPy. So killworkers() runs
> > +            # at most once.
> > +            if st and next(problemcount) == 0:
> 
> I'm not sure if this atomic counter buys. Since we'll remove threading and
> unregister SIGCHLD handler before killworkers(), the only concern would be
> whether pids are collected by waitpid() or not. If waitforworkers() is
> interrupted between os.waitpid() and pids.remove(), the waited pid could be
> killed again. But this patch doesn't prevent it.

I agree that the only important thing is whether processes are waited or
not.

And I was aware of the situation you described (so I used the words "making
it harder" instead of "making it impossible" in patch 7).

However, this patch prevents killing the "unwaited" processes multiple
times, which is not that important but is "nice-to-have", I think.

Up to you for decision.

>   two workers; the latter fails:
>   # pids = [2, 3]
>   SIGCHLD-2
>     waitpid(2) -> p = 2, st = ok
>     SIGCHLD-3
>       waitpid(2) -> ECHLD
>       waitpid(3) -> p = 3, st = error
>       pids.remove(3)
>       # pids = [2]
>       killworkers()
>     pids.remove(2)
> 
>   two workers; both fail:
>   # pids = [2, 3]

If pids = [2, 3, 4, 5, 6, 7, 8]. killworkers may kill 4 to 8 twice without
this patch.

>   SIGCHLD-2
>     waitpid(2) -> p = 2, st = error
>     SIGCHLD-3
>       waitpid(2) -> ECHLD
>       waitpid(3) -> p = 3, st = error
>       pids.remove(3)
>       # pids = [2]
>       killworkers()
>     pids.remove(2)
>     # pids = []
>     killworkers()
> 
> Maybe we can discard pid if waitpid() fails with ECHLD?

If ECHILD happens, it means the child is waited by another "waitforworkers",
who is responsible to do "pids.remove(p)". Discarding pid when ECHILD
happens would cause one of them to raise KeyError.

> The series looks good. I'm going to queue these patches. If you agree this
> atomic counter is somewhat redundant, I'll drop it.

Thanks!
Yuya Nishihara - Nov. 16, 2016, 12:25 p.m.
On Tue, 15 Nov 2016 22:28:46 +0000, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2016-11-15 23:02:35 +0900:
> > On Tue, 15 Nov 2016 02:39:09 +0000, Jun Wu wrote:
> > > # HG changeset patch
> > > # User Jun Wu <quark@fb.com>
> > > # Date 1479176697 0
> > > #      Tue Nov 15 02:24:57 2016 +0000
> > > # Node ID 8402c91c250a9dd369296dcdf00f7b50110ff6ae
> > > # Parent  9dbb3532b173a980f341e41d9e96338a386364e5
> > > # Available At https://bitbucket.org/quark-zju/hg-draft 
> > > #              hg pull https://bitbucket.org/quark-zju/hg-draft  -r 8402c91c250a
> > > worker: make sure killworkers runs at most once
> > 
> > > @@ -90,4 +91,5 @@ def _posixworker(ui, func, staticargs, a
> > >      signal.signal(signal.SIGINT, signal.SIG_IGN)
> > >      pids, problem = set(), [0]
> > > +    problemcount = itertools.count()
> > >      def killworkers():
> > >          # if one worker bails, there's no good reason to wait for the rest
> > > @@ -114,5 +116,8 @@ def _posixworker(ui, func, staticargs, a
> > >                  pids.remove(p)
> > >                  st = _exitstatus(st)
> > > -            if st and not problem[0]:
> > > +            # next(itertools.count()) cannot be interrupted by a Python signal
> > > +            # handler - true for both CPython and PyPy. So killworkers() runs
> > > +            # at most once.
> > > +            if st and next(problemcount) == 0:
> > 
> > I'm not sure if this atomic counter buys. Since we'll remove threading and
> > unregister SIGCHLD handler before killworkers(), the only concern would be
> > whether pids are collected by waitpid() or not. If waitforworkers() is
> > interrupted between os.waitpid() and pids.remove(), the waited pid could be
> > killed again. But this patch doesn't prevent it.
> 
> I agree that the only important thing is whether processes are waited or
> not.
> 
> And I was aware of the situation you described (so I used the words "making
> it harder" instead of "making it impossible" in patch 7).
> 
> However, this patch prevents killing the "unwaited" processes multiple
> times, which is not that important but is "nice-to-have", I think.
> 
> Up to you for decision.
> 
> >   two workers; the latter fails:
> >   # pids = [2, 3]
> >   SIGCHLD-2
> >     waitpid(2) -> p = 2, st = ok
> >     SIGCHLD-3
> >       waitpid(2) -> ECHLD
> >       waitpid(3) -> p = 3, st = error
> >       pids.remove(3)
> >       # pids = [2]
> >       killworkers()
> >     pids.remove(2)
> > 
> >   two workers; both fail:
> >   # pids = [2, 3]
> 
> If pids = [2, 3, 4, 5, 6, 7, 8]. killworkers may kill 4 to 8 twice without
> this patch.

Yes, but killing active/zombie processes twice would be harmless. And I don't
think it's likely to happen enough we have to take special care with the atomic
counter.

> >   SIGCHLD-2
> >     waitpid(2) -> p = 2, st = error
> >     SIGCHLD-3
> >       waitpid(2) -> ECHLD
> >       waitpid(3) -> p = 3, st = error
> >       pids.remove(3)
> >       # pids = [2]
> >       killworkers()
> >     pids.remove(2)
> >     # pids = []
> >     killworkers()
> > 
> > Maybe we can discard pid if waitpid() fails with ECHLD?
> 
> If ECHILD happens, it means the child is waited by another "waitforworkers",
> who is responsible to do "pids.remove(p)". Discarding pid when ECHILD
> happens would cause one of them to raise Key Error.

I meant set.discard(). My idea is:

 a. discard pid by anyone who notices it no longer exists
 b. then, kill the rest

something like this:

  for pid in pids.copy():
      p, st = waitpid(pid, WNOHANG)
      if ECHILD:
          pids.discard(p)
      if p:
          pids.discard(p)
      if st:
          problem[0] = st
  if problem[0]:
      restore_signal()
      killworkers()
Jun Wu - Nov. 16, 2016, 1:57 p.m.
Excerpts from Yuya Nishihara's message of 2016-11-16 21:25:45 +0900:
> I meant set.discard(). My idea is:
> 
>  a. discard pid by anyone who notices it no longer exists
>  b. then, kill the rest

This looks good. Although it's still possible to kill processes multiple
times (because atomic "wait-and-drop-pid" is impossible).

> something like this:
> 
>   for pid in pids.copy():
>       p, st = waitpid(pid, WNOHANG)
>       if ECHILD:
>           pids.discard(p)
>       if p:
>           pids.discard(p)
>       if st:
>           problem[0] = st
>   if problem[0]:
>       restore_signal()
>       killworkers()
Yuya Nishihara - Nov. 16, 2016, 3 p.m.
On Wed, 16 Nov 2016 13:57:58 +0000, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2016-11-16 21:25:45 +0900:
> > I meant set.discard(). My idea is:
> > 
> >  a. discard pid by anyone who notices it no longer exists
> >  b. then, kill the rest
> 
> This looks good. Although it's still possible to kill processes multiple
> times (because atomic "wait-and-drop-pid" is impossible).

Hmm, signal handlers are stacked at arbitrary code location, there's no
time-based preemption like threading, I think. So even if the wait-discard(p)
sequence is interrupted by another waitforworkers(), the process "p" would
be discarded before killworkers() thanks to ECHILD. And we do unregister the
SIGCHLD handler before killworkers().

I'll queue the series without the patch 6.
Jun Wu - Nov. 16, 2016, 3:11 p.m.
Excerpts from Yuya Nishihara's message of 2016-11-17 00:00:01 +0900:
> On Wed, 16 Nov 2016 13:57:58 +0000, Jun Wu wrote:
> > Excerpts from Yuya Nishihara's message of 2016-11-16 21:25:45 +0900:
> > > I meant set.discard(). My idea is:
> > > 
> > >  a. discard pid by anyone who notices it no longer exists
> > >  b. then, kill the rest
> > 
> > This looks good. Although it's still possible to kill processes multiple
> > times (because atomic "wait-and-drop-pid" is impossible).
> 
> Hmm, signal handlers are stacked at arbitrary code location, there's no
> time-based preemption like threading, I think. So even if the wait-discard(p)
> sequence is interrupted by another waitforworkers(), the process "p" would
> be discarded before killworkers() thanks to ECHILD. And we do unregister the
> SIGCHLD handler before killworkers().
> 
> I'll queue the series without the patch 6.

That will break Python < 2.7.4 if iterfile workaround is not ready.
Hopefully I can figure out the final solution today.

Worst case, we test Python version and fallback to the old code in worker.py
Jun Wu - Nov. 16, 2016, 3:18 p.m.
Excerpts from Yuya Nishihara's message of 2016-11-17 00:00:01 +0900:
> On Wed, 16 Nov 2016 13:57:58 +0000, Jun Wu wrote:
> > Excerpts from Yuya Nishihara's message of 2016-11-16 21:25:45 +0900:
> > > I meant set.discard(). My idea is:
> > > 
> > >  a. discard pid by anyone who notices it no longer exists
> > >  b. then, kill the rest
> > 
> > This looks good. Although it's still possible to kill processes multiple
> > times (because atomic "wait-and-drop-pid" is impossible).
> 
> Hmm, signal handlers are stacked at arbitrary code location, there's no
> time-based preemption like threading, I think. So even if the wait-discard(p)
> sequence is interrupted by another waitforworkers(), the process "p" would

Only one "p" will be discard. There could be multiple "p"s that need to be
discarded.

  waitforworkers
    p1: need to be discarded
      SIGCHLD
        waitforworkers
        p2: need to be discarded
      SIGCHLD
        waitforworkers
        p3: need to be discarded
      SIGCHLD
        waitforworkers
        p4: need to be discarded
      ....
    discard p1
    killworkers
      # kill p2, p3, p4 unnecessarily

As long as there is no way to do "atomic waitpid + set.discard", the problem
can not be solved perfectly.

> be discarded before killworkers() thanks to ECHILD. And we do unregister the
> SIGCHLD handler before killworkers().
> 
> I'll queue the series without the patch 6.
Yuya Nishihara - Nov. 16, 2016, 3:27 p.m.
On Wed, 16 Nov 2016 15:11:27 +0000, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2016-11-17 00:00:01 +0900:
> > On Wed, 16 Nov 2016 13:57:58 +0000, Jun Wu wrote:
> > > Excerpts from Yuya Nishihara's message of 2016-11-16 21:25:45 +0900:
> > > > I meant set.discard(). My idea is:
> > > > 
> > > >  a. discard pid by anyone who notices it no longer exists
> > > >  b. then, kill the rest
> > > 
> > > This looks good. Although it's still possible to kill processes multiple
> > > times (because atomic "wait-and-drop-pid" is impossible).
> > 
> > Hmm, signal handlers are stacked at arbitrary code location, there's no
> > time-based preemption like threading, I think. So even if the wait-discard(p)
> > sequence is interrupted by another waitforworkers(), the process "p" would
> > be discarded before killworkers() thanks to ECHILD. And we do unregister the
> > SIGCHLD handler before killworkers().
> > 
> > I'll queue the series without the patch 6.
> 
> That will break Python < 2.7.4 if iterfile workaround is not ready.
> Hopefully I can figure out the final solution today.

Oops, I've just pushed this. Hope you'll fix the problem soon.
Yuya Nishihara - Nov. 16, 2016, 3:34 p.m.
On Wed, 16 Nov 2016 15:18:07 +0000, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2016-11-17 00:00:01 +0900:
> > On Wed, 16 Nov 2016 13:57:58 +0000, Jun Wu wrote:
> > > Excerpts from Yuya Nishihara's message of 2016-11-16 21:25:45 +0900:
> > > > I meant set.discard(). My idea is:
> > > > 
> > > >  a. discard pid by anyone who notices it no longer exists
> > > >  b. then, kill the rest
> > > 
> > > This looks good. Although it's still possible to kill processes multiple
> > > times (because atomic "wait-and-drop-pid" is impossible).
> > 
> > Hmm, signal handlers are stacked at arbitrary code location, there's no
> > time-based preemption like threading, I think. So even if the wait-discard(p)
> > sequence is interrupted by another waitforworkers(), the process "p" would
> 
> Only one "p" will be discard. There could be multiple "p"s that need to be
> discarded.
> 
>   waitforworkers
>     p1: need to be discarded
>       SIGCHLD
>         waitforworkers
>         p2: need to be discarded
>       SIGCHLD
>         waitforworkers

Here waitforworkers() should discard p2 because waitpid(p2) would raise ECHILD.

>         p3: need to be discarded
>       SIGCHLD
>         waitforworkers
>         p4: need to be discarded
>       ....
>     discard p1
>     killworkers
>       # kill p2, p3, p4 unnecessarily
> 
> As long as there is no way to do "atomic waitpid + set.discard", the problem
> can not be solved perfectly.
> 
> > be discarded before killworkers() thanks to ECHILD. And we do unregister the
> > SIGCHLD handler before killworkers().
Jun Wu - Nov. 16, 2016, 3:52 p.m.
Excerpts from Yuya Nishihara's message of 2016-11-17 00:34:28 +0900:
> On Wed, 16 Nov 2016 15:18:07 +0000, Jun Wu wrote:
> > Excerpts from Yuya Nishihara's message of 2016-11-17 00:00:01 +0900:
> > > On Wed, 16 Nov 2016 13:57:58 +0000, Jun Wu wrote:
> > > > Excerpts from Yuya Nishihara's message of 2016-11-16 21:25:45 +0900:
> > > > > I meant set.discard(). My idea is:
> > > > > 
> > > > >  a. discard pid by anyone who notices it no longer exists
> > > > >  b. then, kill the rest
> > > > 
> > > > This looks good. Although it's still possible to kill processes multiple
> > > > times (because atomic "wait-and-drop-pid" is impossible).
> > > 
> > > Hmm, signal handlers are stacked at arbitrary code location, there's no
> > > time-based preemption like threading, I think. So even if the wait-discard(p)
> > > sequence is interrupted by another waitforworkers(), the process "p" would
> > 
> > Only one "p" will be discard. There could be multiple "p"s that need to be
> > discarded.
> > 
> >   waitforworkers
> >     p1: need to be discarded
> >       SIGCHLD
> >         waitforworkers
> >         p2: need to be discarded
> >       SIGCHLD
> >         waitforworkers
> 
> Here waitforworkers() should discard p2 because waitpid(p2) would raise ECHILD.

You're correct in this case. It's a bit more complex because the iteration
order of a "set" is undefined. "set.discard" may change the order, so the
nested "waitforworkers" may not visit the same pid visited by the outer
"waitforworkers". sorted(pids.copy()) would guarantee the order. If we take
all known measures (itertools.count, set.discard, sorted), it seems we can
make sure there is at most 1 process who will receive an unnecessary kill.

I think the current code is good enough in terms of correctness. It's also a
bit stricter - only successful waitpid will remove p from the set - if a
(unrealistic) buggy OS allows us to wait a same pid twice, we will complain.
I'm going to leave it as-is.

> >         p3: need to be discarded
> >       SIGCHLD
> >         waitforworkers
> >         p4: need to be discarded
> >       ....
> >     discard p1
> >     killworkers
> >       # kill p2, p3, p4 unnecessarily
> > 
> > As long as there is no way to do "atomic waitpid + set.discard", the problem
> > can not be solved perfectly.
> > 
> > > be discarded before killworkers() thanks to ECHILD. And we do unregister the
> > > SIGCHLD handler before killworkers().
Yuya Nishihara - Nov. 17, 2016, 1:16 p.m.
On Wed, 16 Nov 2016 15:52:24 +0000, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2016-11-17 00:34:28 +0900:
> > On Wed, 16 Nov 2016 15:18:07 +0000, Jun Wu wrote:
> > > Excerpts from Yuya Nishihara's message of 2016-11-17 00:00:01 +0900:
> > > > On Wed, 16 Nov 2016 13:57:58 +0000, Jun Wu wrote:
> > > > > Excerpts from Yuya Nishihara's message of 2016-11-16 21:25:45 +0900:
> > > > > > I meant set.discard(). My idea is:
> > > > > > 
> > > > > >  a. discard pid by anyone who notices it no longer exists
> > > > > >  b. then, kill the rest
> > > > > 
> > > > > This looks good. Although it's still possible to kill processes multiple
> > > > > times (because atomic "wait-and-drop-pid" is impossible).
> > > > 
> > > > Hmm, signal handlers are stacked at arbitrary code location, there's no
> > > > time-based preemption like threading, I think. So even if the wait-discard(p)
> > > > sequence is interrupted by another waitforworkers(), the process "p" would
> > > 
> > > Only one "p" will be discard. There could be multiple "p"s that need to be
> > > discarded.
> > > 
> > >   waitforworkers
> > >     p1: need to be discarded
> > >       SIGCHLD
> > >         waitforworkers
> > >         p2: need to be discarded
> > >       SIGCHLD
> > >         waitforworkers
> > 
> > Here waitforworkers() should discard p2 because waitpid(p2) would raise ECHILD.
> 
> You're correct in this case. It's a bit more complex because the iteration
> order of a "set" is undefined. "set.discard" may change the order, so the
> nested "waitforworkers" may not visit the same pid visited by the outer
> "waitforworkers". sorted(pids.copy()) would guarantee the order. If we take
> all known measures (itertools.count, set.discard, sorted), it seems we can
> make sure there is at most 1 process who will receive an unnecessary kill.

The order doesn't matter. The inner waitforworkers() will check all pids
before killing some of them. I'll send patches soon.

Patch

diff --git a/mercurial/worker.py b/mercurial/worker.py
--- a/mercurial/worker.py
+++ b/mercurial/worker.py
@@ -9,4 +9,5 @@  from __future__ import absolute_import
 
 import errno
+import itertools
 import os
 import signal
@@ -90,4 +91,5 @@  def _posixworker(ui, func, staticargs, a
     signal.signal(signal.SIGINT, signal.SIG_IGN)
     pids, problem = set(), [0]
+    problemcount = itertools.count()
     def killworkers():
         # if one worker bails, there's no good reason to wait for the rest
@@ -114,5 +116,8 @@  def _posixworker(ui, func, staticargs, a
                 pids.remove(p)
                 st = _exitstatus(st)
-            if st and not problem[0]:
+            # next(itertools.count()) cannot be interrupted by a Python signal
+            # handler - true for both CPython and PyPy. So killworkers() runs
+            # at most once.
+            if st and next(problemcount) == 0:
                 problem[0] = st
                 killworkers()