Patchwork [V2] server: add an error feedback mechanism for when the daemon fails to launch

login
register
mail settings
Submitter Matt Harbison
Date March 31, 2018, 1:34 a.m.
Message ID <74bbfa1b88c69471713b.1522460042@Envy>
Download mbox | patch
Permalink /patch/30032/
State Accepted
Headers show

Comments

Matt Harbison - March 31, 2018, 1:34 a.m.
# HG changeset patch
# User Matt Harbison <matt_harbison@yahoo.com>
# Date 1522210269 14400
#      Wed Mar 28 00:11:09 2018 -0400
# Node ID 74bbfa1b88c69471713bcaa0c8124df4ce93783c
# Parent  b9dd8403d8ff4b36847ce405c0db127ca64491a1
server: add an error feedback mechanism for when the daemon fails to launch

There's a recurring problem on Windows where `hg serve -d` will randomly fail to
spawn a detached process.  The reason for the failure is completely hidden, and
it takes hours to get a single failure on my laptop.  All this does is redirect
stdout/stderr of the child to a file until the lock file is freed, and then the
parent dumps it out if it fails to spawn.

I chose to put the output into the lock file because that is always cleaned up.
There's no way to report errors after that anyway.  On Windows, killdaemons.py
is roughly `kill -9`, so this ensures that junk won't pile up.

This may end up being a case of EADDRINUSE.  At least that's what I saw spit out
a few times (among other odd errors and missing output on Windows).  But I also
managed to get the same thing on Fedora 26 by running test-hgwebdir.t with
--loop -j10 for several hours.  Running `netstat` immediately after killing that
run printed a wall of sockets in the TIME_WAIT state, which were gone a couple
seconds later.  I couldn't match up ports that failed, because --loop doesn't
print out the message about the port that was used.  So maybe the fix is to
rotate the use of HGPORT[12] in the tests.  But, let's collect some more data
first.
Yuya Nishihara - March 31, 2018, 2:20 a.m.
On Fri, 30 Mar 2018 21:34:02 -0400, Matt Harbison wrote:
> # HG changeset patch
> # User Matt Harbison <matt_harbison@yahoo.com>
> # Date 1522210269 14400
> #      Wed Mar 28 00:11:09 2018 -0400
> # Node ID 74bbfa1b88c69471713bcaa0c8124df4ce93783c
> # Parent  b9dd8403d8ff4b36847ce405c0db127ca64491a1
> server: add an error feedback mechanism for when the daemon fails to launch

Queued, thanks.

> +    # When daemonized on Windows, redirect stdout/stderr to the lockfile (which
> +    # gets cleaned up after the child is up and running), so that the parent can
> +    # read and print the error if this child dies early.  See 594dd384803c.  On
> +    # other platforms, the child can write to the parent's stdio directly, until
> +    # it is redirected prior to runfn().
> +    if pycompat.iswindows and opts['daemon_postexec']:
> +        for inst in opts['daemon_postexec']:
> +            if inst.startswith('unlink:'):
> +                lockpath = inst[7:]
> +                if os.path.exists(lockpath):
> +                    procutil.stdout.flush()
> +                    procutil.stderr.flush()
> +
> +                    fd = os.open(lockpath,
> +                                 os.O_WRONLY | os.O_APPEND | os.O_BINARY)
> +                    try:
> +                        os.dup2(fd, 1)
> +                        os.dup2(fd, 2)

Nit: procutil.stdout/stderr.fileno() will look slightly nicer.

>              if pid < 0:
> +                # If the daemonized process managed to write out an error msg,
> +                # report it.
> +                if pycompat.iswindows and os.path.exists(lockpath):
> +                    with open(lockpath) as log:

Nit: 'rb'

> +
> +        lockpath = None
>          for inst in opts['daemon_postexec']:
>              if inst.startswith('unlink:'):
>                  lockpath = inst[7:]
> -                os.unlink(lockpath)
>              elif inst.startswith('chdir:'):
>                  os.chdir(inst[6:])
>              elif inst != 'none':
> @@ -107,6 +135,11 @@ def runservice(opts, parentfn=None, init
>          if logfile and logfilefd not in (0, 1, 2):
>              os.close(logfilefd)
>  
> +        # Only unlink after redirecting stdout/stderr, so Windows doesn't
> +        # complain about a sharing violation.
> +        if lockpath:
> +            os.unlink(lockpath)

Perhaps we should build a dict of 'daemon_postexec' instructions instead of
parsing them twice. This "lockpath" change makes them effectively unordered.

Patch

diff --git a/mercurial/server.py b/mercurial/server.py
--- a/mercurial/server.py
+++ b/mercurial/server.py
@@ -30,6 +30,27 @@  def runservice(opts, parentfn=None, init
                runargs=None, appendpid=False):
     '''Run a command as a service.'''
 
+    # When daemonized on Windows, redirect stdout/stderr to the lockfile (which
+    # gets cleaned up after the child is up and running), so that the parent can
+    # read and print the error if this child dies early.  See 594dd384803c.  On
+    # other platforms, the child can write to the parent's stdio directly, until
+    # it is redirected prior to runfn().
+    if pycompat.iswindows and opts['daemon_postexec']:
+        for inst in opts['daemon_postexec']:
+            if inst.startswith('unlink:'):
+                lockpath = inst[7:]
+                if os.path.exists(lockpath):
+                    procutil.stdout.flush()
+                    procutil.stderr.flush()
+
+                    fd = os.open(lockpath,
+                                 os.O_WRONLY | os.O_APPEND | os.O_BINARY)
+                    try:
+                        os.dup2(fd, 1)
+                        os.dup2(fd, 2)
+                    finally:
+                        os.close(fd)
+
     def writepid(pid):
         if opts['pid_file']:
             if appendpid:
@@ -61,6 +82,12 @@  def runservice(opts, parentfn=None, init
                 return not os.path.exists(lockpath)
             pid = procutil.rundetached(runargs, condfn)
             if pid < 0:
+                # If the daemonized process managed to write out an error msg,
+                # report it.
+                if pycompat.iswindows and os.path.exists(lockpath):
+                    with open(lockpath) as log:
+                        for line in log:
+                            procutil.stderr.write(line)
                 raise error.Abort(_('child process failed to start'))
             writepid(pid)
         finally:
@@ -81,10 +108,11 @@  def runservice(opts, parentfn=None, init
             os.setsid()
         except AttributeError:
             pass
+
+        lockpath = None
         for inst in opts['daemon_postexec']:
             if inst.startswith('unlink:'):
                 lockpath = inst[7:]
-                os.unlink(lockpath)
             elif inst.startswith('chdir:'):
                 os.chdir(inst[6:])
             elif inst != 'none':
@@ -107,6 +135,11 @@  def runservice(opts, parentfn=None, init
         if logfile and logfilefd not in (0, 1, 2):
             os.close(logfilefd)
 
+        # Only unlink after redirecting stdout/stderr, so Windows doesn't
+        # complain about a sharing violation.
+        if lockpath:
+            os.unlink(lockpath)
+
     if runfn:
         return runfn()