Patchwork [2,of,2] lock: include Linux pid namespace identifier in prefix

login
register
mail settings
Submitter Jun Wu
Date Feb. 10, 2017, 10:09 p.m.
Message ID <1884652829f6e68bde6a.1486764575@localhost.localdomain>
Download mbox | patch
Permalink /patch/18416/
State Accepted
Headers show

Comments

Jun Wu - Feb. 10, 2017, 10:09 p.m.
# HG changeset patch
# User Jun Wu <quark@fb.com>
# Date 1486763791 28800
#      Fri Feb 10 13:56:31 2017 -0800
# Node ID 1884652829f6e68bde6ab3e9b0e9ca1da9bed40c
# Parent  f8eb762559b242ec12b199db1ce27c2930881c75
# Available At https://bitbucket.org/quark-zju/hg-draft
#              hg pull https://bitbucket.org/quark-zju/hg-draft -r 1884652829f6
lock: include Linux pid namespace identifier in prefix

Previously, the lock only contains a hostname as an attempt to detect pid
namespace difference. However, that's not enough on modern Linux - a single
hostname could have different pid namespaces.

That means if people run hg inside different PID namespaces with a same UTS
namespae, the lock would be broken - an hg proccess in pid namespace A will
think the lock having a "random" pid in pid namespace B is "dead" and remove
it.

This patch solves the above issue by appending an PID namespace identifier of
the current process to the lock prefix ("hostname"). It depends on /proc
being mounted properly. But I don't think there is a better way to get pid
namespace identifier reliably.
Jun Wu - Feb. 11, 2017, 12:50 a.m.
I'd like to note that although this patch prevents repo corruption when
running hg inside different containers (which has different pid namespaces),
it does not prevent deadlock - if an hg process is SIGKILL-ed, every other
process will not able to take or remove the lock.

I think if we do know the repo is not on NFS, and the system supports
flock(), flock() is way more robust and solves all kinds of pain here.

I hereby propose a new repo requirement "flock", once set, use flock instead
of the traditional lock file. It's off by default. Thoughts?

Excerpts from Jun Wu's message of 2017-02-10 14:09:35 -0800:
> # HG changeset patch
> # User Jun Wu <quark@fb.com>
> # Date 1486763791 28800
> #      Fri Feb 10 13:56:31 2017 -0800
> # Node ID 1884652829f6e68bde6ab3e9b0e9ca1da9bed40c
> # Parent  f8eb762559b242ec12b199db1ce27c2930881c75
> # Available At https://bitbucket.org/quark-zju/hg-draft 
> #              hg pull https://bitbucket.org/quark-zju/hg-draft  -r 1884652829f6
> lock: include Linux pid namespace identifier in prefix
> 
> Previously, the lock only contains a hostname as an attempt to detect pid
> namespace difference. However, that's not enough on modern Linux - a single
> hostname could have different pid namespaces.
> 
> That means if people run hg inside different PID namespaces with a same UTS
> namespae, the lock would be broken - an hg proccess in pid namespace A will
> think the lock having a "random" pid in pid namespace B is "dead" and remove
> it.
> 
> This patch solves the above issue by appending an PID namespace identifier of
> the current process to the lock prefix ("hostname"). It depends on /proc
> being mounted properly. But I don't think there is a better way to get pid
> namespace identifier reliably.
> 
> diff --git a/mercurial/lock.py b/mercurial/lock.py
> --- a/mercurial/lock.py
> +++ b/mercurial/lock.py
> @@ -10,4 +10,5 @@ from __future__ import absolute_import
>  import contextlib
>  import errno
> +import os
>  import socket
>  import time
> @@ -16,4 +17,5 @@ import warnings
>  from . import (
>      error,
> +    pycompat,
>      util,
>  )
> @@ -23,7 +25,15 @@ def _getlockprefix():
>  
>      It's useful to detect "dead" processes and remove stale locks with
> -    confidence. Typically it's just hostname.
> +    confidence. Typically it's just hostname. On modern linux, we include an
> +    extra Linux-specific pid namespace identifier.
>      """
> -    return socket.gethostname()
> +    result = socket.gethostname()
> +    if pycompat.sysplatform.startswith('linux'):
> +        try:
> +            result += '/%x' % os.stat('/proc/self/ns/pid').st_ino
> +        except OSError as ex:
> +            if ex.errno not in (errno.ENOENT, errno.EACCES, errno.ENOTDIR):
> +                raise
> +    return result
>  
>  class lock(object):
Augie Fackler - Feb. 13, 2017, 7:55 p.m.
On Fri, Feb 10, 2017 at 04:50:55PM -0800, Jun Wu wrote:
> I'd like to note that although this patch prevents repo corruption when
> running hg inside different containers (which has different pid namespaces),
> it does not prevent deadlock - if an hg process is SIGKILL-ed, every other
> process will not able to take or remove the lock.

Sigh. Thanks for the fix, queued (and a fist shaken at this weird/poor
choice from linux containers.)

>
> I think if we do know the repo is not on NFS, and the system supports
> flock(), flock() is way more robust and solves all kinds of pain here.
>
> I hereby propose a new repo requirement "flock", once set, use flock instead
> of the traditional lock file. It's off by default. Thoughts?

I'm...not categorically opposed to it, though it feels pretty risky. I
know git doesn't use flock() either - presumably there's a good reason
I don't know about that neither tool relies on it?

>
> Excerpts from Jun Wu's message of 2017-02-10 14:09:35 -0800:
> > # HG changeset patch
> > # User Jun Wu <quark@fb.com>
> > # Date 1486763791 28800
> > #      Fri Feb 10 13:56:31 2017 -0800
> > # Node ID 1884652829f6e68bde6ab3e9b0e9ca1da9bed40c
> > # Parent  f8eb762559b242ec12b199db1ce27c2930881c75
> > # Available At https://bitbucket.org/quark-zju/hg-draft
> > #              hg pull https://bitbucket.org/quark-zju/hg-draft  -r 1884652829f6
> > lock: include Linux pid namespace identifier in prefix
> >
> > Previously, the lock only contains a hostname as an attempt to detect pid
> > namespace difference. However, that's not enough on modern Linux - a single
> > hostname could have different pid namespaces.
> >
> > That means if people run hg inside different PID namespaces with a same UTS
> > namespae, the lock would be broken - an hg proccess in pid namespace A will
> > think the lock having a "random" pid in pid namespace B is "dead" and remove
> > it.
> >
> > This patch solves the above issue by appending an PID namespace identifier of
> > the current process to the lock prefix ("hostname"). It depends on /proc
> > being mounted properly. But I don't think there is a better way to get pid
> > namespace identifier reliably.
> >
> > diff --git a/mercurial/lock.py b/mercurial/lock.py
> > --- a/mercurial/lock.py
> > +++ b/mercurial/lock.py
> > @@ -10,4 +10,5 @@ from __future__ import absolute_import
> >  import contextlib
> >  import errno
> > +import os
> >  import socket
> >  import time
> > @@ -16,4 +17,5 @@ import warnings
> >  from . import (
> >      error,
> > +    pycompat,
> >      util,
> >  )
> > @@ -23,7 +25,15 @@ def _getlockprefix():
> >
> >      It's useful to detect "dead" processes and remove stale locks with
> > -    confidence. Typically it's just hostname.
> > +    confidence. Typically it's just hostname. On modern linux, we include an
> > +    extra Linux-specific pid namespace identifier.
> >      """
> > -    return socket.gethostname()
> > +    result = socket.gethostname()
> > +    if pycompat.sysplatform.startswith('linux'):
> > +        try:
> > +            result += '/%x' % os.stat('/proc/self/ns/pid').st_ino
> > +        except OSError as ex:
> > +            if ex.errno not in (errno.ENOENT, errno.EACCES, errno.ENOTDIR):
> > +                raise
> > +    return result
> >
> >  class lock(object):
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Jun Wu - Feb. 13, 2017, 9:03 p.m.
Excerpts from Augie Fackler's message of 2017-02-13 14:55:57 -0500:
> On Fri, Feb 10, 2017 at 04:50:55PM -0800, Jun Wu wrote:
> > I'd like to note that although this patch prevents repo corruption when
> > running hg inside different containers (which has different pid namespaces),
> > it does not prevent deadlock - if an hg process is SIGKILL-ed, every other
> > process will not able to take or remove the lock.
> 
> Sigh. Thanks for the fix, queued (and a fist shaken at this weird/poor
> choice from linux containers.)

I think Linux pidns has no choice - pid 1 is special so 1 will collide for
sure. If pids are not unique globally, we will have the issue.

> > I think if we do know the repo is not on NFS, and the system supports
> > flock(), flock() is way more robust and solves all kinds of pain here.
> >
> > I hereby propose a new repo requirement "flock", once set, use flock instead
> > of the traditional lock file. It's off by default. Thoughts?
> 
> I'm...not categorically opposed to it, though it feels pretty risky. I
> know git doesn't use flock() either - presumably there's a good reason
> I don't know about that neither tool relies on it?

I think that's because the troubled use-case is rare. It requires:

  1. Different pid namespaces
  2. Shared mount namespace / subtree

"2" is not satisfied for typical containers. But it happens at Facebook.

I think it could be an extension outside core.
Siddharth Agarwal - Feb. 13, 2017, 9:07 p.m.
On 2/13/17 13:03, Jun Wu wrote:
> I think it could be an extension outside core.


I'd say make it a clone-time option, turned off by default, with caveats 
documented. At FB we ban the situations where flock wouldn't work for 
scalability reasons anyway.

Note that flock requires the file be permanently on disk. We shouldn't 
use .hg/lock and .hg/wlock -- instead, we'll probably have to 
create/assume new files .hg/flock and .hg/fwlock.]
Jun Wu - Feb. 13, 2017, 9:12 p.m.
Excerpts from Siddharth Agarwal's message of 2017-02-13 13:07:08 -0800:
> On 2/13/17 13:03, Jun Wu wrote:
> > I think it could be an extension outside core.
> 
> I'd say make it a clone-time option, turned off by default, with caveats 
> documented. At FB we ban the situations where flock wouldn't work for 
> scalability reasons anyway.

flock should scale better than the file-existence locks - it does not
write physically.

> Note that flock requires the file be permanently on disk. We shouldn't 
> use .hg/lock and .hg/wlock -- instead, we'll probably have to 
> create/assume new files .hg/flock and .hg/fwlock.]

That's not an issue - flock works on directories. So locking ".hg" or
".hg/store" would be feasible.
Simon Farnsworth - Feb. 13, 2017, 10:32 p.m.
On 13/02/2017 19:55, Augie Fackler wrote:
> On Fri, Feb 10, 2017 at 04:50:55PM -0800, Jun Wu wrote:
<snip>
>>
>> I think if we do know the repo is not on NFS, and the system supports
>> flock(), flock() is way more robust and solves all kinds of pain here.
>>
>> I hereby propose a new repo requirement "flock", once set, use flock instead
>> of the traditional lock file. It's off by default. Thoughts?
>
> I'm...not categorically opposed to it, though it feels pretty risky. I
> know git doesn't use flock() either - presumably there's a good reason
> I don't know about that neither tool relies on it?
>

The usual reason is the complete clusterfsck that's flock() on NFS, 
combined with the corporate love of NFS homedirs.

Basically, every significant OS (Linux, macOS, FreeBSD) supports a mode 
where flock() works, but is only respected on the local host (the server 
is never told about the lock). This is a recipe for silent corruption in 
places that use NFSv2 or v3, since there's no way to tell whether locks 
are handled server-side by every client, or if one or more clients are 
handling locks locally and not telling the server about it.

For bonus "excitement", if users have any degree of admin control of 
their own system, they can get a slight performance boost by switching 
locking to local-only (as there's no longer one RTT delay for each 
flock() call). At least, until they forget they did that, and use a 
second host for a concurrent operation...
Augie Fackler - Feb. 13, 2017, 10:35 p.m.
> On Feb 13, 2017, at 17:32, Simon Farnsworth <simonfar@fb.com> wrote:
> 
> On 13/02/2017 19:55, Augie Fackler wrote:
>> On Fri, Feb 10, 2017 at 04:50:55PM -0800, Jun Wu wrote:
> <snip>
>>> 
>>> I think if we do know the repo is not on NFS, and the system supports
>>> flock(), flock() is way more robust and solves all kinds of pain here.
>>> 
>>> I hereby propose a new repo requirement "flock", once set, use flock instead
>>> of the traditional lock file. It's off by default. Thoughts?
>> 
>> I'm...not categorically opposed to it, though it feels pretty risky. I
>> know git doesn't use flock() either - presumably there's a good reason
>> I don't know about that neither tool relies on it?
>> 
> 
> The usual reason is the complete clusterfsck that's flock() on NFS, combined with the corporate love of NFS homedirs.
> 
> Basically, every significant OS (Linux, macOS, FreeBSD) supports a mode where flock() works, but is only respected on the local host (the server is never told about the lock). This is a recipe for silent corruption in places that use NFSv2 or v3, since there's no way to tell whether locks are handled server-side by every client, or if one or more clients are handling locks locally and not telling the server about it.
> 
> For bonus "excitement", if users have any degree of admin control of their own system, they can get a slight performance boost by switching locking to local-only (as there's no longer one RTT delay for each flock() call). At least, until they forget they did that, and use a second host for a concurrent operation...

All of this makes me wary of a config knob that lets users opt in to corruption. I mean, I've heard horror stories about Subversion users setting "SVN_I_LOVE_CORRUPTED_WORKING_COPIES_SO_DISABLE_SLEEP_FOR_TIMESTAMPS" in their environment, only to get upset when the promised corruption occurred...

> -- 
> Simon Farnsworth
Gregory Szorc - Feb. 14, 2017, 12:22 a.m.
> On Feb 13, 2017, at 12:55, Augie Fackler <raf@durin42.com> wrote:
> 
>> On Fri, Feb 10, 2017 at 04:50:55PM -0800, Jun Wu wrote:
>> I'd like to note that although this patch prevents repo corruption when
>> running hg inside different containers (which has different pid namespaces),
>> it does not prevent deadlock - if an hg process is SIGKILL-ed, every other
>> process will not able to take or remove the lock.
> 
> Sigh. Thanks for the fix, queued (and a fist shaken at this weird/poor
> choice from linux containers.)

Process namespaces (see clone(2) man page) are a really nifty security feature. Unfortunately they do have the side effect of invalidating assumptions made since the beginning of UNIX.

This patch is a step in the right direction. However, it's worth calling out that not all containers have a proc filesystem. There have been security vulns due to containers having access to procfs. So it is common to reduce attack surface area by not mounting it.

> 
>> 
>> I think if we do know the repo is not on NFS, and the system supports
>> flock(), flock() is way more robust and solves all kinds of pain here.
>> 
>> I hereby propose a new repo requirement "flock", once set, use flock instead
>> of the traditional lock file. It's off by default. Thoughts?
> 
> I'm...not categorically opposed to it, though it feels pretty risky. I
> know git doesn't use flock() either - presumably there's a good reason
> I don't know about that neither tool relies on it?
> 
>> 
>> Excerpts from Jun Wu's message of 2017-02-10 14:09:35 -0800:
>>> # HG changeset patch
>>> # User Jun Wu <quark@fb.com>
>>> # Date 1486763791 28800
>>> #      Fri Feb 10 13:56:31 2017 -0800
>>> # Node ID 1884652829f6e68bde6ab3e9b0e9ca1da9bed40c
>>> # Parent  f8eb762559b242ec12b199db1ce27c2930881c75
>>> # Available At https://bitbucket.org/quark-zju/hg-draft
>>> #              hg pull https://bitbucket.org/quark-zju/hg-draft  -r 1884652829f6
>>> lock: include Linux pid namespace identifier in prefix
>>> 
>>> Previously, the lock only contains a hostname as an attempt to detect pid
>>> namespace difference. However, that's not enough on modern Linux - a single
>>> hostname could have different pid namespaces.
>>> 
>>> That means if people run hg inside different PID namespaces with a same UTS
>>> namespae, the lock would be broken - an hg proccess in pid namespace A will
>>> think the lock having a "random" pid in pid namespace B is "dead" and remove
>>> it.
>>> 
>>> This patch solves the above issue by appending an PID namespace identifier of
>>> the current process to the lock prefix ("hostname"). It depends on /proc
>>> being mounted properly. But I don't think there is a better way to get pid
>>> namespace identifier reliably.
>>> 
>>> diff --git a/mercurial/lock.py b/mercurial/lock.py
>>> --- a/mercurial/lock.py
>>> +++ b/mercurial/lock.py
>>> @@ -10,4 +10,5 @@ from __future__ import absolute_import
>>> import contextlib
>>> import errno
>>> +import os
>>> import socket
>>> import time
>>> @@ -16,4 +17,5 @@ import warnings
>>> from . import (
>>>     error,
>>> +    pycompat,
>>>     util,
>>> )
>>> @@ -23,7 +25,15 @@ def _getlockprefix():
>>> 
>>>     It's useful to detect "dead" processes and remove stale locks with
>>> -    confidence. Typically it's just hostname.
>>> +    confidence. Typically it's just hostname. On modern linux, we include an
>>> +    extra Linux-specific pid namespace identifier.
>>>     """
>>> -    return socket.gethostname()
>>> +    result = socket.gethostname()
>>> +    if pycompat.sysplatform.startswith('linux'):
>>> +        try:
>>> +            result += '/%x' % os.stat('/proc/self/ns/pid').st_ino
>>> +        except OSError as ex:
>>> +            if ex.errno not in (errno.ENOENT, errno.EACCES, errno.ENOTDIR):
>>> +                raise
>>> +    return result
>>> 
>>> class lock(object):
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel@mercurial-scm.org
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Jun Wu - Feb. 14, 2017, 1:25 a.m.
Excerpts from Gregory Szorc's message of 2017-02-13 17:22:59 -0700:
> 
> > On Feb 13, 2017, at 12:55, Augie Fackler <raf@durin42.com> wrote:
> > 
> >> On Fri, Feb 10, 2017 at 04:50:55PM -0800, Jun Wu wrote:
> >> I'd like to note that although this patch prevents repo corruption when
> >> running hg inside different containers (which has different pid namespaces),
> >> it does not prevent deadlock - if an hg process is SIGKILL-ed, every other
> >> process will not able to take or remove the lock.
> > 
> > Sigh. Thanks for the fix, queued (and a fist shaken at this weird/poor
> > choice from linux containers.)
> 
> Process namespaces (see clone(2) man page) are a really nifty security
> feature. Unfortunately they do have the side effect of invalidating
> assumptions made since the beginning of UNIX.
> 
> This patch is a step in the right direction. However, it's worth calling
> out that not all containers have a proc filesystem. There have been
> security vulns due to containers having access to procfs. So it is common
> to reduce attack surface area by not mounting it.

I'm aware of that. Also note that /proc could be in a different pidns -
without an explicit remount, it just shows things as seen from the parent
pidns:

    str(os.getpid())     != basename(os.readlink('/proc/self'))
    # pid in this pidns,    pid in the original (parent) pidns

Therefore I wrote in the commit message:

    It depends on /proc being mounted properly. But I don't think there is a
    better way to get pid namespace identifier reliably.
Gregory Szorc - Feb. 14, 2017, 3:21 a.m.
> On Feb 13, 2017, at 18:25, Jun Wu <quark@fb.com> wrote:
> 
> Excerpts from Gregory Szorc's message of 2017-02-13 17:22:59 -0700:
>> 
>>>> On Feb 13, 2017, at 12:55, Augie Fackler <raf@durin42.com> wrote:
>>>> 
>>>> On Fri, Feb 10, 2017 at 04:50:55PM -0800, Jun Wu wrote:
>>>> I'd like to note that although this patch prevents repo corruption when
>>>> running hg inside different containers (which has different pid namespaces),
>>>> it does not prevent deadlock - if an hg process is SIGKILL-ed, every other
>>>> process will not able to take or remove the lock.
>>> 
>>> Sigh. Thanks for the fix, queued (and a fist shaken at this weird/poor
>>> choice from linux containers.)
>> 
>> Process namespaces (see clone(2) man page) are a really nifty security
>> feature. Unfortunately they do have the side effect of invalidating
>> assumptions made since the beginning of UNIX.
>> 
>> This patch is a step in the right direction. However, it's worth calling
>> out that not all containers have a proc filesystem. There have been
>> security vulns due to containers having access to procfs. So it is common
>> to reduce attack surface area by not mounting it.
> 
> I'm aware of that. Also note that /proc could be in a different pidns -
> without an explicit remount, it just shows things as seen from the parent
> pidns:
> 
>    str(os.getpid())     != basename(os.readlink('/proc/self'))
>    # pid in this pidns,    pid in the original (parent) pidns
> 
> Therefore I wrote in the commit message:
> 
>    It depends on /proc being mounted properly. But I don't think there is a
>    better way to get pid namespace identifier reliably.

It is a relatively new Linux feature, but see NS_GET_PARENT and NS_GET_USERNS for ioctl(). http://man7.org/linux/man-pages/man7/namespaces.7.html. Of course, the file descriptor passed to ioctl refers to a procfs item, so it doesn't look like this gets us anywhere.

Patch

diff --git a/mercurial/lock.py b/mercurial/lock.py
--- a/mercurial/lock.py
+++ b/mercurial/lock.py
@@ -10,4 +10,5 @@  from __future__ import absolute_import
 import contextlib
 import errno
+import os
 import socket
 import time
@@ -16,4 +17,5 @@  import warnings
 from . import (
     error,
+    pycompat,
     util,
 )
@@ -23,7 +25,15 @@  def _getlockprefix():
 
     It's useful to detect "dead" processes and remove stale locks with
-    confidence. Typically it's just hostname.
+    confidence. Typically it's just hostname. On modern linux, we include an
+    extra Linux-specific pid namespace identifier.
     """
-    return socket.gethostname()
+    result = socket.gethostname()
+    if pycompat.sysplatform.startswith('linux'):
+        try:
+            result += '/%x' % os.stat('/proc/self/ns/pid').st_ino
+        except OSError as ex:
+            if ex.errno not in (errno.ENOENT, errno.EACCES, errno.ENOTDIR):
+                raise
+    return result
 
 class lock(object):