Patchwork D7199: lock: fix race in lock-breaking code

Submitter: phabricator
Date: Nov. 2, 2019, 1:51 a.m.
Message ID: <differential-rev-PHID-DREV-q52b6kx2yrlvivobahzr-req@mercurial-scm.org>
Permalink: /patch/42679/
State: Superseded

Comments

phabricator - Nov. 2, 2019, 1:51 a.m.
valentin.gatienbaron created this revision.
Herald added a subscriber: mercurial-devel.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  With low frequency, I see hg pulls fail with output like:
  
    abort: no such file or directory: .hg/store/lock
  
  I think what happens is, in lock.py, in:
  
    def _testlock(self, locker):
        if not self._lockshouldbebroken(locker):
            return locker
  
        # if locker dead, break lock.  must do this with another lock
        # held, or can race and break valid lock.
        # MARK
        try:
            with lock(self.vfs, self.f + b'.break', timeout=0):
                self.vfs.unlink(self.f)
        except error.LockError:
            return locker
  
  If a lock is breakable on disk and two hg processes concurrently reach
  MARK, a possible interleaving is: process1 runs the rest of the function
  to completion, and only then does process2 do the same. When that
  happens, process2 either gets ENOENT from self.vfs.unlink (resulting in
  the spurious failure above), or breaks a lock that was validly
  re-acquired in the meantime, potentially causing repository corruption.
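  
  To make the interleaving concrete, here is a minimal standalone sketch
  (not Mercurial code; the names are illustrative): two threads stand in
  for the two hg processes, a threading.Lock stands in for the `.break`
  file lock, and a barrier forces both threads past the unprotected check
  before either enters the critical section, so the loser hits the ENOENT
  case deterministically.
  
    import os
    import tempfile
    import threading
  
    lockfile = os.path.join(tempfile.mkdtemp(), 'lock')
    open(lockfile, 'w').close()       # a stale lock left by a dead process
    guard = threading.Lock()          # stands in for the .break file lock
    ready = threading.Barrier(2)      # forces the racy interleaving
  
    def break_stale_lock(name):
        if not os.path.exists(lockfile):  # check happens OUTSIDE the guard
            return
        ready.wait()                      # both threads have passed the check
        with guard:                       # they enter the section in turn
            try:
                os.unlink(lockfile)
                print(name, 'broke the lock')
            except FileNotFoundError:
                print(name, 'ENOENT: the lock is already gone')
  
    threads = [threading.Thread(target=break_stale_lock, args=(n,))
               for n in ('process1', 'process2')]
    for t in threads:
        t.start()
    for t in threads:
        t.join()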
  
  The fix is simple enough: check that the lock is breakable _inside_ the
  critical section, because only then can we know that no other process
  can invalidate our knowledge of the lock on disk.
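  
  Applied to the sketch above, the fix amounts to repeating the check once
  the guard is held (the real patch re-reads the on-disk locker and calls
  _lockshouldbebroken again; this toy version just re-checks existence):
  
    def break_stale_lock_fixed(name):
        if not os.path.exists(lockfile):
            return
        ready.wait()
        with guard:
            if not os.path.exists(lockfile):  # re-check INSIDE the section
                print(name, 'lock already broken; nothing to do')
                return
            os.unlink(lockfile)
            print(name, 'broke the lock')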
  
  I don't think there are tests for this. I've tested this manually
  with:
  
  diff --git a/mercurial/lock.py b/mercurial/lock.py
  --- a/mercurial/lock.py
  +++ b/mercurial/lock.py
  @@ -351,6 +351,8 @@ class lock(object):
           if not self._lockshouldbebroken(locker):
               return locker
  
  +        import random
  +        time.sleep(1. + random.random())
           # if locker dead, break lock.  must do this with another lock
           # held, or can race and break valid lock.
           try:
  @@ -358,6 +360,7 @@ class lock(object):
                   self.vfs.unlink(self.f)
           except error.LockError:
               return locker
  +        time.sleep(1)
  
       def testlock(self):
           """return id of locker if lock is valid, else None.
  
  and I see this change of behavior before/after this commit:
  
   $ $hg init repo
   $ cd repo
   $ ln -s $HOSTNAME/effffffc:987654321 .hg/wlock
   $ touch a
   $ $hg commit -Am_ & $hg commit -Am _; wait
  -abort: No such file or directory: '/tmp/repo/.hg/wlock'
   adding a
  +warning: ignoring unknown working parent 679a8959a8ca!
  +nothing changed

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D7199

AFFECTED FILES
  mercurial/lock.py

To: valentin.gatienbaron, #hg-reviewers
Cc: mercurial-devel
phabricator - Nov. 19, 2019, 4:01 a.m.
This revision is now accepted and ready to land.
indygreg added a comment.
indygreg accepted this revision.


  Thank you for tracking down this issue and for the detailed commit message!
  
  I agree there is a race between two processes creating the lock, and that we need to verify the created lock once we have it to ensure we "won" the file-write race.
  
  FWIW there are known issues with Mercurial's repository locking semantics. I believe we will need to break Mercurial's on-disk file layout to fix all the problems. But (hard-to-debug) issues like the one fixed in this patch do exist and can be fixed without breaking backwards compatibility. If this subject interests you, I suggest reading https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf which chronicles how many popular pieces of software fail to get filesystem consistency correct due to incorrect use of locks, `fsync()`, etc.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7199/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7199

To: valentin.gatienbaron, #hg-reviewers, indygreg
Cc: indygreg, mercurial-devel
phabricator - Nov. 19, 2019, 5:06 a.m.
valentin.gatienbaron added a comment.


  Thanks for the review. Did something go wrong in a rebase? I'm confused as to why there are two `self.vfs.unlink` calls in the diff displayed by phabricator.
  
  I have read the paper you mention, and I have seen repositories get corrupted for these kinds of reasons (the revlog index's length may get written without the corresponding data if the machine reboots, or the transaction commits but fncache gets truncated when one runs out of disk space). But I don't think it talks about locking problems in the absence of crashes, which is what is happening here. The ordering of in-memory filesystem operations is stricter.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7199/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7199

To: valentin.gatienbaron, #hg-reviewers, indygreg
Cc: indygreg, mercurial-devel

Patch

diff --git a/mercurial/lock.py b/mercurial/lock.py
--- a/mercurial/lock.py
+++ b/mercurial/lock.py
@@ -355,6 +355,9 @@ 
         # held, or can race and break valid lock.
         try:
             with lock(self.vfs, self.f + b'.break', timeout=0):
+                locker = self._readlock()
+                if not self._lockshouldbebroken(locker):
+                    return locker
                 self.vfs.unlink(self.f)
         except error.LockError:
             return locker