Patchwork [2,of,4] revlog: add a context manager to allow file handle reuse

login
register
mail settings
Submitter Gregory Szorc
Date Nov. 2, 2016, 1:16 a.m.
Message ID <d631065a702fa7eb9562.1478049397@ubuntu-vm-main>
Download mbox | patch
Permalink /patch/17256/
State Superseded
Headers show

Comments

Gregory Szorc - Nov. 2, 2016, 1:16 a.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1477099818 25200
#      Fri Oct 21 18:30:18 2016 -0700
# Node ID d631065a702fa7eb956258e2289679d5902ccff6
# Parent  fb93d9a0a24db5a93a6a6758eacc6ba5ca37531e
revlog: add a context manager to allow file handle reuse

Currently, read-only operations traversing revlogs must open a new
file handle whenever they need uncached data from the underlying
revlog. This can add overhead to operations such as changegroup
generation.

The revlog APIs have a mechanism for reusing a file descriptor
for I/O. This was added by me a while ago as a means to speed up
revlog.addgroup(). At that time, I didn't do any work around
reusing file handles for read operations.

This patch introduces a context manager to cache an open file handle
on a revlog. When the context manager is active, revlog reads will
be routed to the opened file handle instead of opening a single
use file handle.

There is definitely room to improve the API. We could probably
even refactor the write file descriptor caching to use a context
manager - possibly the same one! However, this is a bit of work
since the write file handle can be swapped out if a revlog
transitions from inline to non-inline in the course of adding
revisions.
Yuya Nishihara - Nov. 4, 2016, 7:43 a.m.
On Tue, 01 Nov 2016 18:16:37 -0700, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1477099818 25200
> #      Fri Oct 21 18:30:18 2016 -0700
> # Node ID d631065a702fa7eb956258e2289679d5902ccff6
> # Parent  fb93d9a0a24db5a93a6a6758eacc6ba5ca37531e
> revlog: add a context manager to allow file handle reuse

> +    @contextlib.contextmanager
> +    def cachefilehandle(self):
> +        """Maintain a persistent file handle during operations.
> +
> +        When this context manager is active, a file descriptor will be reused
> +        for all read operations, ensuring the underlying revlog file isn't
> +        reopened multiple times.
> +        """
> +        if self._readfh:
> +            raise error.Abort('cachefilehandle already active')
> +
> +        # Inline revlogs have data chunks cached at open time. If we opened
> +        # a file handle it wouldn't be read.
> +        if self._inline:
> +            yield
> +            return
> +
> +        try:
> +            self._readfh = self.opener(self.datafile)
> +            yield
> +        finally:
> +            self._readfh.close()
> +            self._readfh = None

This would crash if open() failed. It should be:

  self._readfh = self.opener(self.datafile)
  try:
      yield
  finally:
      self._readfh.close()
      self._readfh = None

Should I fix it in flight? The other changes look good to me.

Patch

diff --git a/mercurial/revlog.py b/mercurial/revlog.py
--- a/mercurial/revlog.py
+++ b/mercurial/revlog.py
@@ -14,6 +14,7 @@  and O(changes) merge between branches.
 from __future__ import absolute_import
 
 import collections
+import contextlib
 import errno
 import hashlib
 import os
@@ -314,6 +315,8 @@  class revlog(object):
         # revnum -> (chain-length, sum-delta-length)
         self._chaininfocache = {}
 
+        self._readfh = None
+
     def tip(self):
         return self.node(len(self.index) - 2)
     def __contains__(self, rev):
@@ -1018,6 +1021,30 @@  class revlog(object):
         else:
             self._chunkcache = offset, data
 
+    @contextlib.contextmanager
+    def cachefilehandle(self):
+        """Maintain a persistent file handle during operations.
+
+        When this context manager is active, a file descriptor will be reused
+        for all read operations, ensuring the underlying revlog file isn't
+        reopened multiple times.
+        """
+        if self._readfh:
+            raise error.Abort('cachefilehandle already active')
+
+        # Inline revlogs have data chunks cached at open time. If we opened
+        # a file handle it wouldn't be read.
+        if self._inline:
+            yield
+            return
+
+        try:
+            self._readfh = self.opener(self.datafile)
+            yield
+        finally:
+            self._readfh.close()
+            self._readfh = None
+
     def _loadchunk(self, offset, length, df=None):
         """Load a segment of raw data from the revlog.
 
@@ -1029,6 +1056,12 @@  class revlog(object):
 
         Returns a str or buffer of raw byte data.
         """
+        # Use cached file handle from context manager automatically.
+        # self._readfh is always opened in read-only mode. df may be opened
+        # in append mode. The context manager should only be active during
+        # read operations. So this mismatch is OK.
+        df = df or self._readfh
+
         if df is not None:
             closehandle = False
         else: