Patchwork [4,of,6,frozen-repos] localrepo: ability to obtain a read-only snapshot of a repo

login
register
mail settings
Submitter Gregory Szorc
Date Nov. 22, 2015, 1:14 a.m.
Message ID <888c2171adffa8340406.1448154841@ubuntu-main>
Download mbox | patch
Permalink /patch/11568/
State Deferred
Delegated to: Pierre-Yves David
Headers show

Comments

Gregory Szorc - Nov. 22, 2015, 1:14 a.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1448154001 28800
#      Sat Nov 21 17:00:01 2015 -0800
# Node ID 888c2171adffa8340406b50aae02375f7bef50f4
# Parent  f559e16232491e44fc4dac6c12872cb2e9164d27
localrepo: ability to obtain a read-only snapshot of a repo

As part of performing some performance profiling, I noticed that a lot
of time can be spent resolving repoview.changelog (the changelog must
be resolved when performing repo[x] for example). The reason is that
the changelog lookup function needs to validate that the cached set
of revisions exposed to the view is still accurate. While there are
comments in repoview.py hinting at future optimizations to this
code, I think the ideal solution is to avoid having to validate a
cache altogether.

Many repository operations are short lived and/or assume they are
operating on a consistent snapshot of the repository. However, we
don't really ensure consistency anywhere. It is possible for someone
to e.g. start a revset query (which is lazy), perform a commit, then
continue iterating over the revset. This could produce incorrect
results due to the changelog mutating over the course of the query.

This patch introduces a freezerepo() function which returns a new
repository instance which is read only and is a snapshot of the
state of the repository at the time it was passed in.

Writing to the frozen repo is prevented by overriding methods used to
perform common write operations. Not all writes are likely prevented,
but it's better than nothing. In addition, some other functions related
to mutating repo state will result in aborts when called as well.

The snapshot of the repository is currently implemented by replacing
the changelog lookup function with a resolved changelog instance.
Accessing the changelog on a frozen repo is thus a normal property
lookup and not a function call. The performance impact will be
demonstrated in subsequent patches.

There is an inline TODO regarding making unfiltered repos frozen.
I'm not sure why a few random tests are complaining about unknown
working copy parent when we freeze the returned unfiltered repo.
This is follow up work for another day.

The localrepository class also gained a helper function to produce
a frozen repo. I know we don't like more methods on this class. But
frozen repos are similar to filtered repos. I anticipate we'll be
operating on frozen repos a lot in the near future. And localrepo.py
has import cycle considerations. So I think having the convenience
method is justified.
Pierre-Yves David - Nov. 23, 2015, 5:23 a.m.
On 11/21/2015 05:14 PM, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1448154001 28800
> #      Sat Nov 21 17:00:01 2015 -0800
> # Node ID 888c2171adffa8340406b50aae02375f7bef50f4
> # Parent  f559e16232491e44fc4dac6c12872cb2e9164d27
> localrepo: ability to obtain a read-only snapshot of a repo
>
> As part of performing some performance profiling, I noticed that a lot
> of time can be spent resolving repoview.changelog (the changelog must
> be resolved when performing repo[x] for example). The reason is that
> the changelog lookup function needs to validate that the cached set
> of revisions exposed to the view is still accurate. While there are
> comments in repoview.py hinting at future optimizations to this
> code, I think the ideal solution is to avoid having to validate a
> cache altogether.

While I like the idea of having frozen repository, it introduce an 
additional level of complexity (somewhere already not simple) and will 
probably take some time to right and properly performing.

Can we look at way to speed up the cache validation logic (or get a 
better approach for invalidating it, eg: nothing really change until we 
have a transaction).

Speeding this would have a massive impact all across the board without 
changing much of the current architecture.

I'll look at the implementation of the frozen repo more in details later.
Gregory Szorc - Dec. 4, 2015, 11:45 p.m.
On Sun, Nov 22, 2015 at 9:23 PM, Pierre-Yves David <
pierre-yves.david@ens-lyon.org> wrote:

>
>
> On 11/21/2015 05:14 PM, Gregory Szorc wrote:
>
>> # HG changeset patch
>> # User Gregory Szorc <gregory.szorc@gmail.com>
>> # Date 1448154001 28800
>> #      Sat Nov 21 17:00:01 2015 -0800
>> # Node ID 888c2171adffa8340406b50aae02375f7bef50f4
>> # Parent  f559e16232491e44fc4dac6c12872cb2e9164d27
>> localrepo: ability to obtain a read-only snapshot of a repo
>>
>> As part of performing some performance profiling, I noticed that a lot
>> of time can be spent resolving repoview.changelog (the changelog must
>> be resolved when performing repo[x] for example). The reason is that
>> the changelog lookup function needs to validate that the cached set
>> of revisions exposed to the view is still accurate. While there are
>> comments in repoview.py hinting at future optimizations to this
>> code, I think the ideal solution is to avoid having to validate a
>> cache altogether.
>>
>
> While I like the idea of having frozen repository, it introduce an
> additional level of complexity (somewhere already not simple) and will
> probably take some time to right and properly performing.
>
> Can we look at way to speed up the cache validation logic (or get a better
> approach for invalidating it, eg: nothing really change until we have a
> transaction).
>
> Speeding this would have a massive impact all across the board without
> changing much of the current architecture.
>
> I'll look at the implementation of the frozen repo more in details later.


Pierre-Yves just sent a refactor to repoview to remove a lot of the
overhead for .changelog lookup. His series seems to have the same
performance impact as my series. And his series doesn't introduce a
complicated new concept of "frozen repos" which would need to be utilized
all over the code base to become effective.

While I like the idea of a separate class for a read-only snapshot of the
state of a repository, I think that is an idea left for another day since
the main benefit (performance) is achieved through simpler means.

Patch

diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
--- a/mercurial/localrepo.py
+++ b/mercurial/localrepo.py
@@ -413,16 +413,24 @@  class localrepository(object):
     def filtered(self, name):
         """Return a filtered version of a repository"""
         # build a new class with the mixin and the current class
         # (possibly subclass of the repo)
         class proxycls(repoview.repoview, self.unfiltered().__class__):
             pass
         return proxycls(self, name)
 
+    def frozen(self):
+        """Obtain a read-only snapshot of the repository."""
+        # Don't create a new object if we're already frozen.
+        if getattr(self, '_frozencl', None):
+            return self
+
+        return freezerepo(self)
+
     @repofilecache('bookmarks')
     def _bookmarks(self):
         return bookmarks.bmstore(self)
 
     @repofilecache('bookmarks.current')
     def _activebookmark(self):
         return bookmarks.readactive(self)
 
@@ -1931,8 +1939,81 @@  def undoname(fn):
     assert name.startswith('journal')
     return os.path.join(base, name.replace('journal', 'undo', 1))
 
 def instance(ui, path, create):
     return localrepository(ui, util.urllocalpath(path), create)
 
 def islocal(path):
     return True
+
+def freezerepo(repo):
+    """Obtain a read-only snapshot of a repository.
+
+    Returns a new repository instance based on the existing one that
+    is read-only and frozen in time. Lookups that would normally need
+    o be validated against change are not performed. This property makes
+    frozen repository instances suitable for read-only queries where
+    performance is important.
+
+    Any active repo filter, if any, will be preserved in the returned
+    instance.
+    """
+    msg = _('%s should not be called on frozen repos')
+
+    class frozenrepo(repo.__class__):
+        def __init__(self, orig):
+            # We need to call the constructor for repoview if we inherit
+            # from one.
+            if isinstance(orig, repoview.repoview):
+                repoview.repoview.__init__(self, orig, orig.filtername)
+
+            # Stash away a reference to the original repo so we can proxy.
+            object.__setattr__(self, '_unfrozenrepo', orig)
+
+            # changelog object lookup is typically dynamic. This is not
+            # needed on frozen repos. So resolve the changelog once and
+            # cache it.
+            object.__setattr__(self, '_frozencl', orig.changelog)
+
+        @property
+        def changelog(self):
+            return self._frozencl
+
+        # We need to override functions that return repos to returned frozen
+        # repos as well.
+        def unfiltered(self):
+            # TODO this should return a frozen repo. However, a few tests are
+            # failing due to unknown parent of working directory state.
+            return self._unfrozenrepo.unfiltered()
+
+        def filtered(self, name):
+            current = getattr(self, 'filtername', None)
+            if current == name:
+                return self
+
+            return freezerepo(self._unfrozenrepo.unfiltered().filtered(name))
+
+        # Proxy all operations to original repo.
+        def __getattr__(self, attr):
+            return getattr(self._unfrozenrepo, attr)
+
+        def __setattr__(self, attr, value):
+            return setattr(self._unfrozenrepo, attr, value)
+
+        def __delattr__(self, attr):
+            return delattr(self._unfrozenrepo, attr)
+
+        # Catch calls to functions that should never be called on frozen
+        # repo instances.
+        def lock(self, *args, **kwargs):
+            raise error.Abort(msg % 'lock()')
+
+        def transaction(self, *args, **kwargs):
+            raise error.Abort(msg % 'transaction()')
+
+        def commit(self, *args, **kwargs):
+            raise error.Abort(msg % 'commit()')
+
+        def destroying(self):
+            raise error.Abort(msg % 'destroying()')
+
+    return frozenrepo(repo)