Patchwork [01 of 14] cache: introduce an abstract class for caches we can upgrade incrementally

Submitter Boris Feld
Date July 9, 2017, 5:52 p.m.
Message ID <6edb62505c697329de03.1499622727@FB>
Permalink /patch/22161/
State Superseded

Comments

Boris Feld - July 9, 2017, 5:52 p.m.
# HG changeset patch
# User Boris Feld <boris.feld@octobus.net>
# Date 1499458441 -7200
#      Fri Jul 07 22:14:01 2017 +0200
# Node ID 6edb62505c697329de034c2fdc47befd5896f31f
# Parent  89796a25d4bb91fb418ad3e70faad2c586902ffb
# EXP-Topic obs-cache
cache: introduce an abstract class for caches we can upgrade incrementally

Right now, each cache class implements its own mechanism for validation and
update. We start introducing an abstract class to ultimately allow more
unification of the cache code.

The end goal of this series is to introduce a cache for some obsolescence
properties, not the abstract cache class itself. However, taking advantage
of adding a new cache to introduce the abstract class seems like a win.
Gregory Szorc - July 9, 2017, 8:32 p.m.
On Sun, Jul 9, 2017 at 10:52 AM, Boris Feld <boris.feld@octobus.net> wrote:

> # HG changeset patch
> # User Boris Feld <boris.feld@octobus.net>
> # Date 1499458441 -7200
> #      Fri Jul 07 22:14:01 2017 +0200
> # Node ID 6edb62505c697329de034c2fdc47befd5896f31f
> # Parent  89796a25d4bb91fb418ad3e70faad2c586902ffb
> # EXP-Topic obs-cache
> cache: introduce an abstract class for caches we can upgrade incrementally
>
> Right now, each cache class implements its own mechanism for validation and
> update. We start introducing an abstract class to ultimately allow more
> unification of the cache code.
>
> The end goal of this series is to introduce a cache for some obsolescence
> properties, not the abstract cache class itself. However, taking advantage
> of adding a new cache to introduce the abstract class seems like a win.
>

I don't like saying this, but given the amount of discussion these patches
generated the last time they were proposed and given the proximity to the
code freeze, I doubt these will land in 4.3.

Would you be willing to voluntarily withdraw them from consideration for
4.3? Or, if there are less contentious patches or functionality that you
would like to land in 4.3, would you be willing to pare down the series to
the simpler, less contentious patches so we can focus on 4.3?


Boris Feld - July 10, 2017, 1:22 p.m.
On Sun, 2017-07-09 at 13:32 -0700, Gregory Szorc wrote:
> On Sun, Jul 9, 2017 at 10:52 AM, Boris Feld <boris.feld@octobus.net>
> wrote:
> > # HG changeset patch
> > # User Boris Feld <boris.feld@octobus.net>
> > # Date 1499458441 -7200
> > #      Fri Jul 07 22:14:01 2017 +0200
> > # Node ID 6edb62505c697329de034c2fdc47befd5896f31f
> > # Parent  89796a25d4bb91fb418ad3e70faad2c586902ffb
> > # EXP-Topic obs-cache
> > cache: introduce an abstract class for caches we can upgrade
> > incrementally
> > 
> > Right now, each cache class implements its own mechanism for
> > validation and update. We start introducing an abstract class to
> > ultimately allow more unification of the cache code.
> > 
> > The end goal of this series is to introduce a cache for some
> > obsolescence properties, not the abstract cache class itself.
> > However, taking advantage of adding a new cache to introduce the
> > abstract class seems like a win.
> 
> I don't like saying this, but given the amount of discussion these
> patches generated the last time they were proposed and given the
> proximity to the code freeze, I doubt these will land in 4.3.
> 
> Would you be willing to voluntarily withdraw them from consideration
> for 4.3? Or, if there are less contentious patches or functionality
> that you would like to land in 4.3, would you be willing to pare down
> the series to the simpler, less contentious patches so we can focus
> on 4.3?

We are aware that these changes have generated a lot of discussion. We
tried to make the code more flexible in order to integrate better with
the other work in this area.

This series is not just about the performance win. We already have a
similar cache in the evolve extension, but the quality of the
implementation is limited because it is done from an extension. This
series introduces various mechanisms to improve the consistency of the
cache (eg: patches 3, 4 and 5). Having these improvements in core helps
us have more robust caches in the evolve extension. These caches are
important to our current work on obsmarkers discovery. Having these
improvements available to users 3 months earlier (4.3 instead of 4.4)
matters to us.

Having a clear abstract cache API also helps obsdiscovery progress,
since it will be easier for us to experiment with different caching
strategies. We could focus more on improvements if we had patches 1 to
7 available.

That's why the first part of the series is quite independent of the
obsstore cache. These changesets mainly introduce a common API for
cache classes, grab two commits from Jun and introduce the various
abstract base classes. We have a couple of target users of these in
evolve already, and maybe they could be helpful for Jun's work on
radixlink (we have a simple version of radixlink based on this cache
approach if needed).

We would be happy if at least the first 7 changesets could be queued
and integrated into the 4.3 release, and I would be happy to discuss
ways to make that happen.

The advantage of having obscache in the series is that it provides a
simple user (140 lines) of all these abstract classes (instead of
adding dead code). We can send a V2 that hides the obscache behind an
experimental flag. We can then drop the code if something better lands
during 4.4.

Jun, could you take a look at the first 7 changesets and give us your
feedback about the obsstoresourcebase class: is it helpful for your
work on radixlink?

Jun Wu - July 10, 2017, 4:44 p.m.
Excerpts from Boris Feld's message of 2017-07-10 15:22:19 +0200:
> Jun, could you take a look at the first 7 changesets and give us your
> feedback about the obsstoresourcebase class: is it helpful for your
> work on radixlink?

I think it's overcomplicated for my use case. With the obsstore format
change, it could be 2 lines to detect a strip, and another few lines using
the length to incrementally build the index.

If both the index and the obsstore are native code for performance, then the
Python code cannot be reused.
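
For illustration, here is a rough, standalone sketch of that length-keyed
control flow. The marker tuples and the `byprec` mapping are made-up
stand-ins, not the real obsstore or radixlink structures:

class lengthkeyedindex(object):
    """Toy index kept in sync with an append-only list of markers.

    The only cache key is how many markers are already covered, so strip
    detection and the incremental update each take a couple of lines.
    """

    def __init__(self):
        self.covered = 0   # number of markers already indexed
        self.byprec = {}   # toy index: precursor -> list of successor tuples

    def sync(self, markers):
        # markers is an append-only sequence of (precursor, successors) pairs
        if len(markers) < self.covered:     # source shrank: strip detected
            self.covered, self.byprec = 0, {}
        for prec, succs in markers[self.covered:]:  # only index the new tail
            self.byprec.setdefault(prec, []).append(succs)
        self.covered = len(markers)

Calling sync() again after appending markers only touches the new tail;
truncating the list triggers a full rebuild, which is the strip case.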
Sean Farley - July 11, 2017, 11:02 p.m.
Gregory Szorc <gregory.szorc@gmail.com> writes:

> On Sun, Jul 9, 2017 at 10:52 AM, Boris Feld <boris.feld@octobus.net> wrote:
>
>> # HG changeset patch
>> # User Boris Feld <boris.feld@octobus.net>
>> # Date 1499458441 -7200
>> #      Fri Jul 07 22:14:01 2017 +0200
>> # Node ID 6edb62505c697329de034c2fdc47befd5896f31f
>> # Parent  89796a25d4bb91fb418ad3e70faad2c586902ffb
>> # EXP-Topic obs-cache
>> cache: introduce an abstract class for caches we can upgrade incrementally
>>
>> Right now, each cache class implements its own mechanism for validation and
>> update. We start introducing an abstract class to ultimately allow more
>> unification of the cache code.
>>
>> The end goal of this series is to introduce a cache for some obsolescence
>> properties, not the abstract cache class itself. However, taking advantage
>> of adding a new cache to introduce the abstract class seems like a win.
>>
>
> I don't like saying this, but given the amount of discussion these patches
> generated the last time they were proposed and given the proximity to the
> code freeze, I doubt these will land in 4.3.
>
> Would you be willing to voluntarily withdraw them from consideration for
> 4.3? Or, if there are less contentious patches or functionality that you
> would like to land in 4.3, would you be willing to pare down the series to
> the simpler, less contentious patches so we can focus on 4.3?

I'm also fine with waiting until after 4.3, since the freeze is a week away.
Taking this in another direction, as an extension author I have long wanted a
generic cache object (both in remotenames and hg-git, but in other extensions
as well). So having a refactored cache class that both tags and branches
could use would indeed be a win (and then maybe finding an extension that
could use it as well).

The obs part of this discussion can wait until after the freeze.

Patch

diff -r 89796a25d4bb -r 6edb62505c69 mercurial/cache.py
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/mercurial/cache.py	Fri Jul 07 22:14:01 2017 +0200
@@ -0,0 +1,127 @@ 
+# cache.py - utilities for caching
+#
+# Copyright 2017 Octobus SAS <contact@octobus.net>
+#
+# This software may be used and distributed according to the terms of the
+# GNU General Public License version 2 or any later version.
+from __future__ import absolute_import
+
+import abc
+import struct
+
+from . import (
+    util,
+)
+
+class incrementalcachebase(object):
+    """base class for incremental cache from append only source
+
+    There are multiple append only data source we might want to cache
+    computation from. One of the common pattern is to track the state of the
+    file and update the cache with the extra data (eg: branchmap-cache tracking
+    changelog). This pattern also needs to detect when a the source is striped
+
+    The overall pattern is similar whatever the actual source is. This class
+    introduces the basic patterns.
+    """
+
+    __metaclass__ = abc.ABCMeta
+
+    # default key used for an empty cache
+    emptykey = ()
+
+    _cachekeyspec = '' # used for serialization
+    _cachename = None # used for debug message
+
+    @abc.abstractmethod
+    def __init__(self):
+        super(incrementalcachebase, self).__init__()
+        self._cachekey = None
+
+    @util.propertycache
+    def _cachekeystruct(self):
+        # dynamic property so that subclasses can override it
+        return struct.Struct('>' + self._cachekeyspec)
+
+    @util.propertycache
+    def _cachekeysize(self):
+        # dynamic property so that subclasses can override it
+        return self._cachekeystruct.size
+
+    @abc.abstractmethod
+    def _updatefrom(self, repo, data):
+        """override this method to update you date from incrementally read data.
+
+        Content of <data> will depends of the sources.
+        """
+        raise NotImplementedError
+
+    @abc.abstractmethod
+    def clear(self, reset=False):
+        """invalidate the cache content
+
+        If 'reset' is passed, we detected a strip and the cache will have to be
+        recomputed.
+
+        Subclasses MUST override this method to actually affect the cache data.
+        """
+        if reset:
+            self._cachekey = self.emptykey
+        else:
+            self._cachekey = None
+
+    @abc.abstractmethod
+    def load(self, repo):
+        """Load data from disk
+
+        Subclasses MUST restore the "cachekey" attribute while doing so.
+        """
+        raise NotImplementedError
+
+    @abc.abstractmethod
+    def _fetchupdatedata(self, repo):
+        """Check the source for possible changes and return necessary data
+
+        The return value is a three-element tuple: reset, data, cachekey
+
+        * reset: `True` when a strip is detected and the cache needs to be reset
+        * data: new data to take into account; its type depends on the source
+        * cachekey: the cache key covering <data> and the previously covered data
+        """
+        raise NotImplementedError
+
+    # Useful "public" function (no need to override them)
+
+    def update(self, repo):
+        """update the cache with new repository data
+
+        The update will be incremental when possible"""
+        repo = repo.unfiltered()
+
+        # If we do not have any data, try loading from disk
+        if self._cachekey is None:
+            self.load(repo)
+
+        reset, data, newkey = self._fetchupdatedata(repo)
+        if newkey == self._cachekey:
+            return
+        if reset or self._cachekey is None:
+            repo.ui.log('cache', 'strip detected, %s cache reset\n'
+                        % self._cachename)
+            self.clear(reset=True)
+
+        starttime = util.timer()
+        self._updatefrom(repo, data)
+        duration = util.timer() - starttime
+        repo.ui.log('cache', 'updated %s in %.4f seconds\n',
+                    self._cachename, duration)
+
+        self._cachekey = newkey
+
+    def _serializecachekey(self):
+        """provide a bytes version of the cachekey"""
+        return self._cachekeystruct.pack(*self._cachekey)
+
+    def _deserializecachekey(self, data):
+        """read the cachekey from bytes"""
+        return self._cachekeystruct.unpack(data)
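
For illustration, here is a standalone toy example mirroring the contract of
the class above on a plain append-only text file. The linecountcache class
and its cache-file layout are invented for this sketch and are not part of
the patch:

import os
import struct

class linecountcache(object):
    """Toy cache counting lines in an append-only text file.

    It mirrors the incrementalcachebase contract: the cache key is the
    covered file size (packed with struct) and an update only reads the
    bytes appended since the previous run.
    """

    emptykey = (0,)
    _cachekeystruct = struct.Struct('>Q')   # covered size, big endian

    def __init__(self, path):
        self._path = path                    # the append-only source
        self._cachepath = path + '.linecount'
        self._cachekey = None                # None means "not loaded yet"
        self.linecount = 0

    def clear(self, reset=False):
        # forget the cached value; with reset=True, restart from the empty key
        self.linecount = 0
        self._cachekey = self.emptykey if reset else None

    def load(self):
        # restore (cachekey, value) from disk, falling back to the empty state
        try:
            with open(self._cachepath, 'rb') as f:
                raw = f.read(self._cachekeystruct.size)
                self._cachekey = self._cachekeystruct.unpack(raw)
                self.linecount = int(f.read().decode('ascii'))
        except (IOError, OSError, ValueError, struct.error):
            self.clear(reset=True)

    def _fetchupdatedata(self):
        # return (reset, data, newkey); reset is True when the source shrank,
        # the analogue of a strip, and data is the newly appended bytes
        size = os.path.getsize(self._path) if os.path.exists(self._path) else 0
        covered = self._cachekey[0] if self._cachekey else 0
        reset = size < covered
        start = 0 if reset else covered
        data = b''
        if os.path.exists(self._path):
            with open(self._path, 'rb') as f:
                f.seek(start)
                data = f.read()
        return reset, data, (size,)

    def _updatefrom(self, data):
        self.linecount += data.count(b'\n')

    def update(self):
        # same control flow as incrementalcachebase.update(), minus repo/ui
        if self._cachekey is None:
            self.load()
        reset, data, newkey = self._fetchupdatedata()
        if newkey == self._cachekey:
            return
        if reset:
            self.clear(reset=True)
        self._updatefrom(data)
        self._cachekey = newkey
        with open(self._cachepath, 'wb') as f:
            f.write(self._cachekeystruct.pack(*self._cachekey))
            f.write(str(self.linecount).encode('ascii'))

Appending to the file and calling update() again only reads the newly added
bytes; truncating the file takes the reset path, mirroring the strip handling
above.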