Patchwork [5,of,8] obscache: add a cache for 1/2 of the "obsolete" property

login
register
mail settings
Submitter Pierre-Yves David
Date May 19, 2017, 2:28 p.m.
Message ID <c2b1c81b3c4cba8cc21b.1495204084@nodosa.octopoid.net>
Download mbox | patch
Permalink /patch/20710/
State Changes Requested
Headers show

Comments

Pierre-Yves David - May 19, 2017, 2:28 p.m.
# HG changeset patch
# User Pierre-Yves David <pierre-yves.david@octobus.net>
# Date 1495197986 -7200
#      Fri May 19 14:46:26 2017 +0200
# Node ID c2b1c81b3c4cba8cc21b1a1b92f4e34e6aa1807a
# Parent  b5c984cd0640255b94f22d56a4bfe398b2c5de2e
# EXP-Topic obscache
# Available At https://www.mercurial-scm.org/repo/users/marmoute/mercurial/
#              hg pull https://www.mercurial-scm.org/repo/users/marmoute/mercurial/ -r c2b1c81b3c4c
obscache: add a cache for 1/2 of the "obsolete" property

Knowing if a changeset is obsolete requires two data:

1) is the change non-public,
2) is the changeset affected by any obsolescence markers.

The phase related data has some fast implementation already. However, the
obsmarkers based property currently requires to parse the whole obsstore, a slow
operation.

That information is monotonic (new changeset are not affected, and once they are
affected, they will be for ever), making it is easy to cache. We introduce a new
class dedicated to this information. That first implementation still needs to
parse the full obsstore when updating for the sake of simplicity. It will be
improved later to allow lighter upgrade.

The next changesets will put this new cache to use.

That code is coming from the evolve extension, were it matured. To keep this
changeset simple, there are a couple of improvement in the extension that will
be ported later.
Augie Fackler - May 19, 2017, 9:53 p.m.
On Fri, May 19, 2017 at 04:28:04PM +0200, Pierre-Yves David wrote:
> # HG changeset patch
> # User Pierre-Yves David <pierre-yves.david@octobus.net>
> # Date 1495197986 -7200
> #      Fri May 19 14:46:26 2017 +0200
> # Node ID c2b1c81b3c4cba8cc21b1a1b92f4e34e6aa1807a
> # Parent  b5c984cd0640255b94f22d56a4bfe398b2c5de2e
> # EXP-Topic obscache
> # Available At https://www.mercurial-scm.org/repo/users/marmoute/mercurial/
> #              hg pull https://www.mercurial-scm.org/repo/users/marmoute/mercurial/ -r c2b1c81b3c4c
> obscache: add a cache for 1/2 of the "obsolete" property

So, can you point me to other (upcoming) concrete subclasses of
dualsourcecache, or is this the only one?

Also, could we put these big new classes in obscache.py or some other
new file? I'm wary of adding more code to the already-1200-lines
obsolete.py, and since these are coming from evolve I'm hoping that
they're still easily separable from obsolete.py...

>
> Knowing if a changeset is obsolete requires two data:
>
> 1) is the change non-public,
> 2) is the changeset affected by any obsolescence markers.
>
> The phase related data has some fast implementation already. However, the
> obsmarkers based property currently requires to parse the whole obsstore, a slow
> operation.
>
> That information is monotonic (new changeset are not affected, and once they are
> affected, they will be for ever), making it is easy to cache. We introduce a new
> class dedicated to this information. That first implementation still needs to
> parse the full obsstore when updating for the sake of simplicity. It will be
> improved later to allow lighter upgrade.
>
> The next changesets will put this new cache to use.
>
> That code is coming from the evolve extension, were it matured. To keep this
> changeset simple, there are a couple of improvement in the extension that will
> be ported later.
>
> diff --git a/mercurial/obsolete.py b/mercurial/obsolete.py
> --- a/mercurial/obsolete.py
> +++ b/mercurial/obsolete.py
> @@ -1363,6 +1363,138 @@ class dualsourcecache(object):
>
>          return reset, revs, markers, (obssize, obskey)
>
> +class obscache(dualsourcecache):
> +    """cache "does a rev is the precursors of some obsmarkers" property
> +
> +    This is not directly holding the "is this revision obsolete" information,
> +    because phases data gets into play here. However, it allow to compute the
> +    "obsolescence" set without reading the obsstore content.
> +
> +    The cache use a bytearray to store that data and simply write it on disk
> +    for storage.
> +
> +    Implementation note #1:
> +
> +      The obsstore is implementing only half of the transaction logic it
> +      should. It properly record the starting point of the obsstore to allow
> +      clean rollback. However it still write to the obsstore file directly
> +      during the transaction. Instead it should be keeping data in memory and
> +      write to a '.pending' file to make the data vailable for hooks.
> +
> +      This cache is not going further than what the obsstore is doing, so it
> +      does not has any '.pending' logic. When the obsstore gains proper
> +      '.pending' support, adding it to this cache should not be too hard. As
> +      the flag always move from 0 to 1, we could have a second '.pending' cache
> +      file to be read. If flag is set in any of them, the value is 1. For the
> +      same reason, updating the file in place should be possible.
> +
> +    Implementation note #2:
> +
> +        Storage-wise, we could have a "start rev" to avoid storing useless
> +        zero. That would be especially useful for the '.pending' overlay.
> +    """
> +
> +    _filepath = 'cache/obscache-v01'
> +    _headerformat = '>q20sQQ20s'
> +
> +    _cachename = 'obscache' # used for error message
> +
> +    def __init__(self, repo):
> +        super(obscache, self).__init__()
> +        self._ondiskkey = None
> +        self._vfs = repo.vfs
> +        self._data = bytearray()
> +
> +    def get(self, rev):
> +        """return True if "rev" is used as "precursors for any obsmarkers
> +
> +        IMPORTANT: make sure the cache has been updated to match the repository
> +        content before using it"""
> +        return self._data[rev]
> +
> +    def clear(self, reset=False):
> +        """invalidate the in memory cache content"""
> +        super(obscache, self).clear(reset=reset)
> +        self._data = bytearray()
> +
> +    def _updatefrom(self, repo, revs, obsmarkers):
> +        if revs:
> +            self._updaterevs(repo, revs)
> +        if obsmarkers:
> +            self._updatemarkers(repo, obsmarkers)
> +
> +    def _updaterevs(self, repo, revs):
> +        """update the cache with new revisions
> +
> +        Newly added changesets might be affected by obsolescence markers we
> +        already have locally. So we needs to have some global knowledge about
> +        the markers to handle that question.
> +
> +        XXX performance note:
> +
> +        Right now this requires parsing all markers in the obsstore. We could
> +        imagine using various optimisation (eg: another cache, network
> +        exchange, etc).
> +
> +        A possible approach to this is to build a set of all nodes used as
> +        precursors in `obsstore._obscandidate`. If markers are not loaded yet,
> +        we could initialize it by doing a quick scan through the obsstore data
> +        and filling a (pre-sized) set. Doing so would be much faster than
> +        parsing all the obsmarkers since we would access less data, not create
> +        any object beside the nodes and not have to decode any complex data.
> +
> +        For now we stick to the simpler approach of paying the
> +        performance cost on new changesets.
> +        """
> +        node = repo.changelog.node
> +        succs = repo.obsstore.successors
> +        for r in revs:
> +            val = int(node(r) in succs)
> +            self._data.append(val)
> +        cl = repo.changelog
> +        assert len(self._data) == len(cl), (len(self._data), len(cl))
> +
> +    def _updatemarkers(self, repo, obsmarkers):
> +        """update the cache with new markers"""
> +        rev = repo.changelog.nodemap.get
> +        for m in obsmarkers:
> +            r = rev(m[0])
> +            if r is not None:
> +                self._data[r] = 1
> +
> +    def save(self, repo):
> +        """save the data to disk
> +
> +        Format is pretty simple, we serialise the cache key and then drop the
> +        bytearray.
> +        """
> +
> +        # XXX we'll need some pending related logic when the obsstore get it
> +
> +        if self._cachekey is None or self._cachekey == self._ondiskkey:
> +            return
> +
> +        cachefile = repo.vfs(self._filepath, 'w', atomictemp=True)
> +        headerdata = struct.pack(self._headerformat, *self._cachekey)
> +        cachefile.write(headerdata)
> +        cachefile.write(self._data)
> +        cachefile.close()
> +
> +    def load(self, repo):
> +        """load data from disk"""
> +        assert repo.filtername is None
> +
> +        data = repo.vfs.tryread(self._filepath)
> +        if not data:
> +            self._cachekey = self.emptykey
> +            self._data = bytearray()
> +        else:
> +            headersize = struct.calcsize(self._headerformat)
> +            self._cachekey = struct.unpack(self._headerformat,
> +                                           data[:headersize])
> +            self._data = bytearray(data[headersize:])
> +        self._ondiskkey = self._cachekey
> +
>  # mapping of 'set-name' -> <function to compute this set>
>  cachefuncs = {}
>  def cachefor(name):
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Pierre-Yves David - May 19, 2017, 10:15 p.m.
On 05/19/2017 11:53 PM, Augie Fackler wrote:
> On Fri, May 19, 2017 at 04:28:04PM +0200, Pierre-Yves David wrote:
>> # HG changeset patch
>> # User Pierre-Yves David <pierre-yves.david@octobus.net>
>> # Date 1495197986 -7200
>> #      Fri May 19 14:46:26 2017 +0200
>> # Node ID c2b1c81b3c4cba8cc21b1a1b92f4e34e6aa1807a
>> # Parent  b5c984cd0640255b94f22d56a4bfe398b2c5de2e
>> # EXP-Topic obscache
>> # Available At https://www.mercurial-scm.org/repo/users/marmoute/mercurial/
>> #              hg pull https://www.mercurial-scm.org/repo/users/marmoute/mercurial/ -r c2b1c81b3c4c
>> obscache: add a cache for 1/2 of the "obsolete" property
>
> So, can you point me to other (upcoming) concrete subclasses of
> dualsourcecache, or is this the only one?

There are current a second user of this abstract class in evolve. It is 
used for obsdiscovery[1]. And I expect to we'll experiment with more of 
this class of cache in the near future.

[1] 
https://www.mercurial-scm.org/repo/evolve/file/71d242e0fc1a/hgext3rd/evolve/obsdiscovery.py#l440

The logic in the cache is easy to screw up and the end result is really 
need to use (implement save/load/reset/update(revs, obs)). Having a 
class ready to use is -really- handy.

In addition, there are final stage of skip-obsstore optimization that 
requires some rework and collaboration in core something hard to access 
from an extension.

So, having this intermediate abstract class is a corner-stone of the 
current contribution. (most of work in evolve 6.2 is actually about 
extracting and reusing this logic)

> Also, could we put these big new classes in obscache.py or some other
> new file? I'm wary of adding more code to the already-1200-lines
> obsolete.py, and since these are coming from evolve I'm hoping that
> they're still easily separable from obsolete.py...

This crossed my mind when adding it. However, there are significant 
improvement to that class arriving soon. These improvement will need to 
access (and refactor) some lower logic around reading markers and al. I 
think it is wiser to wait for the dust to settle after these upgrade and 
cut where this is

(the version in evolve access some "_private" method and duplicates some 
code)

Cheers,
Augie Fackler - May 19, 2017, 10:32 p.m.
> On May 19, 2017, at 15:15, Pierre-Yves David <pierre-yves.david@ens-lyon.org> wrote:
> 
> 
> 
> On 05/19/2017 11:53 PM, Augie Fackler wrote:
>> On Fri, May 19, 2017 at 04:28:04PM +0200, Pierre-Yves David wrote:
>>> # HG changeset patch
>>> # User Pierre-Yves David <pierre-yves.david@octobus.net>
>>> # Date 1495197986 -7200
>>> #      Fri May 19 14:46:26 2017 +0200
>>> # Node ID c2b1c81b3c4cba8cc21b1a1b92f4e34e6aa1807a
>>> # Parent  b5c984cd0640255b94f22d56a4bfe398b2c5de2e
>>> # EXP-Topic obscache
>>> # Available At https://www.mercurial-scm.org/repo/users/marmoute/mercurial/
>>> #              hg pull https://www.mercurial-scm.org/repo/users/marmoute/mercurial/ -r c2b1c81b3c4c
>>> obscache: add a cache for 1/2 of the "obsolete" property
>> 
>> So, can you point me to other (upcoming) concrete subclasses of
>> dualsourcecache, or is this the only one?
> 
> There are current a second user of this abstract class in evolve. It is used for obsdiscovery[1]. And I expect to we'll experiment with more of this class of cache in the near future.
> 
> [1] https://www.mercurial-scm.org/repo/evolve/file/71d242e0fc1a/hgext3rd/evolve/obsdiscovery.py#l440
> 
> The logic in the cache is easy to screw up and the end result is really need to use (implement save/load/reset/update(revs, obs)). Having a class ready to use is -really- handy.

I wish it could be done without the use of inheritance, but it's fine-ish for now as an abstract class. Should it perhaps be using the abc module to be a more explicit base class?

(I haven't ever used abc, so "no" or "not now" is fine)

> In addition, there are final stage of skip-obsstore optimization that requires some rework and collaboration in core something hard to access from an extension.
> 
> So, having this intermediate abstract class is a corner-stone of the current contribution. (most of work in evolve 6.2 is actually about extracting and reusing this logic)
> 
>> Also, could we put these big new classes in obscache.py or some other
>> new file? I'm wary of adding more code to the already-1200-lines
>> obsolete.py, and since these are coming from evolve I'm hoping that
>> they're still easily separable from obsolete.py...
> 
> This crossed my mind when adding it. However, there are significant improvement to that class arriving soon. These improvement will need to access (and refactor) some lower logic around reading markers and al. I think it is wiser to wait for the dust to settle after these upgrade and cut where this is

I'm not really satisfied with this answer. Can you split things up into a better organization scheme so we don't end up with more of a mess in this already far-too-large module? That'd make things easier for me to reason about.

> (the version in evolve access some "_private" method and duplicates some code)

^ this is a sign to me that some more refactoring should probably happen before this lands, as much as I want it now.

> 
> Cheers,
> 
> -- 
> Pierre-Yves David
Pierre-Yves David - May 20, 2017, 3:20 p.m.
On 05/20/2017 12:32 AM, Augie Fackler wrote:
>> On May 19, 2017, at 15:15, Pierre-Yves David <pierre-yves.david@ens-lyon.org> wrote:
>>
>> On 05/19/2017 11:53 PM, Augie Fackler wrote:
>>> On Fri, May 19, 2017 at 04:28:04PM +0200, Pierre-Yves David wrote:
>>>> # HG changeset patch
>>>> # User Pierre-Yves David <pierre-yves.david@octobus.net>
>>>> # Date 1495197986 -7200
>>>> #      Fri May 19 14:46:26 2017 +0200
>>>> # Node ID c2b1c81b3c4cba8cc21b1a1b92f4e34e6aa1807a
>>>> # Parent  b5c984cd0640255b94f22d56a4bfe398b2c5de2e
>>>> # EXP-Topic obscache
>>>> # Available At https://www.mercurial-scm.org/repo/users/marmoute/mercurial/
>>>> #              hg pull https://www.mercurial-scm.org/repo/users/marmoute/mercurial/ -r c2b1c81b3c4c
>>>> obscache: add a cache for 1/2 of the "obsolete" property
>>>
>>> So, can you point me to other (upcoming) concrete subclasses of
>>> dualsourcecache, or is this the only one?
>>
>> There are current a second user of this abstract class in evolve. It is used for obsdiscovery[1]. And I expect to we'll experiment with more of this class of cache in the near future.
>>
>> [1] https://www.mercurial-scm.org/repo/evolve/file/71d242e0fc1a/hgext3rd/evolve/obsdiscovery.py#l440
>>
>> The logic in the cache is easy to screw up and the end result is really need to use (implement save/load/reset/update(revs, obs)). Having a class ready to use is -really- handy.
>
> I wish it could be done without the use of inheritance, but it's fine-ish for now as an abstract class. Should it perhaps be using the abc module to be a more explicit base class?
>
> (I haven't ever used abc, so "no" or "not now" is fine)

That looks like a good idea, sold!

>> In addition, there are final stage of skip-obsstore optimization that requires some rework and collaboration in core something hard to access from an extension.
>>
>> So, having this intermediate abstract class is a corner-stone of the current contribution. (most of work in evolve 6.2 is actually about extracting and reusing this logic)
>>
>>> Also, could we put these big new classes in obscache.py or some other
>>> new file? I'm wary of adding more code to the already-1200-lines
>>> obsolete.py, and since these are coming from evolve I'm hoping that
>>> they're still easily separable from obsolete.py...
>>
>> This crossed my mind when adding it. However, there are significant improvement to that class arriving soon. These improvement will need to access (and refactor) some lower logic around reading markers and al. I think it is wiser to wait for the dust to settle after these upgrade and cut where this is
>
> I'm not really satisfied with this answer. Can you split things up into a better organization scheme so we don't end up with more of a mess in this already far-too-large module? That'd make things easier for me to reason about.

Since, I did not used the "not now" voucher for the question above. I 
would like to it here :-). That module can be split up, but not now.

This series is about improving caching. It lands new cache related code 
right next to other cache related code. The module grow bigger but not 
messier.

Splitting this module (likely by creating a obsutil.py module for 
utility functions) will require proper work (redirecting internal user, 
adding deprecation for external user of the util, etc). That is a 
significant unrelated scope bloat that is unlikely to affect the review 
of this series.

There are nothing urgent in splitting this module, it is big, but not 
massive by our current standard (there are about25 modules larger). I've 
already split and cleaned many modules so you can trust me to keep a 
good balance here.

I'll send a V2 to make the discussion move forward on the code.

>> (the version in evolve access some "_private" method and duplicates some code)
>
> ^ this is a sign to me that some more refactoring should probably happen before this lands, as much as I want it now.

I just see it as a sign there are advance experiments going on in the 
evolve extension.

Cheers,

Patch

diff --git a/mercurial/obsolete.py b/mercurial/obsolete.py
--- a/mercurial/obsolete.py
+++ b/mercurial/obsolete.py
@@ -1363,6 +1363,138 @@  class dualsourcecache(object):
 
         return reset, revs, markers, (obssize, obskey)
 
+class obscache(dualsourcecache):
+    """cache "does a rev is the precursors of some obsmarkers" property
+
+    This is not directly holding the "is this revision obsolete" information,
+    because phases data gets into play here. However, it allow to compute the
+    "obsolescence" set without reading the obsstore content.
+
+    The cache use a bytearray to store that data and simply write it on disk
+    for storage.
+
+    Implementation note #1:
+
+      The obsstore is implementing only half of the transaction logic it
+      should. It properly record the starting point of the obsstore to allow
+      clean rollback. However it still write to the obsstore file directly
+      during the transaction. Instead it should be keeping data in memory and
+      write to a '.pending' file to make the data vailable for hooks.
+
+      This cache is not going further than what the obsstore is doing, so it
+      does not has any '.pending' logic. When the obsstore gains proper
+      '.pending' support, adding it to this cache should not be too hard. As
+      the flag always move from 0 to 1, we could have a second '.pending' cache
+      file to be read. If flag is set in any of them, the value is 1. For the
+      same reason, updating the file in place should be possible.
+
+    Implementation note #2:
+
+        Storage-wise, we could have a "start rev" to avoid storing useless
+        zero. That would be especially useful for the '.pending' overlay.
+    """
+
+    _filepath = 'cache/obscache-v01'
+    _headerformat = '>q20sQQ20s'
+
+    _cachename = 'obscache' # used for error message
+
+    def __init__(self, repo):
+        super(obscache, self).__init__()
+        self._ondiskkey = None
+        self._vfs = repo.vfs
+        self._data = bytearray()
+
+    def get(self, rev):
+        """return True if "rev" is used as "precursors for any obsmarkers
+
+        IMPORTANT: make sure the cache has been updated to match the repository
+        content before using it"""
+        return self._data[rev]
+
+    def clear(self, reset=False):
+        """invalidate the in memory cache content"""
+        super(obscache, self).clear(reset=reset)
+        self._data = bytearray()
+
+    def _updatefrom(self, repo, revs, obsmarkers):
+        if revs:
+            self._updaterevs(repo, revs)
+        if obsmarkers:
+            self._updatemarkers(repo, obsmarkers)
+
+    def _updaterevs(self, repo, revs):
+        """update the cache with new revisions
+
+        Newly added changesets might be affected by obsolescence markers we
+        already have locally. So we needs to have some global knowledge about
+        the markers to handle that question.
+
+        XXX performance note:
+
+        Right now this requires parsing all markers in the obsstore. We could
+        imagine using various optimisation (eg: another cache, network
+        exchange, etc).
+
+        A possible approach to this is to build a set of all nodes used as
+        precursors in `obsstore._obscandidate`. If markers are not loaded yet,
+        we could initialize it by doing a quick scan through the obsstore data
+        and filling a (pre-sized) set. Doing so would be much faster than
+        parsing all the obsmarkers since we would access less data, not create
+        any object beside the nodes and not have to decode any complex data.
+
+        For now we stick to the simpler approach of paying the
+        performance cost on new changesets.
+        """
+        node = repo.changelog.node
+        succs = repo.obsstore.successors
+        for r in revs:
+            val = int(node(r) in succs)
+            self._data.append(val)
+        cl = repo.changelog
+        assert len(self._data) == len(cl), (len(self._data), len(cl))
+
+    def _updatemarkers(self, repo, obsmarkers):
+        """update the cache with new markers"""
+        rev = repo.changelog.nodemap.get
+        for m in obsmarkers:
+            r = rev(m[0])
+            if r is not None:
+                self._data[r] = 1
+
+    def save(self, repo):
+        """save the data to disk
+
+        Format is pretty simple, we serialise the cache key and then drop the
+        bytearray.
+        """
+
+        # XXX we'll need some pending related logic when the obsstore get it
+
+        if self._cachekey is None or self._cachekey == self._ondiskkey:
+            return
+
+        cachefile = repo.vfs(self._filepath, 'w', atomictemp=True)
+        headerdata = struct.pack(self._headerformat, *self._cachekey)
+        cachefile.write(headerdata)
+        cachefile.write(self._data)
+        cachefile.close()
+
+    def load(self, repo):
+        """load data from disk"""
+        assert repo.filtername is None
+
+        data = repo.vfs.tryread(self._filepath)
+        if not data:
+            self._cachekey = self.emptykey
+            self._data = bytearray()
+        else:
+            headersize = struct.calcsize(self._headerformat)
+            self._cachekey = struct.unpack(self._headerformat,
+                                           data[:headersize])
+            self._data = bytearray(data[headersize:])
+        self._ondiskkey = self._cachekey
+
 # mapping of 'set-name' -> <function to compute this set>
 cachefuncs = {}
 def cachefor(name):