Patchwork [5,of,5,RFC] obsolete: store obsolete revisions to cache file

login
register
mail settings
Submitter Yuya Nishihara
Date Sept. 23, 2015, 5:48 a.m.
Message ID <1e78698ebd6c791fbca9.1442987286@mimosa>
Download mbox | patch
Permalink /patch/10592/
State Changes Requested
Headers show

Comments

Yuya Nishihara - Sept. 23, 2015, 5:48 a.m.
# HG changeset patch
# User Yuya Nishihara <yuya@tcha.org>
# Date 1441508842 -32400
#      Sun Sep 06 12:07:22 2015 +0900
# Node ID 1e78698ebd6c791fbca9da6bf56b65ef75f21d43
# Parent  35de00be99c2536643df4e100c7784d39829614b
obsolete: store obsolete revisions to cache file

This saves ~300msec for the repository that has 83731 markers (8.4MB) in total,
but has 5404 revisions marked as obsolete (22kB).

  $ time hg log -G -l10 > /dev/null
  before: 0.474sec
  after:  0.121sec

We could cache (rev, marker-offset) pair so that uninterested markers can
be skipped at the parsing state, but it would require further design change.
Pierre-Yves David - Sept. 23, 2015, 8:40 a.m.
On 09/22/2015 10:48 PM, Yuya Nishihara wrote:
> # HG changeset patch
> # User Yuya Nishihara <yuya@tcha.org>
> # Date 1441508842 -32400
> #      Sun Sep 06 12:07:22 2015 +0900
> # Node ID 1e78698ebd6c791fbca9da6bf56b65ef75f21d43
> # Parent  35de00be99c2536643df4e100c7784d39829614b
> obsolete: store obsolete revisions to cache file
>
> This saves ~300msec for the repository that has 83731 markers (8.4MB) in total,
> but has 5404 revisions marked as obsolete (22kB).
>
>    $ time hg log -G -l10 > /dev/null
>    before: 0.474sec
>    after:  0.121sec
>
> We could cache (rev, marker-offset) pair so that uninterested markers can
> be skipped at the parsing state, but it would require further design change.


There is some point I would like to see addressed in the descrition:

- What's the filename (or naming scheme)
- What are actually storing rev? or hash?
- What is the validation strategy and why is it correct.


On this "cache" topic. I was hoping to get a caching structure that 
allow a key to have a growing number of value. It would have multiple 
applications including here. Caching the precursors-rev -> markers 
mapping. Since the keys are dynamic, we can probably use a fixed size 
index + extra data somewhere. With such setup, reading the index would 
give the same information you are trying to cache here. If we cache on 
'rev' it would even give it in O(1) fashion (similar to the current 
branch cache)

I'm not saying we should go for a simpler approach first, but I'm 
brain-dumping these idea here to see if someone get inspired.
Yuya Nishihara - Sept. 24, 2015, 3:10 p.m.
On Wed, 23 Sep 2015 01:40:12 -0700, Pierre-Yves David wrote:
> On 09/22/2015 10:48 PM, Yuya Nishihara wrote:
> > # HG changeset patch
> > # User Yuya Nishihara <yuya@tcha.org>
> > # Date 1441508842 -32400
> > #      Sun Sep 06 12:07:22 2015 +0900
> > # Node ID 1e78698ebd6c791fbca9da6bf56b65ef75f21d43
> > # Parent  35de00be99c2536643df4e100c7784d39829614b
> > obsolete: store obsolete revisions to cache file
> >
> > This saves ~300msec for the repository that has 83731 markers (8.4MB) in total,
> > but has 5404 revisions marked as obsolete (22kB).
> >
> >    $ time hg log -G -l10 > /dev/null
> >    before: 0.474sec
> >    after:  0.121sec
> >
> > We could cache (rev, marker-offset) pair so that uninterested markers can
> > be skipped at the parsing state, but it would require further design change.
> 
> There is some point I would like to see addressed in the descrition:
> 
> - What's the filename (or naming scheme)

"cache/obsrevs-{'b'ig or 'l'ittle}{version}"

It has endian flag in case a repository is shared across Power and x86
machines. version isn't important.

> - What are actually storing rev? or hash?

It stores revision numbers that are marked as obsolete.
It seems getting revisions from thousands of hashes is slow.

> - What is the validation strategy and why is it correct.

SHA1 hash of {all repository heads + the last obsolete marker}, and offset
to the last obsolete marker.

Because revision numbers can change on strip, hashes of all heads are
included. This is the same strategy as cache/hidden, I think.

The difference is: in order to detect change of obsstore, it uses the
hash of the last marker and the offset to that marker. cache/hidden uses
the hash of obsolete revisions, which is what this patch tries to avoid
loading from obsstore.

It can detect added or truncated markers as follows:

 a) if marker added, hash(obsstore[lastmarkerofs:]) != cachedhash
 b) if marker truncated, obsstore[lastmarkerofs:] == '',
    so hash(obsstore[lastmarkerofs:]) != cachedhash
 c) edge case, if marker added/truncated by another process in the middle
    of reading/writing cache, lastmarkerofs won't be calculated correctly,
    so hash(obsstore[lastmarkerofs:]) != cachedhash

But it can't detect changes if different marker is inserted after truncating,
and the identical marker is appended at the exactly same offset.

                  lastmarkerofs
                    v
 original:   |......|last|   (cached)
 truncated:  |..|            (by old hg, cache not updated)
 insert new: |..|new|        (by old hg, cache not updated)
 append:     |..|new|last|
                 ^^^
              can't detect this change because it isn't included in the hash

We could calculate the hash of the whole obsstore, but it would be slow.

> On this "cache" topic. I was hoping to get a caching structure that 
> allow a key to have a growing number of value. It would have multiple 
> applications including here. Caching the precursors-rev -> markers 
> mapping. Since the keys are dynamic, we can probably use a fixed size 
> index + extra data somewhere. With such setup, reading the index would 
> give the same information you are trying to cache here. If we cache on 
> 'rev' it would even give it in O(1) fashion (similar to the current 
> branch cache)

Good point. I was thinking about that sort of caching mechanism. But before
starting, I wanted to see how fast it would be, and the result is this patch.

I'll try something like your suggestion.

Regards,
Matt Mackall - Sept. 29, 2015, 10:04 p.m.
On Fri, 2015-09-25 at 00:10 +0900, Yuya Nishihara wrote:
> On Wed, 23 Sep 2015 01:40:12 -0700, Pierre-Yves David wrote:
> > On 09/22/2015 10:48 PM, Yuya Nishihara wrote:
> > > # HG changeset patch
> > > # User Yuya Nishihara <yuya@tcha.org>
> > > # Date 1441508842 -32400
> > > #      Sun Sep 06 12:07:22 2015 +0900
> > > # Node ID 1e78698ebd6c791fbca9da6bf56b65ef75f21d43
> > > # Parent  35de00be99c2536643df4e100c7784d39829614b
> > > obsolete: store obsolete revisions to cache file
> > >
> > > This saves ~300msec for the repository that has 83731 markers (8.4MB) in total,
> > > but has 5404 revisions marked as obsolete (22kB).
> > >
> > >    $ time hg log -G -l10 > /dev/null
> > >    before: 0.474sec
> > >    after:  0.121sec
> > >
> > > We could cache (rev, marker-offset) pair so that uninterested markers can
> > > be skipped at the parsing state, but it would require further design change.
> > 
> > There is some point I would like to see addressed in the descrition:
> > 
> > - What's the filename (or naming scheme)
> 
> "cache/obsrevs-{'b'ig or 'l'ittle}{version}"
> 
> It has endian flag in case a repository is shared across Power and x86
> machines. version isn't important.

I'd prefer to always store big-endian. Endian-conversion is cheap and
intentionally choosing big-endian means we're slightly more likely to
get portability right when developing on x86 because we eliminate the
write-native-read-big or write-big-read-native cases without needing to
test on ARM/etc.
Yuya Nishihara - Sept. 30, 2015, 3:20 p.m.
On Tue, 29 Sep 2015 17:04:04 -0500, Matt Mackall wrote:
> On Fri, 2015-09-25 at 00:10 +0900, Yuya Nishihara wrote:
> > On Wed, 23 Sep 2015 01:40:12 -0700, Pierre-Yves David wrote:
> > > On 09/22/2015 10:48 PM, Yuya Nishihara wrote:
> > > > # HG changeset patch
> > > > # User Yuya Nishihara <yuya@tcha.org>
> > > > # Date 1441508842 -32400
> > > > #      Sun Sep 06 12:07:22 2015 +0900
> > > > # Node ID 1e78698ebd6c791fbca9da6bf56b65ef75f21d43
> > > > # Parent  35de00be99c2536643df4e100c7784d39829614b
> > > > obsolete: store obsolete revisions to cache file
> > > >
> > > > This saves ~300msec for the repository that has 83731 markers (8.4MB) in total,
> > > > but has 5404 revisions marked as obsolete (22kB).
> > > >
> > > >    $ time hg log -G -l10 > /dev/null
> > > >    before: 0.474sec
> > > >    after:  0.121sec
> > > >
> > > > We could cache (rev, marker-offset) pair so that uninterested markers can
> > > > be skipped at the parsing state, but it would require further design change.
> > > 
> > > There is some point I would like to see addressed in the descrition:
> > > 
> > > - What's the filename (or naming scheme)
> > 
> > "cache/obsrevs-{'b'ig or 'l'ittle}{version}"
> > 
> > It has endian flag in case a repository is shared across Power and x86
> > machines. version isn't important.
> 
> I'd prefer to always store big-endian. Endian-conversion is cheap and
> intentionally choosing big-endian means we're slightly more likely to
> get portability right when developing on x86 because we eliminate the
> write-native-read-big or write-big-read-native cases without needing to
> test on ARM/etc.

Hmm, makes sense. Endian-conversion won't be slower than creating a Python
int. Also, x86 provides bswap instruction if we really need it.

OT: ARM is typically used in little-endian mode. I think well-known big-
endian microprocessor is Atmel AVR?
Augie Fackler - Sept. 30, 2015, 4:22 p.m.
On Thu, Oct 01, 2015 at 12:20:57AM +0900, Yuya Nishihara wrote:
> On Tue, 29 Sep 2015 17:04:04 -0500, Matt Mackall wrote:
> > On Fri, 2015-09-25 at 00:10 +0900, Yuya Nishihara wrote:
> > > On Wed, 23 Sep 2015 01:40:12 -0700, Pierre-Yves David wrote:
> > I'd prefer to always store big-endian. Endian-conversion is cheap and
> > intentionally choosing big-endian means we're slightly more likely to
> > get portability right when developing on x86 because we eliminate the
> > write-native-read-big or write-big-read-native cases without needing to
> > test on ARM/etc.
>
> Hmm, makes sense. Endian-conversion won't be slower than creating a Python
> int. Also, x86 provides bswap instruction if we really need it.

+1.

>
> OT: ARM is typically used in little-endian mode. I think well-known big-
> endian microprocessor is Atmel AVR?

The one I'm most familiar with is POWER and PowerPC, though POWER8
adds enough support that you can run it in little-endian mode.

> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> https://selenic.com/mailman/listinfo/mercurial-devel

Patch

diff --git a/mercurial/obsolete.py b/mercurial/obsolete.py
--- a/mercurial/obsolete.py
+++ b/mercurial/obsolete.py
@@ -67,7 +67,7 @@  The header is followed by the markers. M
 comment associated with each format for details.
 
 """
-import errno, struct
+import array, errno, struct, sys
 import util, base85, node, parsers
 import phases
 from i18n import _
@@ -1059,6 +1059,101 @@  def successorssets(repo, initialnode, ca
                 cache[current] = final
     return cache[initialnode]
 
+_revscacheversion = 1
+_revscachefile = 'cache/obsrevs-%c%d' % (sys.byteorder[0], _revscacheversion)
+_revscacheheader = struct.Struct('=I20s')
+
+def _revscachehash(repo, lastmarkerdata):
+    """Calculate hash of repository heads and obsolete markers
+
+    Because the obsstore tends to be large, it would be expensive to calculate
+    the hash of the full obsstore data. So the hash is calculated from the last
+    marker assuming that the obsstore is append-only storage.
+
+             latmarkerofs     eof
+        --/----+----------------+
+          \    | lastmarkerdata |
+        --/----+----------------+
+        new marker added:
+        --/----+----------------+----------------+
+          \    | lastmarkerdata + newmarkerdata  |
+        --/----+---------------------------------+
+        rolled back:
+        -+
+         |  (lastmarkerdata = '')
+        -+
+
+    This means the following case cannot be handled. This happens if rollback
+    is done by old hg that isn't aware of the cache. New hg should rebuild
+    the cache after rollback.
+
+        rolled back, new marker added but the last chunk is identical:
+        -+/----+----------------+
+         |\ new| lastmarkerdata |
+        -+/----+----------------+
+    """
+    h = util.sha1()
+    for n in repo.heads():
+        h.update(n)
+    h.update(lastmarkerdata)
+    return h.digest()
+
+def _tryreadrevscache(ui, repo):
+    """Read revisions from cache file if it is valid; otherwise returns None
+
+    The cache consists of the following fields. All integers are stored in
+    native byte order.
+      4 bytes   offset to the last obsolete marker in obsstore file
+     20 bytes   hash of repository state when the cache was written
+    n*4 bytes   revision numbers
+    """
+    cachef = storef = None
+    try:
+        cachef = repo.vfs.open(_revscachefile, 'rb')
+        storef = repo.svfs.open('obsstore', 'rb')
+        h = cachef.read(_revscacheheader.size)
+        lastmarkerofs, chash = _revscacheheader.unpack(h)
+        storef.seek(lastmarkerofs)
+        lastmarkerdata = storef.read()
+        if chash != _revscachehash(repo, lastmarkerdata):
+            ui.log('obsrevscache', 'ignore obsrevs cache: hash differs\n')
+            return None
+        revsarray = array.array('i')
+        revsarray.fromstring(cachef.read())
+        ui.log('obsrevscache', 'read %d obsrevs from cache\n', len(revsarray))
+        return revsarray
+    except (IOError, OSError, OverflowError, struct.error):
+        ui.log('obsrevscache', 'failed to read obsrevs from cache\n')
+        return None
+    finally:
+        if cachef:
+            cachef.close()
+        if storef:
+            storef.close()
+
+def _trywriterevscache(ui, repo, revs):
+    """Write revision to cache file"""
+    obsstore = repo.obsstore
+    m = obsstore._all[-1:]
+    lastmarkerdata = ''.join(encodemarkers(m, version=obsstore._version))
+    try:
+        lastmarkerofs = repo.svfs.stat('obsstore').st_size - len(lastmarkerdata)
+    except OSError:
+        return
+    if lastmarkerofs < 1:
+        return
+    chash = _revscachehash(repo, lastmarkerdata)
+    revsarray = array.array('i')
+    revsarray.fromlist(revs)
+    try:
+        cachef = repo.vfs.open(_revscachefile, 'wb', atomictemp=True)
+        cachef.write(_revscacheheader.pack(lastmarkerofs, chash))
+        cachef.write(revsarray.tostring())
+        cachef.close()
+        ui.log('obsrevscache', 'wrote %d obsrevs to cache\n', len(revsarray))
+    except (IOError, OSError):
+        ui.log('obsrevscache', 'failed to write obsrevs to cache\n')
+
 # mapping of 'set-name' -> <function to compute this set>
 cachefuncs = {}
 def cachefor(name):
@@ -1111,12 +1206,16 @@  def _computeobsoleteset(repo):
 def _computerawobsoletelist(repo):
     """the list of revisions marked as obsolete, which may include public
     revisions"""
+    obs = _tryreadrevscache(repo.ui, repo)
+    if obs is not None:
+        return obs
     obs = []
     getrev = repo.changelog.nodemap.get
     for n in repo.obsstore.successors:
         rev = getrev(n)
         if rev is not None:
             obs.append(rev)
+    _trywriterevscache(repo.ui, repo, obs)
     return obs
 
 @cachefor('unstable')
diff --git a/tests/test-obsolete-tag-cache.t b/tests/test-obsolete-tag-cache.t
--- a/tests/test-obsolete-tag-cache.t
+++ b/tests/test-obsolete-tag-cache.t
@@ -4,6 +4,9 @@ 
   > rebase=
   > mock=$TESTDIR/mockblackbox.py
   > 
+  > [blackbox]
+  > track = command, commandfinish, tagscache
+  > 
   > [experimental]
   > evolution = createmarkers
   > EOF
diff --git a/tests/test-obsolete.t b/tests/test-obsolete.t
--- a/tests/test-obsolete.t
+++ b/tests/test-obsolete.t
@@ -8,6 +8,10 @@ 
   > # drop me once bundle2 is the default,
   > # added to get test change early.
   > bundle2-exp = True
+  > [extensions]
+  > blackbox =
+  > [blackbox]
+  > track = command, obsrevscache
   > EOF
   $ mkcommit() {
   >    echo "$1" > "$1"
@@ -258,6 +262,135 @@  We need to create a clone of 5 and add a
 
   $ cd ..
 
+Cached obsolete revisions
+-------------------------
+
+  $ hg init cached
+  $ cd cached
+
+  $ mkcommit a
+  $ mkcommit b
+  $ hg debugobsolete `getid a`
+
+cache should be generated
+
+  $ hg log -G
+  @  1:7c3bad9141dc (draft) [tip ] add b
+  |
+  x  0:1f0dee641bb7 (draft) [ ] add a
+  
+  $ hg blackbox -l 3
+  *> log -G (glob)
+  *> failed to read obsrevs from cache (glob)
+  *> wrote 1 obsrevs to cache (glob)
+
+cache should be valid
+
+  $ hg log -G
+  @  1:7c3bad9141dc (draft) [tip ] add b
+  |
+  x  0:1f0dee641bb7 (draft) [ ] add a
+  
+  $ hg blackbox -l 2
+  *> log -G (glob)
+  *> read 1 obsrevs from cache (glob)
+
+phase shouldn't affect cache
+
+  $ hg phase --public 0
+  $ hg log -G
+  @  1:7c3bad9141dc (draft) [tip ] add b
+  |
+  o  0:1f0dee641bb7 (public) [ ] add a
+  
+  $ hg blackbox -l 2
+  *> log -G (glob)
+  *> read 1 obsrevs from cache (glob)
+
+cache should be invalidated if obsolete marker is added
+
+  $ hg update 0
+  0 files updated, 0 files merged, 1 files removed, 0 files unresolved
+  $ hg debugobsolete `getid b`
+  $ hg log -G
+  @  0:1f0dee641bb7 (public) [tip ] add a
+  
+  $ hg blackbox -l 3
+  *> log -G (glob)
+  *> ignore obsrevs cache: hash differs (glob)
+  *> wrote 2 obsrevs to cache (glob)
+
+cache should be invalidated if obsolete marker is rolled back
+
+  $ hg rollback
+  repository tip rolled back to revision 1 (undo debugobsolete)
+  $ hg log -G
+  o  1:7c3bad9141dc (draft) [tip ] add b
+  |
+  @  0:1f0dee641bb7 (public) [ ] add a
+  
+  $ hg blackbox -l 5
+  *> rollback (glob)
+  *> ignore obsrevs cache: hash differs (glob)
+  *> wrote 1 obsrevs to cache (glob)
+  *> log -G (glob)
+  *> read 1 obsrevs from cache (glob)
+
+cache should be invalidated if heads differ because revision may change
+after strip
+
+  $ mkcommit c
+  created new head
+  $ hg debugobsolete `getid b`
+  $ hg update 0
+  0 files updated, 0 files merged, 1 files removed, 0 files unresolved
+  $ hg log -G --hidden
+  o  2:d67cd0334eee (draft) [tip ] add c
+  |
+  | x  1:7c3bad9141dc (draft) [ ] add b
+  |/
+  @  0:1f0dee641bb7 (public) [ ] add a
+  
+  $ hg blackbox -l 2
+  *> log -G --hidden (glob)
+  *> read 2 obsrevs from cache (glob)
+
+  $ hg --hidden --config extensions.strip= strip --no-backup 1
+  $ hg log -G
+  o  1:d67cd0334eee (draft) [tip ] add c
+  |
+  @  0:1f0dee641bb7 (public) [ ] add a
+  
+  $ hg blackbox -l 5
+  *> --hidden strip --no-backup 1 (glob)
+  *> ignore obsrevs cache: hash differs (glob)
+  *> wrote 1 obsrevs to cache (glob)
+  *> log -G (glob)
+  *> read 1 obsrevs from cache (glob)
+
+no error should occur if cache directory is read-only
+
+#if unix-permissions
+
+  $ rm .hg/cache/obsrevs-*
+  $ chmod 555 .hg/cache
+
+  $ hg log -G
+  o  1:d67cd0334eee (draft) [tip ] add c
+  |
+  @  0:1f0dee641bb7 (public) [ ] add a
+  
+  $ hg blackbox -l 3
+  *> log -G (glob)
+  *> failed to read obsrevs from cache (glob)
+  *> failed to write obsrevs to cache (glob)
+
+  $ chmod 755 .hg/cache
+
+#endif
+
+  $ cd ..
+
 Revision 0 is hidden
 --------------------