Patchwork [2,of,2,RFC] revlog: establish a cache of decompressed revision chunks

login
register
mail settings
Submitter Gregory Szorc
Date Nov. 23, 2015, 6:28 a.m.
Message ID <5ed02a9105be48e7a595.1448260131@ubuntu-main>
Download mbox | patch
Permalink /patch/11599/
State Deferred
Headers show

Comments

Gregory Szorc - Nov. 23, 2015, 6:28 a.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1448257271 28800
#      Sun Nov 22 21:41:11 2015 -0800
# Node ID 5ed02a9105be48e7a595bb8e6168dceccdad7a50
# Parent  5d0aff0739984ebdcc5c99a84d338749cd702119
revlog: establish a cache of decompressed revision chunks

(THIS PATCH IS NOT APPROPRIATE FOR LANDING.)

Profiling reveals that we spend a lot of time doing zlib decompression
on revlogs. Furthermore, while we cache the raw, compressed revlog
entries, we don't cache the decompressed output.

This patch introduces a LRU cache for the raw content of decompressed
revisions. It isn't suitable for checkin because there are no settings
to tune it. The default settings for manifest caching will likely
consume way too much memory on insanely large repositories. Also, we
should investigate caching multiple fulltext entries instead of the
bdiffs, as bdiff processing is not cheap.

This patch has a profound impact on performance.

On mozilla-central (non-gd):

`hg log -r 'first(reverse(::tip), 100)' -p`
before: 30.93s
after:   7.57s
delta: -23.36s (24.5% of original)

`hg perfmanifest 51fa3e0d4f7b`
before: wall 0.287174 comb 0.290000 user 0.290000 sys 0.000000 (best of 35)
after:  wall 0.046735 comb 0.050000 user 0.050000 sys 0.000000 (best of 100)
delta:      -0.240439 (16.3% of original)
(this test is cheating because it just accesses the manifest revision over and
over)

`hg log -r 0:tip`
before: 22.48s
after:  21.00s
delta:  -1.48s (93.4% of original)

On a generaldelta conversion of mozilla-central:

`hg log -r 'first(reverse(::tip), 100)' -p`
before: 42.13s
after:  14.02s
delta: -28.11s (33.3% or original)

`hg perfmanifest 51fa3e0d4f7b`
before: wall 0.400207 comb 0.400000 user 0.390000 sys 0.010000 (best of 25)
after:  wall 0.107959 comb 0.110000 user 0.110000 sys 0.000000 (best of 87)

The reason generaldelta is slower is because the delta chains for this
repo are very long. Max chain is in the 50,000 range! This is because I
used format.aggressivemergedeltas and that mode is insane at keeping
deltas small. These long delta chains also expose another weakness in this
patch: tail chasing cache eviction. If our delta chains are longer than
the cache size, we won't get cache hits, even if the cache originally
contained items relevant to us. We may want to perform an additional
walk to find all cache entries before potentially evicting them.

Other things that need addressed before this lands:

* Convert revchunk cache to lrucachedict? We currently storing buffers
  and returning views into them then slicing that further. Not sure why
  this is done.
* Kill the revchunk cache? Seems pretty worthless if we're caching its
  decompressed results as well.
* Cache resolved fulltexts instead?
* Make cache settings configurable.
* Experiment with different default settings.
* Make cache size limited by size of content not number of elements.
* Investigate actual memory consumption.
* Lazy initialize nodes in lrucachedict (don't preallocate because this
  is overhead).
* Investigate potential over chunk fetching in revlog.revision()
  (looking at raw requests it appears we're dropping base fulltexts
  more than we should)
Gregory Szorc - Nov. 23, 2015, 6:59 a.m.
On Sun, Nov 22, 2015 at 10:28 PM, Gregory Szorc <gregory.szorc@gmail.com>
wrote:

> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1448257271 28800
> #      Sun Nov 22 21:41:11 2015 -0800
> # Node ID 5ed02a9105be48e7a595bb8e6168dceccdad7a50
> # Parent  5d0aff0739984ebdcc5c99a84d338749cd702119
> revlog: establish a cache of decompressed revision chunks
>
> (THIS PATCH IS NOT APPROPRIATE FOR LANDING.)
>
> Profiling reveals that we spend a lot of time doing zlib decompression
> on revlogs. Furthermore, while we cache the raw, compressed revlog
> entries, we don't cache the decompressed output.
>
> This patch introduces a LRU cache for the raw content of decompressed
> revisions. It isn't suitable for checkin because there are no settings
> to tune it. The default settings for manifest caching will likely
> consume way too much memory on insanely large repositories. Also, we
> should investigate caching multiple fulltext entries instead of the
> bdiffs, as bdiff processing is not cheap.
>
> This patch has a profound impact on performance.
>
> On mozilla-central (non-gd):
>
> `hg log -r 'first(reverse(::tip), 100)' -p`
> before: 30.93s
> after:   7.57s
> delta: -23.36s (24.5% of original)
>

For the curious, this command executes in 4.90s when the repository is
converted to lz4. At that point statprof says SHA-1 hashing, manifest
diffing, diff generation, and revlog reading are where the remaining time
is spent. Color me surprised that SHA-1 calculation is taking that long.
But if you do the math, we're passing 1,367,257,372 bytes into this
function. So, uh, 1+ GB/s sounds reasonable, especially for Python. Perhaps
I'll need to put an LRU cache on SHA-1 hashing :)

  %   cumulative      self
 time    seconds   seconds  name
 20.07      1.01      1.01  revlog.py:79:hash
 17.68      0.89      0.89  revlog.py:1213:revision
 13.08      0.66      0.66  manifest.py:293:diff
  5.89      0.30      0.30  mdiff.py:121:allblocks
  4.42      0.22      0.22  mdiff.py:14:splitnewlines
  3.68      0.18      0.18  manifest.py:312:copy
  2.76      0.70      0.14  revlog.py:1084:grabchunks
  2.39      0.12      0.12  lz4revlog.py:82:decompress
  1.84      0.09      0.09  util.py:500:__getitem__
  1.29      0.06      0.06  util.py:553:_recordaccess
  1.29      0.06      0.06  revlog.py:1196:revision
  1.29      0.06      0.06  util.py:552:_recordaccess
  1.29      0.06      0.06  util.py:555:_recordaccess
  1.29      0.06      0.06  manifest.py:968:read
  1.10      0.06      0.06  util.py:554:_recordaccess



>
> `hg perfmanifest 51fa3e0d4f7b`
> before: wall 0.287174 comb 0.290000 user 0.290000 sys 0.000000 (best of 35)
> after:  wall 0.046735 comb 0.050000 user 0.050000 sys 0.000000 (best of
> 100)
> delta:      -0.240439 (16.3% of original)
> (this test is cheating because it just accesses the manifest revision over
> and
> over)
>
> `hg log -r 0:tip`
> before: 22.48s
> after:  21.00s
> delta:  -1.48s (93.4% of original)
>
> On a generaldelta conversion of mozilla-central:
>
> `hg log -r 'first(reverse(::tip), 100)' -p`
> before: 42.13s
> after:  14.02s
> delta: -28.11s (33.3% or original)
>
> `hg perfmanifest 51fa3e0d4f7b`
> before: wall 0.400207 comb 0.400000 user 0.390000 sys 0.010000 (best of 25)
> after:  wall 0.107959 comb 0.110000 user 0.110000 sys 0.000000 (best of 87)
>
> The reason generaldelta is slower is because the delta chains for this
> repo are very long. Max chain is in the 50,000 range! This is because I
> used format.aggressivemergedeltas and that mode is insane at keeping
> deltas small. These long delta chains also expose another weakness in this
> patch: tail chasing cache eviction. If our delta chains are longer than
> the cache size, we won't get cache hits, even if the cache originally
> contained items relevant to us. We may want to perform an additional
> walk to find all cache entries before potentially evicting them.
>
> Other things that need addressed before this lands:
>
> * Convert revchunk cache to lrucachedict? We currently storing buffers
>   and returning views into them then slicing that further. Not sure why
>   this is done.
> * Kill the revchunk cache? Seems pretty worthless if we're caching its
>   decompressed results as well.
> * Cache resolved fulltexts instead?
> * Make cache settings configurable.
> * Experiment with different default settings.
> * Make cache size limited by size of content not number of elements.
> * Investigate actual memory consumption.
> * Lazy initialize nodes in lrucachedict (don't preallocate because this
>   is overhead).
> * Investigate potential over chunk fetching in revlog.revision()
>   (looking at raw requests it appears we're dropping base fulltexts
>   more than we should)
>
> diff --git a/mercurial/manifest.py b/mercurial/manifest.py
> --- a/mercurial/manifest.py
> +++ b/mercurial/manifest.py
> @@ -862,16 +862,18 @@ class treemanifest(object):
>          for d, subm in self._dirs.iteritems():
>              subp1 = m1._dirs.get(d, emptytree)._node
>              subp2 = m2._dirs.get(d, emptytree)._node
>              if subp1 == revlog.nullid:
>                  subp1, subp2 = subp2, subp1
>              writesubtree(subm, subp1, subp2)
>
>  class manifest(revlog.revlog):
> +    DECOMPRESSEDCACHESIZE = 20000
> +
>      def __init__(self, opener, dir='', dirlogcache=None):
>          '''The 'dir' and 'dirlogcache' arguments are for internal use by
>          manifest.manifest only. External users should create a root
> manifest
>          log with manifest.manifest(opener) and call dirlog() on it.
>          '''
>          # During normal operations, we expect to deal with not more than
> four
>          # revs at a time (such as during commit --amend). When rebasing
> large
>          # stacks of commits, the number can go up, hence the config knob
> below.
> diff --git a/mercurial/revlog.py b/mercurial/revlog.py
> --- a/mercurial/revlog.py
> +++ b/mercurial/revlog.py
> @@ -190,16 +190,18 @@ class revlog(object):
>      version data. This makes retrieval of a version proportional to
>      its size, or O(1) relative to the number of revisions.
>
>      Both pieces of the revlog are written to in an append-only
>      fashion, which means we never need to rewrite a file to insert or
>      remove data, and can use some simple techniques to avoid the need
>      for locking while reading.
>      """
> +    DECOMPRESSEDCACHESIZE = 10
> +
>      def __init__(self, opener, indexfile):
>          """
>          create a revlog object
>
>          opener is a function that abstracts the file opening operation
>          and can be used to implement COW semantics or the like.
>          """
>          self.indexfile = indexfile
> @@ -209,16 +211,17 @@ class revlog(object):
>          self._cache = None
>          # 2-tuple of (rev, baserev) defining the base revision the delta
> chain
>          # begins at for a revision.
>          self._basecache = None
>          # 2-tuple of (offset, data) of raw data from the revlog at an
> offset.
>          self._chunkcache = (0, '')
>          # How much data to read and cache into the raw revlog data cache.
>          self._chunkcachesize = 65536
> +        self._decompressedcache =
> util.lrucachedict(self.DECOMPRESSEDCACHESIZE)
>          self._maxchainlen = None
>          self._aggressivemergedeltas = False
>          self.index = []
>          # Mapping of partial identifiers to full nodes.
>          self._pcache = {}
>          # Mapping of revision integer to full node.
>          self._nodecache = {nullid: nullrev}
>          self._nodepos = None
> @@ -1055,38 +1058,47 @@ class revlog(object):
>          """
>          if not revs:
>              return []
>          start = self.start
>          length = self.length
>          inline = self._inline
>          iosize = self._io.size
>          buffer = util.buffer
> +        decompressedcache = self._decompressedcache
>
>          l = []
>          ladd = l.append
>
>          # Historically, the list of revs for a delta chain has been
> contiguous,
>          # equivalent to ``range(revs[0], revs[-1])``. With generaldelta,
> the
>          # chain can have holes. If we performed a single request for all
> revs
>          # between base and tip, we could pull in data for several
> unrelated
>          # revisions.
>          #
>          # Our strategy is to walk the revisions and obtain only the data
> for
>          # revisions we care about.
>          def grabchunks(startrev, endrev):
>              try:
>                  offset, data = self._chunkraw(startrev, endrev, df=df)
>                  for rev in range(startrev, endrev + 1):
> +                    try:
> +                        ladd(decompressedcache[rev])
> +                        continue
> +                    except KeyError:
> +                        pass
> +
>                      chunkstart = start(rev)
>                      if inline:
>                          chunkstart += (rev + 1) * iosize
>                      chunklength = length(rev)
>                      revchunk = buffer(data, chunkstart - offset,
> chunklength)
> -                    ladd(decompress(revchunk))
> +                    decompressed = decompress(revchunk)
> +                    decompressedcache[rev] = decompressed
> +                    ladd(decompressed)
>              except OverflowError:
>                  # issue4215 - we can't cache a run of chunks greater than
>                  # 2G on Windows.
>                  for rev in range(startrev, endrev + 1):
>                      ladd(self._chunk(rev, df=df))
>
>          maxindex = len(revs) - 1
>          lastrev = revs[0] - 1
>

Patch

diff --git a/mercurial/manifest.py b/mercurial/manifest.py
--- a/mercurial/manifest.py
+++ b/mercurial/manifest.py
@@ -862,16 +862,18 @@  class treemanifest(object):
         for d, subm in self._dirs.iteritems():
             subp1 = m1._dirs.get(d, emptytree)._node
             subp2 = m2._dirs.get(d, emptytree)._node
             if subp1 == revlog.nullid:
                 subp1, subp2 = subp2, subp1
             writesubtree(subm, subp1, subp2)
 
 class manifest(revlog.revlog):
+    DECOMPRESSEDCACHESIZE = 20000
+
     def __init__(self, opener, dir='', dirlogcache=None):
         '''The 'dir' and 'dirlogcache' arguments are for internal use by
         manifest.manifest only. External users should create a root manifest
         log with manifest.manifest(opener) and call dirlog() on it.
         '''
         # During normal operations, we expect to deal with not more than four
         # revs at a time (such as during commit --amend). When rebasing large
         # stacks of commits, the number can go up, hence the config knob below.
diff --git a/mercurial/revlog.py b/mercurial/revlog.py
--- a/mercurial/revlog.py
+++ b/mercurial/revlog.py
@@ -190,16 +190,18 @@  class revlog(object):
     version data. This makes retrieval of a version proportional to
     its size, or O(1) relative to the number of revisions.
 
     Both pieces of the revlog are written to in an append-only
     fashion, which means we never need to rewrite a file to insert or
     remove data, and can use some simple techniques to avoid the need
     for locking while reading.
     """
+    DECOMPRESSEDCACHESIZE = 10
+
     def __init__(self, opener, indexfile):
         """
         create a revlog object
 
         opener is a function that abstracts the file opening operation
         and can be used to implement COW semantics or the like.
         """
         self.indexfile = indexfile
@@ -209,16 +211,17 @@  class revlog(object):
         self._cache = None
         # 2-tuple of (rev, baserev) defining the base revision the delta chain
         # begins at for a revision.
         self._basecache = None
         # 2-tuple of (offset, data) of raw data from the revlog at an offset.
         self._chunkcache = (0, '')
         # How much data to read and cache into the raw revlog data cache.
         self._chunkcachesize = 65536
+        self._decompressedcache = util.lrucachedict(self.DECOMPRESSEDCACHESIZE)
         self._maxchainlen = None
         self._aggressivemergedeltas = False
         self.index = []
         # Mapping of partial identifiers to full nodes.
         self._pcache = {}
         # Mapping of revision integer to full node.
         self._nodecache = {nullid: nullrev}
         self._nodepos = None
@@ -1055,38 +1058,47 @@  class revlog(object):
         """
         if not revs:
             return []
         start = self.start
         length = self.length
         inline = self._inline
         iosize = self._io.size
         buffer = util.buffer
+        decompressedcache = self._decompressedcache
 
         l = []
         ladd = l.append
 
         # Historically, the list of revs for a delta chain has been contiguous,
         # equivalent to ``range(revs[0], revs[-1])``. With generaldelta, the
         # chain can have holes. If we performed a single request for all revs
         # between base and tip, we could pull in data for several unrelated
         # revisions.
         #
         # Our strategy is to walk the revisions and obtain only the data for
         # revisions we care about.
         def grabchunks(startrev, endrev):
             try:
                 offset, data = self._chunkraw(startrev, endrev, df=df)
                 for rev in range(startrev, endrev + 1):
+                    try:
+                        ladd(decompressedcache[rev])
+                        continue
+                    except KeyError:
+                        pass
+
                     chunkstart = start(rev)
                     if inline:
                         chunkstart += (rev + 1) * iosize
                     chunklength = length(rev)
                     revchunk = buffer(data, chunkstart - offset, chunklength)
-                    ladd(decompress(revchunk))
+                    decompressed = decompress(revchunk)
+                    decompressedcache[rev] = decompressed
+                    ladd(decompressed)
             except OverflowError:
                 # issue4215 - we can't cache a run of chunks greater than
                 # 2G on Windows.
                 for rev in range(startrev, endrev + 1):
                     ladd(self._chunk(rev, df=df))
 
         maxindex = len(revs) - 1
         lastrev = revs[0] - 1