Patchwork [7,of,8,zstd-revlogs] revlog: use compression engine APIs for decompression

Submitter Gregory Szorc
Date Jan. 2, 2017, 11:57 p.m.
Message ID <96837453d041fe7289dd.1483401478@ubuntu-vm-main>
Permalink /patch/18081/
State Changes Requested
Headers show

Comments

Gregory Szorc - Jan. 2, 2017, 11:57 p.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1483395249 28800
#      Mon Jan 02 14:14:09 2017 -0800
# Node ID 96837453d041fe7289dd3e6450aee4961517c5b5
# Parent  da38ab97f1f4d312966431a16d1f2acb90b75242
revlog: use compression engine APIs for decompression

Now that compression engines declare their header in revlog chunks
and can decompress revlog chunks, we refactor revlog.decompress()
to use them.

Making full use of the property that revlog compressor objects are
reusable, revlog instances now maintain a dict mapping an engine's
revlog header to a compressor object. This is not only a performance
optimization for engines where compressor object reuse can result in
better performance, but it also serves as a cache of header values
so we don't need to perform redundant lookups against the compression
engine manager. (Yes, I measured and the overhead of a function call
versus a dict lookup was observed.)
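
The caching pattern described above can be sketched in a few lines. This is an illustrative toy, not Mercurial's actual classes: `_ZlibEngine`, `_ENGINES`, and the `Revlog` stub are hypothetical stand-ins for the engine manager and revlog type.

```python
import zlib

class _ZlibEngine:
    """Stand-in for a compression engine whose decompressor is reusable."""
    def revlogcompressor(self):
        return self

    def decompress(self, data):
        return zlib.decompress(data)

# the engine-manager lookup we want to avoid repeating on the hot path
_ENGINES = {b'x': _ZlibEngine()}  # b'x' (0x78) is zlib's revlog header byte

class Revlog:
    def __init__(self):
        # revlog header -> reusable compressor object (the cache)
        self._decompressors = {}

    def decompress(self, data):
        t = data[:1]
        try:
            # fast path: header already seen, reuse the cached compressor
            compressor = self._decompressors[t]
        except KeyError:
            # slow path runs at most once per distinct header
            engine = _ENGINES[t]
            compressor = engine.revlogcompressor()
            self._decompressors[t] = compressor
        return compressor.decompress(data)

rl = Revlog()
chunk = zlib.compress(b'hello revlog')
print(rl.decompress(chunk))  # b'hello revlog'
```

After the first chunk with a given header, every later chunk pays only a dict lookup plus a method call.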

This change effectively replaces manual, inline lookup code with a
dict lookup plus extra function call where compression is actually
used. On the mozilla-unified repository:

$ hg perfrevlogchunks -c
! chunk
! wall 1.955183 comb 1.960000 user 1.930000 sys 0.030000 (best of 6)
! wall 2.003241 comb 2.000000 user 1.960000 sys 0.040000 (best of 5)
! chunk batch
! wall 1.774444 comb 1.770000 user 1.750000 sys 0.020000 (best of 6)
! wall 1.866939 comb 1.850000 user 1.840000 sys 0.010000 (best of 6)

$ hg perfrevlogchunks -m
! chunk
! wall 2.592354 comb 2.600000 user 2.560000 sys 0.040000 (best of 4)
! wall 2.733626 comb 2.730000 user 2.670000 sys 0.060000 (best of 4)
! chunk batch
! wall 2.469384 comb 2.480000 user 2.420000 sys 0.060000 (best of 4)
! wall 2.580613 comb 2.590000 user 2.550000 sys 0.040000 (best of 4)

We can see this change has a negative impact on performance.

There are 360,210 changelog entries (359,658 or 99.85% zlib compressed
and 552 with the 'u' header). The mean time per decompress() in
non-batch mode increased from 5.428us to 5.561us, a difference of
0.13us or 2.5%.

There are 359,342 manifest entries (242,566 or 67.50% zlib compressed,
116,602 or 32.45% with '\0' header, and 174 or 0.05% that are empty or
have the 'u' header). The mean time per decompress() in non-batch mode
increased from 7.214us to 7.607us, a difference of 0.393us or 5.4%.
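
As a sanity check, the per-call figures above are just the non-batch wall times divided by the entry counts:

```python
# non-batch wall times (seconds) divided by entry counts from above
changelog_before = 1.955183 / 360210   # seconds per decompress()
changelog_after = 2.003241 / 360210
manifest_before = 2.592354 / 359342
manifest_after = 2.733626 / 359342

print('changelog: %.3fus -> %.3fus'
      % (changelog_before * 1e6, changelog_after * 1e6))  # 5.428us -> 5.561us
print('manifest:  %.3fus -> %.3fus'
      % (manifest_before * 1e6, manifest_after * 1e6))    # 7.214us -> 7.607us
```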

If we reinsert inline code for "if t == 'x': return decompress(data)",
performance remains unchanged or even becomes faster than before
(if we avoid the checks for '\0' and 'u' before 'x'). My conclusion is
that attribute lookups and function calls in Python can matter for
performance. We've known this to be true, which is why tight loops
leverage local variables to avoid attribute lookups. But I'll be honest:
I was surprised to see this come into play for decompression: I would
have expected time spent in the decompressor to dwarf Python time.

Anyway, `hg perfrevlogchunks` is a microbenchmark. It's a good tool
to measure raw performance, but it may not be indicative of real world
behavior. For example, the performance loss on manifests as a
percentage is troubling. But very few operations need to decompress
every manifest chunk. And when they do, they typically have to perform
other CPU intensive behavior like applying the diff. So the performance
loss on manifests doesn't concern me as much.

What about changelogs? Again, there are only so many operations that
need to read every changelog revision. Fortunately, we have other
perf commands to measure the impact on these operations:

$ hg perfrevlogrevision -c -d 1
! wall 4.624776 comb 4.620000 user 4.600000 sys 0.020000 (best of 3)
! wall 4.711967 comb 4.710000 user 4.700000 sys 0.010000 (best of 3)

$ hg perfrevset 'author(mpm)'
! wall 9.611942 comb 9.600000 user 9.570000 sys 0.030000 (best of 3)
! wall 9.915843 comb 9.900000 user 9.860000 sys 0.040000 (best of 3)

$ hg perfrevset 'author(mpm)' --contexts
! wall 9.930018 comb 9.930000 user 9.900000 sys 0.030000 (best of 3)
! wall 10.007671 comb 10.000000 user 9.970000 sys 0.030000 (best of 3)

Again, we see performance regressions.

I feel bad about introducing a performance regression. But I can
argue that's the price to pay for an abstraction. I suppose we could
restore the inline code for zlib (and presumably zstd someday) so we
don't go through the compression engine API. We could have "tier 1"
compression formats where we support fast/inline operations while still
allowing extensions to implement their own, fully capable compression
engines.
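
A minimal sketch of what that "tier 1" fast path could look like, assuming the same revlog header conventions as the patch; the module-level `_decompressors` dict and `decompress` function are hypothetical, not actual Mercurial code:

```python
import zlib

# extensions would register engine-provided compressors here (tier 2)
_decompressors = {}  # hypothetical: revlog header -> compressor object

def decompress(data):
    t = data[:1]
    if t == b'x':   # tier 1: hardcoded zlib fast path, no engine dispatch
        return zlib.decompress(data)
    if t == b'\0':  # chunk stored as raw inline data, returned unchanged
        return data
    if t == b'u':   # literal data behind a one-byte 'u' marker
        return data[1:]
    # everything else goes through the pluggable engine API
    return _decompressors[t].decompress(data)

print(decompress(zlib.compress(b'abc')))  # b'abc'
print(decompress(b'uliteral'))            # b'literal'
```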
Pierre-Yves David - Jan. 13, 2017, 2:38 p.m.
On 01/03/2017 12:57 AM, Gregory Szorc wrote:
> revlog: use compression engine APIs for decompression
> [...]

The regression percentages are noticeable enough (I've seen patches for 
less) that I think hardcoding the known common case makes sense to avoid 
the regression.
That is not super nice, but it provides a good way to move forward (in 
that great direction).



Patch

diff --git a/mercurial/revlog.py b/mercurial/revlog.py
--- a/mercurial/revlog.py
+++ b/mercurial/revlog.py
@@ -18,7 +18,6 @@ import errno
 import hashlib
 import os
 import struct
-import zlib
 
 # import stuff from node for others to import from revlog
 from .node import (
@@ -39,7 +38,6 @@ from . import (
 
 _pack = struct.pack
 _unpack = struct.unpack
-_decompress = zlib.decompress
 
 # revlog header flags
 REVLOGV0 = 0
@@ -298,6 +296,8 @@ class revlog(object):
             self._chunkclear()
         # revnum -> (chain-length, sum-delta-length)
         self._chaininfocache = {}
+        # revlog header -> revlog compressor
+        self._decompressors = {}
 
     @util.propertycache
     def _compressor(self):
@@ -1374,17 +1374,43 @@ class revlog(object):
         """
         if not data:
             return data
+
+        # Revlogs are read much more frequently than they are written, so
+        # performance of this function is important.
+        #
+        # We can make a few assumptions about revlogs:
+        #
+        # 1) the majority of chunks will be compressed (as opposed to inline
+        #    raw data).
+        # 2) decompressing *any* data will likely be at least 10x slower than
+        #    returning raw inline data.
+        #
+        # It follows that we want to optimize the "decompress compressed data"
+        # case over "return inline data." The code below does this by putting
+        # the self._decompressors lookup first and *then* checks for our
+        # special cases of the ``\0`` and ``u`` headers.
+        #
+        # For cases where data is actually compressed, this avoids comparison
+        # against the "null" decompressors and allows us to dispatch to the
+        # decompressor quicker. According to `hg perfrevlogchunks`, this is
+        # ~0.5% faster.
         t = data[0]
-        if t == '\0':
-            return data
-        if t == 'x':
+        try:
+            compressor = self._decompressors[t]
+        except KeyError:
+            if t == '\0':
+                return data
+            elif t == 'u':
+                return util.buffer(data, 1)
+
             try:
-                return _decompress(data)
-            except zlib.error as e:
-                raise RevlogError(_('revlog decompress error: %s') % str(e))
-        if t == 'u':
-            return util.buffer(data, 1)
-        raise RevlogError(_('unknown compression type %r') % t)
+                engine = util.compengines.forrevlogheader(t)
+                compressor = engine.revlogcompressor()
+                self._decompressors[t] = compressor
+            except KeyError:
+                raise RevlogError(_('unknown compression type %r') % t)
+
+        return compressor.decompress(data)
 
     def _isgooddelta(self, d, textlen):
         """Returns True if the given delta is good. Good means that it is within