Patchwork [4,of,4] revlog: separate methods for deltas and key frames compression/decompression

login
register
mail settings
Submitter Wojciech Lopata
Date Sept. 26, 2013, 6:14 p.m.
Message ID <b7528dd2041110229810.1380219285@dev1179.prn1.facebook.com>
Download mbox | patch
Permalink /patch/2646/
State Deferred
Headers show

Comments

Wojciech Lopata - Sept. 26, 2013, 6:14 p.m.
# HG changeset patch
# User Wojciech Lopata <lopek@fb.com>
# Date 1380212200 25200
#      Thu Sep 26 09:16:40 2013 -0700
# Node ID b7528dd2041110229810e71245eefa4d70aca648
# Parent  16ba1a7b494caeb45d24ca81d49bbc866af1e7c5
revlog: separate methods for deltas and key frames compression/decompression

This change makes possible:
  1) to use different compression methods in various revlog subclasses
  2) to use different compression methods for key frames and deltas
Durham Goode - Sept. 26, 2013, 8:15 p.m.
On 9/26/13 11:14 AM, "Wojciech Lopata" <lopek@fb.com> wrote:

># HG changeset patch
># User Wojciech Lopata <lopek@fb.com>
># Date 1380212200 25200
>#      Thu Sep 26 09:16:40 2013 -0700
># Node ID b7528dd2041110229810e71245eefa4d70aca648
># Parent  16ba1a7b494caeb45d24ca81d49bbc866af1e7c5
>revlog: separate methods for deltas and key frames
>compression/decompression
>
>This change makes possible:
>  1) to use different compression methods in various revlog subclasses
>  2) to use different compression methods for key frames and deltas

Do you have numbers showing what benefits an extension could get by doing
key frame and delta compression differently?
Wojciech Lopata - Sept. 26, 2013, 8:17 p.m.
Siddharth noticed this series is not only going to break the lz4revlog extension, but will also make it pretty difficult to fix.

Let me modify it and resend soon. 

> -----Original Message-----
> From: mercurial-devel-bounces@selenic.com [mailto:mercurial-devel-
> bounces@selenic.com] On Behalf Of Wojciech Lopata
> Sent: Thursday, September 26, 2013 11:15 AM
> To: mercurial-devel@selenic.com
> Subject: [PATCH 4 of 4] revlog: separate methods for deltas and key frames
> compression/decompression
> 
> # HG changeset patch
> # User Wojciech Lopata <lopek@fb.com>
> # Date 1380212200 25200
> #      Thu Sep 26 09:16:40 2013 -0700
> # Node ID b7528dd2041110229810e71245eefa4d70aca648
> # Parent  16ba1a7b494caeb45d24ca81d49bbc866af1e7c5
> revlog: separate methods for deltas and key frames
> compression/decompression
> 
> This change makes possible:
>   1) to use different compression methods in various revlog subclasses
>   2) to use different compression methods for key frames and deltas
> 
> diff --git a/mercurial/revlog.py b/mercurial/revlog.py
> --- a/mercurial/revlog.py
> +++ b/mercurial/revlog.py
> @@ -881,11 +881,23 @@
>          length = end - start
>          return self._getchunk(start, length)
> 
> +    def compressdelta(self, delta):
> +        return compress(delta)
> +
> +    def compresskeyframe(self, text):
> +        return compress(text)
> +
> +    def decompressdelta(self, bin):
> +        return decompress(bin)
> +
> +    def decompresskeyframe(self, bin):
> +        return decompress(bin)
> +
>      def _deltachunk(self, rev):
> -        return decompress(self._chunkraw(rev, rev))
> +        return self.decompressdelta(self._chunkraw(rev, rev))
> 
>      def _keyframechunk(self, rev):
> -        return decompress(self._chunkraw(rev, rev))
> +        return self.decompresskeyframe(self._chunkraw(rev, rev))
> 
>      def _deltachunks(self, revs):
>          '''faster version of [self._deltachunk(rev) for rev in revs] @@ -906,6
> +918,7 @@
>          self._chunkraw(revs[0], revs[-1])
>          offset, data = self._chunkcache
> 
> +        decompress = self.decompressdelta
>          for rev in revs:
>              chunkstart = start(rev)
>              if inline:
> @@ -1112,7 +1125,7 @@
>                  t = buildtext()
>                  ptext = self.revision(self.node(rev))
>                  delta = mdiff.textdiff(ptext, t)
> -            data = compress(delta)
> +            data = self.compressdelta(delta)
>              l = len(data[1]) + len(data[0])
>              if basecache[0] == rev:
>                  chainbase = basecache[1] @@ -1158,7 +1171,7 @@
>              textlen = len(text)
>          if d is None or dist > textlen * 2:
>              text = buildtext()
> -            data = compress(text)
> +            data = self.compresskeyframe(text)
>              l = len(data[1]) + len(data[0])
>              base = chainbase = curr
> 
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
Wojciech Lopata - Sept. 26, 2013, 10:32 p.m.
> Do you have numbers showing what benefits an extension could get by doing key frame and delta compression differently?

This vary. The biggest gain of being able to apply different compressions for key frames and deltas I've seen for now was additional 5% manifest size reduction. I believe gain grows as repository grows, but I haven't checked what the gain is on big repositories, since with my draft, pure python implementation such conversion would take days.

> -----Original Message-----
> From: Durham Goode
> Sent: Thursday, September 26, 2013 1:15 PM
> To: Wojciech Lopata; mercurial-devel@selenic.com
> Subject: Re: [PATCH 4 of 4] revlog: separate methods for deltas and key
> frames compression/decompression
> 
> On 9/26/13 11:14 AM, "Wojciech Lopata" <lopek@fb.com> wrote:
> 
> ># HG changeset patch
> ># User Wojciech Lopata <lopek@fb.com>
> ># Date 1380212200 25200
> >#      Thu Sep 26 09:16:40 2013 -0700
> ># Node ID b7528dd2041110229810e71245eefa4d70aca648
> ># Parent  16ba1a7b494caeb45d24ca81d49bbc866af1e7c5
> >revlog: separate methods for deltas and key frames
> >compression/decompression
> >
> >This change makes possible:
> >  1) to use different compression methods in various revlog subclasses
> >  2) to use different compression methods for key frames and deltas
> 
> Do you have numbers showing what benefits an extension could get by
> doing key frame and delta compression differently?
Durham Goode - Sept. 26, 2013, 10:34 p.m.
On 9/26/13 3:32 PM, "Wojciech Lopata" <lopek@fb.com> wrote:

>> Do you have numbers showing what benefits an extension could get by
>>doing key frame and delta compression differently?
>
>This vary. The biggest gain of being able to apply different compressions
>for key frames and deltas I've seen for now was additional 5% manifest
>size reduction. I believe gain grows as repository grows, but I haven't
>checked what the gain is on big repositories, since with my draft, pure
>python implementation such conversion would take days.


But why can we not get the same gain from compressing both key frames and
deltas the same?  Why would we ever want to apply some compression to a
key frame, but not to a delta?
Wojciech Lopata - Sept. 26, 2013, 10:48 p.m.
> But why can we not get the same gain from compressing both key frames and deltas the same?  Why would we ever want to apply some compression to a key frame, but not to a delta?

I probably misused 'compression' term.

I need to introduce additional step that would occur after new revision is added to a revlog, but before gziping the revision text.

This will let me to apply 'stem compression' (described in http://mercurial.selenic.com/wiki/ImprovingManifestCompressionPlan) to key frames only.

Implementing stem compression on the manifest level (before adding a revision to a revlog) actually bums manifest size, since it increases size of deltas in some cases (when new file is added or renamed).

I will not call it compression any more.

> -----Original Message-----
> From: Durham Goode
> Sent: Thursday, September 26, 2013 3:35 PM
> To: Wojciech Lopata; mercurial-devel@selenic.com
> Subject: Re: [PATCH 4 of 4] revlog: separate methods for deltas and key
> frames compression/decompression
> 
> On 9/26/13 3:32 PM, "Wojciech Lopata" <lopek@fb.com> wrote:
> 
> >> Do you have numbers showing what benefits an extension could get by
> >>doing key frame and delta compression differently?
> >
> >This vary. The biggest gain of being able to apply different
> >compressions for key frames and deltas I've seen for now was additional
> >5% manifest size reduction. I believe gain grows as repository grows,
> >but I haven't checked what the gain is on big repositories, since with
> >my draft, pure python implementation such conversion would take days.
> 
> 
> But why can we not get the same gain from compressing both key frames
> and deltas the same?  Why would we ever want to apply some compression
> to a key frame, but not to a delta?
Durham Goode - Sept. 27, 2013, 1:48 a.m.
On 9/26/13 3:48 PM, "Wojciech Lopata" <lopek@fb.com> wrote:

>> But why can we not get the same gain from compressing both key frames
>>and deltas the same?  Why would we ever want to apply some compression
>>to a key frame, but not to a delta?
>
>I probably misused 'compression' term.
>
>I need to introduce additional step that would occur after new revision
>is added to a revlog, but before gziping the revision text.
>
>This will let me to apply 'stem compression' (described in
>http://mercurial.selenic.com/wiki/ImprovingManifestCompressionPlan) to
>key frames only.
>
>Implementing stem compression on the manifest level (before adding a
>revision to a revlog) actually bums manifest size, since it increases
>size of deltas in some cases (when new file is added or renamed).
>
>I will not call it compression any more.

From talking to Wojciech in person, there are certain compression
algorithms that can be applied to key frames but not to deltas, and if
they were applied to the actual manifest (prior to deltafication) can
result in larger deltas overall.  So splitting these functions up allows
us to compress key frames on their own.  Stem compressions (as described
in the Manifest compression wiki page) is an example of this.

I would still like to see what the actual benefits are (i.e. numbers) are
before adding the extra complexity to revlog.
Wojciech Lopata - Sept. 28, 2013, 7:14 p.m.
Numbers for Mozilla repository:

Regular manifest: 303752 (100%)
Delta friendly line breaks + binary node strings: 234016 (77.04%)
DFLB + BNS + stem compression: 230116 (75.75%)
	
So in this case we get additional 1.3% size reduction. Full revisions constitute 10% of mozilla's manifest, but in some cases it is about 20%, so the additional gain would be about 2.5%.

It's not a lot, but this is an easy gain on the other hand I believe.

> -----Original Message-----
> From: Durham Goode
> Sent: Thursday, September 26, 2013 6:48 PM
> To: Wojciech Lopata; mercurial-devel@selenic.com
> Subject: Re: [PATCH 4 of 4] revlog: separate methods for deltas and key
> frames compression/decompression
> 
> On 9/26/13 3:48 PM, "Wojciech Lopata" <lopek@fb.com> wrote:
> 
> >> But why can we not get the same gain from compressing both key frames
> >>and deltas the same?  Why would we ever want to apply some
> compression
> >>to a key frame, but not to a delta?
> >
> >I probably misused 'compression' term.
> >
> >I need to introduce additional step that would occur after new revision
> >is added to a revlog, but before gziping the revision text.
> >
> >This will let me to apply 'stem compression' (described in
> >http://mercurial.selenic.com/wiki/ImprovingManifestCompressionPlan) to
> >key frames only.
> >
> >Implementing stem compression on the manifest level (before adding a
> >revision to a revlog) actually bums manifest size, since it increases
> >size of deltas in some cases (when new file is added or renamed).
> >
> >I will not call it compression any more.
> 
> From talking to Wojciech in person, there are certain compression algorithms
> that can be applied to key frames but not to deltas, and if they were applied
> to the actual manifest (prior to deltafication) can result in larger deltas
> overall.  So splitting these functions up allows us to compress key frames on
> their own.  Stem compressions (as described in the Manifest compression
> wiki page) is an example of this.
> 
> I would still like to see what the actual benefits are (i.e. numbers) are before
> adding the extra complexity to revlog.
Durham Goode - Sept. 30, 2013, 6:49 p.m.
On 9/28/13 12:14 PM, "Wojciech Lopata" <lopek@fb.com> wrote:

>Numbers for Mozilla repository:
>
>Regular manifest: 303752 (100%)
>Delta friendly line breaks + binary node strings: 234016 (77.04%)
>DFLB + BNS + stem compression: 230116 (75.75%)
>	
>So in this case we get additional 1.3% size reduction. Full revisions
>constitute 10% of mozilla's manifest, but in some cases it is about 20%,
>so the additional gain would be about 2.5%.
>
>It's not a lot, but this is an easy gain on the other hand I believe.
>

I don't think a potential 2% is worth the extra complexity to the revlog
class. So I'd vote for leaving this out of core Mercurial.
Augie Fackler - Sept. 30, 2013, 7:26 p.m.
On Mon, Sep 30, 2013 at 06:49:09PM +0000, Durham Goode wrote:
> On 9/28/13 12:14 PM, "Wojciech Lopata" <lopek@fb.com> wrote:
>
> >Numbers for Mozilla repository:
> >
> >Regular manifest: 303752 (100%)
> >Delta friendly line breaks + binary node strings: 234016 (77.04%)
> >DFLB + BNS + stem compression: 230116 (75.75%)
> >
> >So in this case we get additional 1.3% size reduction. Full revisions
> >constitute 10% of mozilla's manifest, but in some cases it is about 20%,
> >so the additional gain would be about 2.5%.
> >
> >It's not a lot, but this is an easy gain on the other hand I believe.
> >
>
> I don't think a potential 2% is worth the extra complexity to the revlog
> class. So I'd vote for leaving this out of core Mercurial.

So far I agree.

>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel

Patch

diff --git a/mercurial/revlog.py b/mercurial/revlog.py
--- a/mercurial/revlog.py
+++ b/mercurial/revlog.py
@@ -881,11 +881,23 @@ 
         length = end - start
         return self._getchunk(start, length)
 
+    def compressdelta(self, delta):
+        return compress(delta)
+
+    def compresskeyframe(self, text):
+        return compress(text)
+
+    def decompressdelta(self, bin):
+        return decompress(bin)
+
+    def decompresskeyframe(self, bin):
+        return decompress(bin)
+
     def _deltachunk(self, rev):
-        return decompress(self._chunkraw(rev, rev))
+        return self.decompressdelta(self._chunkraw(rev, rev))
 
     def _keyframechunk(self, rev):
-        return decompress(self._chunkraw(rev, rev))
+        return self.decompresskeyframe(self._chunkraw(rev, rev))
 
     def _deltachunks(self, revs):
         '''faster version of [self._deltachunk(rev) for rev in revs]
@@ -906,6 +918,7 @@ 
         self._chunkraw(revs[0], revs[-1])
         offset, data = self._chunkcache
 
+        decompress = self.decompressdelta
         for rev in revs:
             chunkstart = start(rev)
             if inline:
@@ -1112,7 +1125,7 @@ 
                 t = buildtext()
                 ptext = self.revision(self.node(rev))
                 delta = mdiff.textdiff(ptext, t)
-            data = compress(delta)
+            data = self.compressdelta(delta)
             l = len(data[1]) + len(data[0])
             if basecache[0] == rev:
                 chainbase = basecache[1]
@@ -1158,7 +1171,7 @@ 
             textlen = len(text)
         if d is None or dist > textlen * 2:
             text = buildtext()
-            data = compress(text)
+            data = self.compresskeyframe(text)
             l = len(data[1]) + len(data[0])
             base = chainbase = curr