Submitter | Wojciech Lopata |
---|---|
Date | Sept. 26, 2013, 6:14 p.m. |
Message ID | <b7528dd2041110229810.1380219285@dev1179.prn1.facebook.com> |
Download | mbox | patch |
Permalink | /patch/2646/ |
State | Deferred |
Headers | show |
Comments
On 9/26/13 11:14 AM, "Wojciech Lopata" <lopek@fb.com> wrote: ># HG changeset patch ># User Wojciech Lopata <lopek@fb.com> ># Date 1380212200 25200 ># Thu Sep 26 09:16:40 2013 -0700 ># Node ID b7528dd2041110229810e71245eefa4d70aca648 ># Parent 16ba1a7b494caeb45d24ca81d49bbc866af1e7c5 >revlog: separate methods for deltas and key frames >compression/decompression > >This change makes possible: > 1) to use different compression methods in various revlog subclasses > 2) to use different compression methods for key frames and deltas Do you have numbers showing what benefits an extension could get by doing key frame and delta compression differently?
Siddharth noticed this series is not only going to break the lz4revlog extension, but will also make it pretty difficult to fix. Let me modify it and resend soon. > -----Original Message----- > From: mercurial-devel-bounces@selenic.com [mailto:mercurial-devel- > bounces@selenic.com] On Behalf Of Wojciech Lopata > Sent: Thursday, September 26, 2013 11:15 AM > To: mercurial-devel@selenic.com > Subject: [PATCH 4 of 4] revlog: separate methods for deltas and key frames > compression/decompression > > # HG changeset patch > # User Wojciech Lopata <lopek@fb.com> > # Date 1380212200 25200 > # Thu Sep 26 09:16:40 2013 -0700 > # Node ID b7528dd2041110229810e71245eefa4d70aca648 > # Parent 16ba1a7b494caeb45d24ca81d49bbc866af1e7c5 > revlog: separate methods for deltas and key frames > compression/decompression > > This change makes possible: > 1) to use different compression methods in various revlog subclasses > 2) to use different compression methods for key frames and deltas > > diff --git a/mercurial/revlog.py b/mercurial/revlog.py > --- a/mercurial/revlog.py > +++ b/mercurial/revlog.py > @@ -881,11 +881,23 @@ > length = end - start > return self._getchunk(start, length) > > + def compressdelta(self, delta): > + return compress(delta) > + > + def compresskeyframe(self, text): > + return compress(text) > + > + def decompressdelta(self, bin): > + return decompress(bin) > + > + def decompresskeyframe(self, bin): > + return decompress(bin) > + > def _deltachunk(self, rev): > - return decompress(self._chunkraw(rev, rev)) > + return self.decompressdelta(self._chunkraw(rev, rev)) > > def _keyframechunk(self, rev): > - return decompress(self._chunkraw(rev, rev)) > + return self.decompresskeyframe(self._chunkraw(rev, rev)) > > def _deltachunks(self, revs): > '''faster version of [self._deltachunk(rev) for rev in revs] @@ -906,6 > +918,7 @@ > self._chunkraw(revs[0], revs[-1]) > offset, data = self._chunkcache > > + decompress = self.decompressdelta > for rev in revs: > chunkstart = start(rev) > if inline: > @@ -1112,7 +1125,7 @@ > t = buildtext() > ptext = self.revision(self.node(rev)) > delta = mdiff.textdiff(ptext, t) > - data = compress(delta) > + data = self.compressdelta(delta) > l = len(data[1]) + len(data[0]) > if basecache[0] == rev: > chainbase = basecache[1] @@ -1158,7 +1171,7 @@ > textlen = len(text) > if d is None or dist > textlen * 2: > text = buildtext() > - data = compress(text) > + data = self.compresskeyframe(text) > l = len(data[1]) + len(data[0]) > base = chainbase = curr > > _______________________________________________ > Mercurial-devel mailing list > Mercurial-devel@selenic.com > http://selenic.com/mailman/listinfo/mercurial-devel
> Do you have numbers showing what benefits an extension could get by doing key frame and delta compression differently? This vary. The biggest gain of being able to apply different compressions for key frames and deltas I've seen for now was additional 5% manifest size reduction. I believe gain grows as repository grows, but I haven't checked what the gain is on big repositories, since with my draft, pure python implementation such conversion would take days. > -----Original Message----- > From: Durham Goode > Sent: Thursday, September 26, 2013 1:15 PM > To: Wojciech Lopata; mercurial-devel@selenic.com > Subject: Re: [PATCH 4 of 4] revlog: separate methods for deltas and key > frames compression/decompression > > On 9/26/13 11:14 AM, "Wojciech Lopata" <lopek@fb.com> wrote: > > ># HG changeset patch > ># User Wojciech Lopata <lopek@fb.com> > ># Date 1380212200 25200 > ># Thu Sep 26 09:16:40 2013 -0700 > ># Node ID b7528dd2041110229810e71245eefa4d70aca648 > ># Parent 16ba1a7b494caeb45d24ca81d49bbc866af1e7c5 > >revlog: separate methods for deltas and key frames > >compression/decompression > > > >This change makes possible: > > 1) to use different compression methods in various revlog subclasses > > 2) to use different compression methods for key frames and deltas > > Do you have numbers showing what benefits an extension could get by > doing key frame and delta compression differently?
On 9/26/13 3:32 PM, "Wojciech Lopata" <lopek@fb.com> wrote: >> Do you have numbers showing what benefits an extension could get by >>doing key frame and delta compression differently? > >This vary. The biggest gain of being able to apply different compressions >for key frames and deltas I've seen for now was additional 5% manifest >size reduction. I believe gain grows as repository grows, but I haven't >checked what the gain is on big repositories, since with my draft, pure >python implementation such conversion would take days. But why can we not get the same gain from compressing both key frames and deltas the same? Why would we ever want to apply some compression to a key frame, but not to a delta?
> But why can we not get the same gain from compressing both key frames and deltas the same? Why would we ever want to apply some compression to a key frame, but not to a delta? I probably misused 'compression' term. I need to introduce additional step that would occur after new revision is added to a revlog, but before gziping the revision text. This will let me to apply 'stem compression' (described in http://mercurial.selenic.com/wiki/ImprovingManifestCompressionPlan) to key frames only. Implementing stem compression on the manifest level (before adding a revision to a revlog) actually bums manifest size, since it increases size of deltas in some cases (when new file is added or renamed). I will not call it compression any more. > -----Original Message----- > From: Durham Goode > Sent: Thursday, September 26, 2013 3:35 PM > To: Wojciech Lopata; mercurial-devel@selenic.com > Subject: Re: [PATCH 4 of 4] revlog: separate methods for deltas and key > frames compression/decompression > > On 9/26/13 3:32 PM, "Wojciech Lopata" <lopek@fb.com> wrote: > > >> Do you have numbers showing what benefits an extension could get by > >>doing key frame and delta compression differently? > > > >This vary. The biggest gain of being able to apply different > >compressions for key frames and deltas I've seen for now was additional > >5% manifest size reduction. I believe gain grows as repository grows, > >but I haven't checked what the gain is on big repositories, since with > >my draft, pure python implementation such conversion would take days. > > > But why can we not get the same gain from compressing both key frames > and deltas the same? Why would we ever want to apply some compression > to a key frame, but not to a delta?
On 9/26/13 3:48 PM, "Wojciech Lopata" <lopek@fb.com> wrote: >> But why can we not get the same gain from compressing both key frames >>and deltas the same? Why would we ever want to apply some compression >>to a key frame, but not to a delta? > >I probably misused 'compression' term. > >I need to introduce additional step that would occur after new revision >is added to a revlog, but before gziping the revision text. > >This will let me to apply 'stem compression' (described in >http://mercurial.selenic.com/wiki/ImprovingManifestCompressionPlan) to >key frames only. > >Implementing stem compression on the manifest level (before adding a >revision to a revlog) actually bums manifest size, since it increases >size of deltas in some cases (when new file is added or renamed). > >I will not call it compression any more. From talking to Wojciech in person, there are certain compression algorithms that can be applied to key frames but not to deltas, and if they were applied to the actual manifest (prior to deltafication) can result in larger deltas overall. So splitting these functions up allows us to compress key frames on their own. Stem compressions (as described in the Manifest compression wiki page) is an example of this. I would still like to see what the actual benefits are (i.e. numbers) are before adding the extra complexity to revlog.
Numbers for Mozilla repository: Regular manifest: 303752 (100%) Delta friendly line breaks + binary node strings: 234016 (77.04%) DFLB + BNS + stem compression: 230116 (75.75%) So in this case we get additional 1.3% size reduction. Full revisions constitute 10% of mozilla's manifest, but in some cases it is about 20%, so the additional gain would be about 2.5%. It's not a lot, but this is an easy gain on the other hand I believe. > -----Original Message----- > From: Durham Goode > Sent: Thursday, September 26, 2013 6:48 PM > To: Wojciech Lopata; mercurial-devel@selenic.com > Subject: Re: [PATCH 4 of 4] revlog: separate methods for deltas and key > frames compression/decompression > > On 9/26/13 3:48 PM, "Wojciech Lopata" <lopek@fb.com> wrote: > > >> But why can we not get the same gain from compressing both key frames > >>and deltas the same? Why would we ever want to apply some > compression > >>to a key frame, but not to a delta? > > > >I probably misused 'compression' term. > > > >I need to introduce additional step that would occur after new revision > >is added to a revlog, but before gziping the revision text. > > > >This will let me to apply 'stem compression' (described in > >http://mercurial.selenic.com/wiki/ImprovingManifestCompressionPlan) to > >key frames only. > > > >Implementing stem compression on the manifest level (before adding a > >revision to a revlog) actually bums manifest size, since it increases > >size of deltas in some cases (when new file is added or renamed). > > > >I will not call it compression any more. > > From talking to Wojciech in person, there are certain compression algorithms > that can be applied to key frames but not to deltas, and if they were applied > to the actual manifest (prior to deltafication) can result in larger deltas > overall. So splitting these functions up allows us to compress key frames on > their own. Stem compressions (as described in the Manifest compression > wiki page) is an example of this. > > I would still like to see what the actual benefits are (i.e. numbers) are before > adding the extra complexity to revlog.
On 9/28/13 12:14 PM, "Wojciech Lopata" <lopek@fb.com> wrote: >Numbers for Mozilla repository: > >Regular manifest: 303752 (100%) >Delta friendly line breaks + binary node strings: 234016 (77.04%) >DFLB + BNS + stem compression: 230116 (75.75%) > >So in this case we get additional 1.3% size reduction. Full revisions >constitute 10% of mozilla's manifest, but in some cases it is about 20%, >so the additional gain would be about 2.5%. > >It's not a lot, but this is an easy gain on the other hand I believe. > I don't think a potential 2% is worth the extra complexity to the revlog class. So I'd vote for leaving this out of core Mercurial.
On Mon, Sep 30, 2013 at 06:49:09PM +0000, Durham Goode wrote: > On 9/28/13 12:14 PM, "Wojciech Lopata" <lopek@fb.com> wrote: > > >Numbers for Mozilla repository: > > > >Regular manifest: 303752 (100%) > >Delta friendly line breaks + binary node strings: 234016 (77.04%) > >DFLB + BNS + stem compression: 230116 (75.75%) > > > >So in this case we get additional 1.3% size reduction. Full revisions > >constitute 10% of mozilla's manifest, but in some cases it is about 20%, > >so the additional gain would be about 2.5%. > > > >It's not a lot, but this is an easy gain on the other hand I believe. > > > > I don't think a potential 2% is worth the extra complexity to the revlog > class. So I'd vote for leaving this out of core Mercurial. So far I agree. > > _______________________________________________ > Mercurial-devel mailing list > Mercurial-devel@selenic.com > http://selenic.com/mailman/listinfo/mercurial-devel
Patch
diff --git a/mercurial/revlog.py b/mercurial/revlog.py --- a/mercurial/revlog.py +++ b/mercurial/revlog.py @@ -881,11 +881,23 @@ length = end - start return self._getchunk(start, length) + def compressdelta(self, delta): + return compress(delta) + + def compresskeyframe(self, text): + return compress(text) + + def decompressdelta(self, bin): + return decompress(bin) + + def decompresskeyframe(self, bin): + return decompress(bin) + def _deltachunk(self, rev): - return decompress(self._chunkraw(rev, rev)) + return self.decompressdelta(self._chunkraw(rev, rev)) def _keyframechunk(self, rev): - return decompress(self._chunkraw(rev, rev)) + return self.decompresskeyframe(self._chunkraw(rev, rev)) def _deltachunks(self, revs): '''faster version of [self._deltachunk(rev) for rev in revs] @@ -906,6 +918,7 @@ self._chunkraw(revs[0], revs[-1]) offset, data = self._chunkcache + decompress = self.decompressdelta for rev in revs: chunkstart = start(rev) if inline: @@ -1112,7 +1125,7 @@ t = buildtext() ptext = self.revision(self.node(rev)) delta = mdiff.textdiff(ptext, t) - data = compress(delta) + data = self.compressdelta(delta) l = len(data[1]) + len(data[0]) if basecache[0] == rev: chainbase = basecache[1] @@ -1158,7 +1171,7 @@ textlen = len(text) if d is None or dist > textlen * 2: text = buildtext() - data = compress(text) + data = self.compresskeyframe(text) l = len(data[1]) + len(data[0]) base = chainbase = curr