Submitter | Wojciech Lopata |
---|---|
Date | Sept. 21, 2013, 7:47 a.m. |
Message ID | <68a30bf47fe1af70d698.1379749671@dev1179.prn1.facebook.com> |
Permalink | /patch/2588/ |
State | Accepted |
Commit | e92650e39f1cd8ff7565c583e8bf0fa0bdac364d |
Comments
On 21 September 2013 17:47, Wojciech Lopata <lopek@fb.com> wrote:

> # HG changeset patch
> # User Wojciech Lopata <lopek@fb.com>
> # Date 1379699151 25200
> # Fri Sep 20 10:45:51 2013 -0700
> # Node ID 68a30bf47fe1af70d698ddb213c95ce7331240d8
> # Parent 1c62f9487e46a467aace39316459a5be57c55e8a
> generaldelta: initialize basecache properly
>
> Previously basecache was incorrectly initialized before adding the first
> revision from a changegroup. Basecache value influences when full
> revisions are stored in revlog (when using generaldelta). As a result it
> was possible to generate a generaldelta-revlog that could be bigger by
> arbitrary factor than its non-generaldelta equivalent.

Is there a reason generaldelta is still undocumented and/or not yet the default format? It's been in since Mercurial 1.9 and I only found out about it thanks to this patch.

My work repo (~55000 changesets, mostly from SVN via hgsubversion) has had its 00manifest.d reduced from ~1.4GB to ~28MB by:

    hg clone -U --pull --config format.generaldelta=1 repo repo-gdelta
    ~1.4GB -> ~600MB

    hg clone -U --pull --config format.generaldelta=1 repo-gdelta repo-gdelta-gdelta
    ~600MB -> ~28MB

Apart from the reduction in disk space, TortoiseHg is a pleasure to use again on that repo.

Note that there's a slight difference in the number of revisions below - I forgot to take the stats for my pre-generaldelta clone so I needed to grab it from the central repo which has pulled a few more changesets in. The difference is insignificant though.
    > hg -R repo debugrevlog -m
    format : 1
    flags  : (none)

    revisions     : 55591
        merges    :   469 ( 0.84%)
        normal    : 55122 (99.16%)
    revisions     : 55591
        full      :   812 ( 1.46%)
        deltas    : 54779 (98.54%)
    revision size : 1461596071
        full      :  111349536 ( 7.62%)
        deltas    : 1350246535 (92.38%)

    avg chain length  : 2613
    compression ratio : 68

    uncompressed data size (min/max/avg) : 0 / 7531986 / 1795501
    full revision size (min/max/avg)     : 0 / 819231 / 137129
    delta size (min/max/avg)             : 0 / 811413 / 24648

    deltas against prev  : 54779 (100.00%)
        where prev = p1  : 46283 (84.49%)
        where prev = p2  :   240 ( 0.44%)
        other            :  8256 (15.07%)

    > hg -R repo-gdelta debugrevlog -m
    format : 1
    flags  : generaldelta

    revisions     : 55471
        merges    :   469 ( 0.85%)
        normal    : 55002 (99.15%)
    revisions     : 55471
        full      :   609 ( 1.10%)
        deltas    : 54862 (98.90%)
    revision size : 589765769
        full      :  17862021 ( 3.03%)
        deltas    : 571903748 (96.97%)

    avg chain length  : 2779
    compression ratio : 168

    uncompressed data size (min/max/avg) : 0 / 7531986 / 1793619
    full revision size (min/max/avg)     : 0 / 623224 / 29330
    delta size (min/max/avg)             : 0 / 811413 / 10424

    deltas against prev  : 47701 (86.95%)
        where prev = p1  : 46245 (96.95%)
        where prev = p2  :    91 ( 0.19%)
        other            :  1365 ( 2.86%)
    deltas against p1    :  7129 (12.99%)
    deltas against p2    :    32 ( 0.06%)
    deltas against other :     0 ( 0.00%)

    > hg -R repo-gdelta-gdelta debugrevlog -m
    format : 1
    flags  : generaldelta

    revisions     : 55471
        merges    :   469 ( 0.85%)
        normal    : 55002 (99.15%)
    revisions     : 55471
        full      :    31 ( 0.06%)
        deltas    : 55440 (99.94%)
    revision size : 28485828
        full      :  2440865 ( 8.57%)
        deltas    : 26044963 (91.43%)

    avg chain length  : 4618
    compression ratio : 3492

    uncompressed data size (min/max/avg) : 0 / 7531986 / 1793619
    full revision size (min/max/avg)     : 0 / 819231 / 78737
    delta size (min/max/avg)             : 0 / 806813 / 469

    deltas against prev  : 55274 (99.70%)
        where prev = p1  : 55222 (99.91%)
        where prev = p2  :     3 ( 0.01%)
        other            :    49 ( 0.09%)
    deltas against p1    :   162 ( 0.29%)
    deltas against p2    :     4 ( 0.01%)
    deltas against other :     0 ( 0.00%)

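As a rough cross-check of the figures above, the reported compression ratios are consistent with dividing the total uncompressed manifest size (approximated here as average uncompressed size times revision count) by the stored revision size. The helper name `approx_compression_ratio` is made up for this illustration, and the sketch only reproduces the arithmetic; it is not a claim about how `debugrevlog` itself computes the value.

```python
def approx_compression_ratio(revisions, avg_uncompressed, stored_size):
    """Approximate ratio of uncompressed manifest data to stored revlog data."""
    # The average is already rounded in the report, so this is an estimate.
    total_uncompressed = revisions * avg_uncompressed
    return total_uncompressed // stored_size


# (revisions, avg uncompressed size, revision size) copied from the
# three debugrevlog -m outputs quoted in this message.
repos = {
    "repo": (55591, 1795501, 1461596071),
    "repo-gdelta": (55471, 1793619, 589765769),
    "repo-gdelta-gdelta": (55471, 1793619, 28485828),
}

ratios = {name: approx_compression_ratio(*figures) for name, figures in repos.items()}

for name, ratio in ratios.items():
    print(f"{name}: ~{ratio}x")
```

The uncompressed data is essentially the same in all three repositories, so the change in ratio comes almost entirely from the stored size.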
Tim Delaney
timothy.c.delaney@gmail.com writes:

> On 21 September 2013 17:47, Wojciech Lopata <lopek@fb.com> wrote:
>
>> # HG changeset patch
>> # User Wojciech Lopata <lopek@fb.com>
>> # Date 1379699151 25200
>> # Fri Sep 20 10:45:51 2013 -0700
>> # Node ID 68a30bf47fe1af70d698ddb213c95ce7331240d8
>> # Parent 1c62f9487e46a467aace39316459a5be57c55e8a
>> generaldelta: initialize basecache properly
>>
>> Previously basecache was incorrectly initialized before adding the first
>> revision from a changegroup. Basecache value influences when full
>> revisions are stored in revlog (when using generaldelta). As a result it
>> was possible to generate a generaldelta-revlog that could be bigger by
>> arbitrary factor than its non-generaldelta equivalent.
>
> Is there a reason generaldelta is still undocumented and/or not yet the
> default format? It's been in since Mercurial 1.9 and I only found out about
> it thanks to this patch.

Mostly because the bundle code needed to be updated to handle exchanging generaldelta (and other things, like bookmarked heads) over the wire. I believe this finally made some decent progress at the last sprint but still isn't done yet.
On 22 September 2013 05:08, Sean Farley <sean.michael.farley@gmail.com> wrote:

> timothy.c.delaney@gmail.com writes:
>
> > Is there a reason generaldelta is still undocumented and/or not yet the
> > default format? It's been in since Mercurial 1.9 and I only found out
> > about it thanks to this patch.
>
> Mostly, because the bundle code needed to be updated to handle
> exchanging generaldelta (and other things, like bookmarked heads) over
> the wire. I believe this finally made some decent progress at the last
> sprint but still isn't done yet.

So given that we don't use bookmarks, is there any reason I shouldn't convert my central repo to use generaldelta to gain the existing benefits? Is it just that some things are less optimised than they could be, or is there the potential for data loss/corruption?

Tim Delaney
timothy.c.delaney@gmail.com writes:

> On 22 September 2013 05:08, Sean Farley <sean.michael.farley@gmail.com> wrote:
>
>> timothy.c.delaney@gmail.com writes:
>>
>> > Is there a reason generaldelta is still undocumented and/or not yet the
>> > default format? It's been in since Mercurial 1.9 and I only found out
>> > about it thanks to this patch.
>>
>> Mostly, because the bundle code needed to be updated to handle
>> exchanging generaldelta (and other things, like bookmarked heads) over
>> the wire. I believe this finally made some decent progress at the last
>> sprint but still isn't done yet.
>
> So given that we don't use bookmarks, is there any reason I shouldn't
> convert my central repo to use generaldelta to gain the existing benefits?
> Is it just that some things are less optimised than they could be, or is
> there the potential for data loss/corruption?

Sure, you could enable it, but as Wojciech's patch showed there was a potential for generaldelta to be bigger than the original repo. Since bundle 2.0 isn't quite ready to handle generaldelta, you'll only benefit from it locally (i.e. it'll do a conversion on-the-fly when someone clones it).

For what it's worth, I've had it enabled for a while now and haven't had any trouble.
On 22 September 2013 05:36, Sean Farley <sean.michael.farley@gmail.com> wrote:

> timothy.c.delaney@gmail.com writes:
>
> > On 22 September 2013 05:08, Sean Farley <sean.michael.farley@gmail.com> wrote:
> >
> >> timothy.c.delaney@gmail.com writes:
> >>
> >> > Is there a reason generaldelta is still undocumented and/or not yet the
> >> > default format? It's been in since Mercurial 1.9 and I only found out
> >> > about it thanks to this patch.
> >>
> >> Mostly, because the bundle code needed to be updated to handle
> >> exchanging generaldelta (and other things, like bookmarked heads) over
> >> the wire. I believe this finally made some decent progress at the last
> >> sprint but still isn't done yet.
> >
> > So given that we don't use bookmarks, is there any reason I shouldn't
> > convert my central repo to use generaldelta to gain the existing benefits?
> > Is it just that some things are less optimised than they could be, or is
> > there the potential for data loss/corruption?
>
> Sure, you could enable it but as Wojciech's patch showed there was a
> potential for generaldelta to be bigger than the original repo. Since
> bundle 2.0 isn't quite ready to handle generaldelta, you'll only benefit
> from it locally (i.e. it'll do a conversion on-the-fly when someone
> clones it).
>
> For what it's worth, I've had it enabled for a while now and haven't any
> trouble.

Thanks Sean,

I actually tried it with Wojciech's patch (just that one patch applied on top of 2.7.1) and it made zero difference on my repo.

Just to be clear - pulling from a remote generaldelta repo (e.g. over ssh) to a local generaldelta repo will currently have the same effect as pulling from a non-generaldelta repo to a generaldelta repo - you get parent deltas in the local repo, but not the benefit of reordering that you get with a local clone. Is that correct? If so, it's still a win for me in terms of manifest size.

There's a code freeze on the SVN repo coming up this week so it sounds like I should take the opportunity to change my central repo to use generaldelta (it'll probably take a couple of days to do - sitting on an Amazon micro instance). I'll see immediate benefits in terms of disk space there and it sets me up for the future wire protocol improvements.

Do you think it would be worthwhile documenting generaldelta in its current state (noting the limitations)? Maybe with a release that contains Wojciech's patch.

Cheers,

Tim Delaney
On Sun, 22 Sep 2013 06:00:51 +1000 Tim Delaney <timothy.c.delaney@gmail.com> wrote:

> I actually tried it with Wojciech's patch (just that one patch applied on
> top of 2.7.1) and it made zero difference on my repo.
>
> Just to be clear - pulling from a remote generaldelta repo (e.g. over ssh)
> to a local generaldelta repo will currently have the same effect as pulling
> from a non-generaldelta repo to a generaldelta repo - you get parent deltas
> in the local repo, but not the benefit of reordering that you get with a
> local clone. Is that correct? If so, it's still a win for me in terms of
> manifest size.

For the record, the CPython repo at http://hg.python.org/cpython has been using generaldelta for two years now without any issues. The size savings are quite interesting:
https://mail.python.org/pipermail/python-committers/2011-July/001764.html

Also, all remote clones benefit from the savings. A fresh clone today is still about 210 MB; without generaldelta it would perhaps be more than 400 MB.

(By the way, Wojciech's patch doesn't seem to make a difference here.)

Regards

Antoine.
On 22 September 2013 06:00, Tim Delaney <timothy.c.delaney@gmail.com> wrote:

> Just to be clear - pulling from a remote generaldelta repo (e.g. over ssh)
> to a local generaldelta repo will currently have the same effect as pulling
> from a non-generaldelta repo to a generaldelta repo - you get parent deltas
> in the local repo, but not the benefit of reordering that you get with a
> local clone. Is that correct? If so, it's still a win for me in terms of
> manifest size.

Interesting. I've just been able to try this - cloned a remote generaldelta repo (running under Mercurial 2.4) pulled over ssh via Mercurial 2.7.

The remote repo had a 600MB manifest. The clone has a 27MB manifest. Cloning over SSH gained the benefit of reordering, just like a local clone.

So under what circumstances would you *not* get the benefit of reordering?

Tim Delaney
On Sun, 2013-09-22 at 05:01 +1000, Tim Delaney wrote:

> On 21 September 2013 17:47, Wojciech Lopata <lopek@fb.com> wrote:
>
> > # HG changeset patch
> > # User Wojciech Lopata <lopek@fb.com>
> > # Date 1379699151 25200
> > # Fri Sep 20 10:45:51 2013 -0700
> > # Node ID 68a30bf47fe1af70d698ddb213c95ce7331240d8
> > # Parent 1c62f9487e46a467aace39316459a5be57c55e8a
> > generaldelta: initialize basecache properly
> >
> > Previously basecache was incorrectly initialized before adding the first
> > revision from a changegroup. Basecache value influences when full
> > revisions are stored in revlog (when using generaldelta). As a result it
> > was possible to generate a generaldelta-revlog that could be bigger by
> > arbitrary factor than its non-generaldelta equivalent.
>
> Is there a reason generaldelta is still undocumented and/or not yet the
> default format? It's been in since Mercurial 1.9 and I only found out about
> it thanks to this patch. My work repo (~55000 changesets, mostly from SVN
> via hgsubversion) has had its 00manifest.d reduced from ~1.4GB to ~28MB by:

The reason is that you could still end up sending that 1.4GB over the wire =and= taking substantially more CPU than before... because the wire protocol can only do linear deltas and thus will have to recompute the deltas for the old format. This will be fixed when we get the new bundle format figured out.

You might find that a standard clone of your generaldelta repo is smaller than your original repo.
On Sat, 2013-09-21 at 00:47 -0700, Wojciech Lopata wrote:

> # HG changeset patch
> # User Wojciech Lopata <lopek@fb.com>
> # Date 1379699151 25200
> # Fri Sep 20 10:45:51 2013 -0700
> # Node ID 68a30bf47fe1af70d698ddb213c95ce7331240d8
> # Parent 1c62f9487e46a467aace39316459a5be57c55e8a
> generaldelta: initialize basecache properly

Queued for stable, thanks. Check-code sends its regards.
On 22 September 2013 08:06, Matt Mackall <mpm@selenic.com> wrote:

> On Sun, 2013-09-22 at 05:01 +1000, Tim Delaney wrote:
> > Is there a reason generaldelta is still undocumented and/or not yet the
> > default format? It's been in since Mercurial 1.9 and I only found out
> > about it thanks to this patch. My work repo (~55000 changesets, mostly
> > from SVN via hgsubversion) has had its 00manifest.d reduced from ~1.4GB
> > to ~28MB by:
>
> The reason is that you could still end up sending that 1.4GB over the
> wire =and= taking substantially more CPU than before... because the wire
> protocol can only do linear deltas and thus will have to recompute the
> deltas for the old format. This will be fixed when we get the new bundle
> format figured out.
>
> You might find that a standard clone of your generaldelta repo is
> smaller than your original repo.

Not quite, but it's close - ~31MB compared to ~27MB.

I think I might have got my understanding backwards before.

Background - my repo is a lot of fairly unrelated branches - the SVN repo is essentially several unrelated repos implemented as different branches plus a number of related feature branches where no merging occurs - just branching off. Any and all of the branches may be committed to resulting in a lot of completely unrelated interleaved commits (resulting in the 1.4GB manifest).

Based on what I'm seeing, this is what I think is happening when pulling from a remote generaldelta repo to a local generaldelta repo over ssh. Please correct me if I've got it wrong. If I'm right, generaldelta will be a substantial win for my repo even with the existing wire protocol.

1. Remote reorders to produce the longest chains it can such that prev will be a parent.

2. Remote recomputes the deltas. There could be substantial savings here if the original order results in lots of interleaved unrelated branches, but reordering results in long chains on the same branch. This is what I would expect with my repo.

3. Local receives the deltas and then recomputes generaldelta. The changesets are already in near-optimal order.

The debugrevlog -m supports this:

original standard repo:

    deltas against prev  : 54779 (100.00%)
        where prev = p1  : 46283 (84.49%)
        where prev = p2  :   240 ( 0.44%)
        other            :  8256 (15.07%)

generaldelta cloned from standard locally (numbers slightly lower here - hadn't pulled all changesets in at this point):

    deltas against prev  : 47701 (86.95%)
        where prev = p1  : 46245 (96.95%)
        where prev = p2  :    91 ( 0.19%)
        other            :  1365 ( 2.86%)
    deltas against p1    :  7129 (12.99%)
    deltas against p2    :    32 ( 0.06%)
    deltas against other :     0 ( 0.00%)

generaldelta cloned from generaldelta over ssh:

    deltas against prev  : 55390 (99.69%)
        where prev = p1  : 55341 (99.91%)
        where prev = p2  :     3 ( 0.01%)
        other            :    46 ( 0.08%)
    deltas against p1    :   167 ( 0.30%)
    deltas against p2    :     3 ( 0.01%)
    deltas against other :     0 ( 0.00%)

Tim Delaney
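To illustrate why interleaved commits on unrelated branches hurt a revlog that always deltas against the previous revision, here is a toy model. Everything in it is an assumption made for illustration: manifests are plain dicts, the "delta size" is just a count of differing entries, and the names `delta_size` and `make_history` are invented. It does not use or imitate Mercurial's actual delta encoding.

```python
def delta_size(base, target):
    """Toy delta cost: number of (path, value) entries that differ."""
    return len(set(base.items()) ^ set(target.items()))


def make_history(n_commits, files_per_branch=100):
    """Build two unrelated branches whose commits alternate in revision order."""
    manifests, parents = [], []
    tips = {}  # branch name -> revision number of its latest commit
    for rev in range(n_commits):
        branch = "a" if rev % 2 == 0 else "b"
        parent = tips.get(branch)
        if parent is None:
            # First commit on this branch: a manifest unrelated to the other branch.
            manifest = {f"{branch}/file{i}": 0 for i in range(files_per_branch)}
        else:
            # Later commits change a single file relative to their parent.
            manifest = dict(manifests[parent])
            manifest[f"{branch}/file0"] = rev
        manifests.append(manifest)
        parents.append(parent)
        tips[branch] = rev
    return manifests, parents


n = 20
manifests, parents = make_history(n)

# Cost when every revision is stored as a delta against the previous revision.
against_prev = sum(delta_size(manifests[r - 1], manifests[r]) for r in range(1, n))

# Cost when every revision with a parent is stored as a delta against that parent.
against_parent = sum(
    delta_size(manifests[parents[r]], manifests[r])
    for r in range(1, n)
    if parents[r] is not None
)

print("against prev:  ", against_prev)
print("against parent:", against_parent)
```

In this model each delta against the previous revision has to describe an entire unrelated manifest, while each delta against the parent describes a single changed file, which is the same shape of saving the debugrevlog figures show.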
On 22 September 2013 09:47, Tim Delaney <timothy.c.delaney@gmail.com> wrote:

> On 22 September 2013 08:06, Matt Mackall <mpm@selenic.com> wrote:
>
>> On Sun, 2013-09-22 at 05:01 +1000, Tim Delaney wrote:
>> > Is there a reason generaldelta is still undocumented and/or not yet the
>> > default format? It's been in since Mercurial 1.9 and I only found out
>> > about it thanks to this patch. My work repo (~55000 changesets, mostly
>> > from SVN via hgsubversion) has had its 00manifest.d reduced from ~1.4GB
>> > to ~28MB by:
>>
>> The reason is that you could still end up sending that 1.4GB over the
>> wire =and= taking substantially more CPU than before... because the wire
>> protocol can only do linear deltas and thus will have to recompute the
>> deltas for the old format. This will be fixed when we get the new bundle
>> format figured out.
>>
>> You might find that a standard clone of your generaldelta repo is
>> smaller than your original repo.
>
> Not quite, but it's close - ~31MB compared to ~27MB.

Sorry - of course that's compared to the final generaldelta repo. The standard clone manifest is *much* smaller than the original repo, but depending on where it received changesets from could then potentially grow quickly again.

Tim Delaney
Patch
    diff --git a/mercurial/revlog.py b/mercurial/revlog.py
    --- a/mercurial/revlog.py
    +++ b/mercurial/revlog.py
    @@ -200,7 +200,7 @@
             self.datafile = indexfile[:-2] + ".d"
             self.opener = opener
             self._cache = None
    -        self._basecache = (0, 0)
    +        self._basecache = None
             self._chunkcache = (0, '')
             self.index = []
             self._pcache = {}
    @@ -1131,6 +1131,8 @@
             offset = self.end(prev)
             flags = 0
             d = None
    +        if self._basecache is None:
    +            self._basecache = (prev, self.chainbase(prev))
             basecache = self._basecache
             p1r, p2r = self.rev(p1), self.rev(p2)
     
    diff --git a/tests/test-generaldelta.t b/tests/test-generaldelta.t
    new file mode 100755
    --- /dev/null
    +++ b/tests/test-generaldelta.t
    @@ -0,0 +1,23 @@
    +Check whether size of generaldelta revlog is not bigger than its regular
    +equivalent. Test would fail if generaldelta was naive implementation of
    +parentdelta: third manifest revision would be fully inserted due to big distance
    +from its paren revision (zero).
    +
    +  $ hg init repo
    +  $ cd repo
    +  $ echo foo > foo
    +  $ echo bar > bar
    +  $ hg commit -q -Am boo
    +  $ hg clone --pull . ../gdrepo -q --config format.generaldelta=yes
    +  $ for r in 1 2 3; do
    +  > echo $r > foo
    +  > hg commit -q -m $r
    +  > hg up -q -r 0
    +  > hg pull . -q -r $r -R ../gdrepo
    +  > done
    +  $ cd ..
    +  $ regsize=$(du -s -b repo/.hg/store/00manifest.i | cut -f 1)
    +  $ gdsize=$(du -s -b gdrepo/.hg/store/00manifest.i | cut -f 1)
    +  $ if ((regsize < gdsize)); then
    +  > echo 'generaldelta increased size of a revlog!'
    +  > fi
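The effect of the stale `(0, 0)` initial value can be sketched with a toy model. This is an assumption-laden simplification based on the patch description and the test comment, not a copy of the code in `revlog.py`: the function `pick_delta_base` is hypothetical, and it assumes the delta base is chosen by comparing the parents' revision numbers with the chain base held in the second element of `basecache`.

```python
NULLREV = -1  # stand-in for "no such parent" in this sketch


def pick_delta_base(p1r, p2r, prev, basecache):
    """Toy choice of which revision a new generaldelta revision deltas against.

    basecache is (rev, chain base of rev); only the chain base is consulted.
    """
    chainbase = basecache[1]
    if p1r >= chainbase:
        return p1r
    if p2r >= chainbase:
        return p2r
    return prev


# Scenario (assumed): the new revision's only parent is revision 0,
# revisions 1..5 already exist, and the current delta chain starts at 3.

# With the old initial value every parent looks like it lies inside the
# current chain, so the far-away parent 0 is picked.
stale = pick_delta_base(p1r=0, p2r=NULLREV, prev=5, basecache=(0, 0))

# Initialised as in the patch, (prev, chainbase(prev)), the parent is seen
# to lie before the chain base and the previous revision is used instead.
fixed = pick_delta_base(p1r=0, p2r=NULLREV, prev=5, basecache=(5, 3))

print("stale basecache picks revision", stale)
print("fixed basecache picks revision", fixed)
```

Under these assumptions, picking a distant parent is what makes the distance check fail and forces a full revision to be stored, which matches the scenario described in the test's opening comment.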