Patchwork [V2] generaldelta: initialize basecache properly

Submitter Wojciech Lopata
Date Sept. 21, 2013, 7:47 a.m.
Message ID <68a30bf47fe1af70d698.1379749671@dev1179.prn1.facebook.com>
Permalink /patch/2588/
State Accepted
Commit e92650e39f1cd8ff7565c583e8bf0fa0bdac364d

Comments

Wojciech Lopata - Sept. 21, 2013, 7:47 a.m.
# HG changeset patch
# User Wojciech Lopata <lopek@fb.com>
# Date 1379699151 25200
#      Fri Sep 20 10:45:51 2013 -0700
# Node ID 68a30bf47fe1af70d698ddb213c95ce7331240d8
# Parent  1c62f9487e46a467aace39316459a5be57c55e8a
generaldelta: initialize basecache properly

Previously basecache was incorrectly initialized before adding the first
revision from a changegroup. The basecache value influences when full
revisions are stored in a revlog (when using generaldelta). As a result it was
possible to generate a generaldelta revlog that could be bigger by an arbitrary
factor than its non-generaldelta equivalent.
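
As a rough illustration of the fix described above, here is a toy sketch of the lazy initialization. It is not Mercurial's code: `ToyRevlog`, `deltaparents` and `next_basecache` are hypothetical names, and only `_basecache` and `chainbase` are taken from the patch at the end of this thread. The sketch assumes a non-empty revlog.

```python
class ToyRevlog:
    """Toy stand-in for a revlog; only tracks delta parents."""

    def __init__(self, deltaparents):
        # deltaparents[rev] is the revision that rev is stored as a delta
        # against. A full snapshot points at itself. (Hypothetical layout.)
        self.deltaparents = list(deltaparents)
        # None means "not known yet", instead of the hard-coded (0, 0).
        self._basecache = None

    def chainbase(self, rev):
        # Walk delta parents until reaching a full snapshot.
        while self.deltaparents[rev] != rev:
            rev = self.deltaparents[rev]
        return rev

    def next_basecache(self):
        # Assumes at least one revision is already stored.
        prev = len(self.deltaparents) - 1
        if self._basecache is None:
            # Derive the cache from the real tip on first use.
            self._basecache = (prev, self.chainbase(prev))
        return self._basecache


# Revisions 0 and 3 are full snapshots; 1, 2 and 4 are deltas.
log = ToyRevlog([0, 0, 1, 3, 3])
print(log.next_basecache())
```

With a hard-coded `(0, 0)` the cache would claim that the tip's chain starts at revision 0, whereas in this toy layout the chain holding the tip starts at revision 3.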
Tim Delaney - Sept. 21, 2013, 7:01 p.m.
On 21 September 2013 17:47, Wojciech Lopata <lopek@fb.com> wrote:

> # HG changeset patch
> # User Wojciech Lopata <lopek@fb.com>
> # Date 1379699151 25200
> #      Fri Sep 20 10:45:51 2013 -0700
> # Node ID 68a30bf47fe1af70d698ddb213c95ce7331240d8
> # Parent  1c62f9487e46a467aace39316459a5be57c55e8a
> generaldelta: initialize basecache properly
>
> Previously basecache was incorrectly initialized before adding the first
> revision from a changegroup. Basecache value influences when full
> revisions are stored in revlog (when using generaldelta). As a result it
> was possible to generate a generaldelta-revlog that could be bigger by
> arbitrary factor than its non-generaldelta equivalent.
>

Is there a reason generaldelta is still undocumented and/or not yet the
default format? It's been in since Mercurial 1.9 and I only found out about
it thanks to this patch. My work repo (~55000 changesets, mostly from SVN
via hgsubversion) has had its 00manifest.d reduced from ~1.4GB to ~28MB by:

hg clone -U --pull --config format.generaldelta=1 repo repo-gdelta
~1.4GB -> ~600MB

hg clone -U --pull --config format.generaldelta=1 repo-gdelta repo-gdelta-gdelta
~600MB -> ~28MB

Apart from the reduction in disk space, TortoiseHg is a pleasure to use
again on that repo.

Note that there's a slight difference in the number of revisions below - I
forgot to take the stats for my pre-generaldelta clone so I needed to grab
them from the central repo, which has pulled a few more changesets in. The
difference is insignificant though.

> hg -R repo debugrevlog -m
format : 1
flags  : (none)

revisions     :      55591
    merges    :        469 ( 0.84%)
    normal    :      55122 (99.16%)
revisions     :      55591
    full      :        812 ( 1.46%)
    deltas    :      54779 (98.54%)
revision size : 1461596071
    full      :  111349536 ( 7.62%)
    deltas    : 1350246535 (92.38%)

avg chain length  : 2613
compression ratio :   68

uncompressed data size (min/max/avg) : 0 / 7531986 / 1795501
full revision size (min/max/avg)     : 0 / 819231 / 137129
delta size (min/max/avg)             : 0 / 811413 / 24648

deltas against prev  : 54779 (100.00%)
    where prev = p1  : 46283     (84.49%)
    where prev = p2  :   240     ( 0.44%)
    other            :  8256     (15.07%)

> hg -R repo-gdelta debugrevlog -m
format : 1
flags  : generaldelta

revisions     :     55471
    merges    :       469 ( 0.85%)
    normal    :     55002 (99.15%)
revisions     :     55471
    full      :       609 ( 1.10%)
    deltas    :     54862 (98.90%)
revision size : 589765769
    full      :  17862021 ( 3.03%)
    deltas    : 571903748 (96.97%)

avg chain length  : 2779
compression ratio :  168

uncompressed data size (min/max/avg) : 0 / 7531986 / 1793619
full revision size (min/max/avg)     : 0 / 623224 / 29330
delta size (min/max/avg)             : 0 / 811413 / 10424

deltas against prev  : 47701 (86.95%)
    where prev = p1  : 46245     (96.95%)
    where prev = p2  :    91     ( 0.19%)
    other            :  1365     ( 2.86%)
deltas against p1    :  7129 (12.99%)
deltas against p2    :    32 ( 0.06%)
deltas against other :     0 ( 0.00%)

> hg -R repo-gdelta-gdelta debugrevlog -m
format : 1
flags  : generaldelta

revisions     :    55471
    merges    :      469 ( 0.85%)
    normal    :    55002 (99.15%)
revisions     :    55471
    full      :       31 ( 0.06%)
    deltas    :    55440 (99.94%)
revision size : 28485828
    full      :  2440865 ( 8.57%)
    deltas    : 26044963 (91.43%)

avg chain length  : 4618
compression ratio : 3492

uncompressed data size (min/max/avg) : 0 / 7531986 / 1793619
full revision size (min/max/avg)     : 0 / 819231 / 78737
delta size (min/max/avg)             : 0 / 806813 / 469

deltas against prev  : 55274 (99.70%)
    where prev = p1  : 55222     (99.91%)
    where prev = p2  :     3     ( 0.01%)
    other            :    49     ( 0.09%)
deltas against p1    :   162 ( 0.29%)
deltas against p2    :     4 ( 0.01%)
deltas against other :     0 ( 0.00%)

Tim Delaney
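
The `compression ratio` lines in the three listings above appear to be consistent with the total uncompressed size divided by the stored revision size, truncated to an integer. A quick check in Python, using only numbers printed above; estimating the total as revisions times the average size is an assumption, since the printed average is itself rounded:

```python
# (revisions, avg uncompressed size, stored revision size), copied from the
# three `debugrevlog -m` listings above.
stats = {
    "repo": (55591, 1795501, 1461596071),
    "repo-gdelta": (55471, 1793619, 589765769),
    "repo-gdelta-gdelta": (55471, 1793619, 28485828),
}


def estimated_ratio(revisions, avg_uncompressed, stored):
    # Assumption: total uncompressed size ~= revisions * average size.
    return (revisions * avg_uncompressed) // stored


ratios = {name: estimated_ratio(*row) for name, row in stats.items()}
for name, ratio in ratios.items():
    print(f"{name}: estimated compression ratio {ratio}")

# Overall reduction in stored manifest data between first and last repo.
shrink = stats["repo"][2] / stats["repo-gdelta-gdelta"][2]
print(f"stored manifest data shrank about {shrink:.0f}x")
```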
Sean Farley - Sept. 21, 2013, 7:08 p.m.
timothy.c.delaney@gmail.com writes:

> On 21 September 2013 17:47, Wojciech Lopata <lopek@fb.com> wrote:
>
>> # HG changeset patch
>> # User Wojciech Lopata <lopek@fb.com>
>> # Date 1379699151 25200
>> #      Fri Sep 20 10:45:51 2013 -0700
>> # Node ID 68a30bf47fe1af70d698ddb213c95ce7331240d8
>> # Parent  1c62f9487e46a467aace39316459a5be57c55e8a
>> generaldelta: initialize basecache properly
>>
>> Previously basecache was incorrectly initialized before adding the first
>> revision from a changegroup. Basecache value influences when full
>> revisions are stored in revlog (when using generaldelta). As a result it
>> was possible to generate a generaldelta-revlog that could be bigger by
>> arbitrary factor than its non-generaldelta equivalent.
>>
>
> Is there a reason generaldelta is still undocumented and/or not yet the
> default format? It's been in since Mercurial 1.9 and I only found out about
> it thanks to this patch.

Mostly, because the bundle code needed to be updated to handle
exchanging generaldelta (and other things, like bookmarked heads) over
the wire. I believe this finally made some decent progress at the last
sprint but still isn't done yet.
Tim Delaney - Sept. 21, 2013, 7:28 p.m.
On 22 September 2013 05:08, Sean Farley <sean.michael.farley@gmail.com> wrote:

>
> timothy.c.delaney@gmail.com writes:
>
> > Is there a reason generaldelta is still undocumented and/or not yet the
> > default format? It's been in since Mercurial 1.9 and I only found out
> > about it thanks to this patch.
>
> Mostly, because the bundle code needed to be updated to handle
> exchanging generaldelta (and other things, like bookmarked heads) over
> the wire. I believe this finally made some decent progress at the last
> sprint but still isn't done yet.
>

So given that we don't use bookmarks, is there any reason I shouldn't
convert my central repo to use generaldelta to gain the existing benefits?
Is it just that some things are less optimised than they could be, or is
there the potential for data loss/corruption?

Tim Delaney
Sean Farley - Sept. 21, 2013, 7:36 p.m.
timothy.c.delaney@gmail.com writes:

> On 22 September 2013 05:08, Sean Farley <sean.michael.farley@gmail.com> wrote:
>
>>
>> timothy.c.delaney@gmail.com writes:
>>
>> > Is there a reason generaldelta is still undocumented and/or not yet the
>> > default format? It's been in since Mercurial 1.9 and I only found out
>> > about it thanks to this patch.
>>
>> Mostly, because the bundle code needed to be updated to handle
>> exchanging generaldelta (and other things, like bookmarked heads) over
>> the wire. I believe this finally made some decent progress at the last
>> sprint but still isn't done yet.
>>
>
> So given that we don't use bookmarks, is there any reason I shouldn't
> convert my central repo to use generaldelta to gain the existing benefits?
> Is it just that some things are less optimised than they could be, or is
> there the potential for data loss/corruption?

Sure, you could enable it but as Wojciech's patch showed there was a
potential for generaldelta to be bigger than the original repo. Since
bundle 2.0 isn't quite ready to handle generaldelta, you'll only benefit
from it locally (i.e. it'll do a conversion on-the-fly when someone
clones it).

For what it's worth, I've had it enabled for a while now and haven't had any
trouble.
Tim Delaney - Sept. 21, 2013, 8 p.m.
On 22 September 2013 05:36, Sean Farley <sean.michael.farley@gmail.com> wrote:

>
> timothy.c.delaney@gmail.com writes:
>
> > On 22 September 2013 05:08, Sean Farley <sean.michael.farley@gmail.com> wrote:
> >
> >>
> >> timothy.c.delaney@gmail.com writes:
> >>
> >> > Is there a reason generaldelta is still undocumented and/or not yet
> >> > the default format? It's been in since Mercurial 1.9 and I only found
> >> > out about it thanks to this patch.
> >>
> >> Mostly, because the bundle code needed to be updated to handle
> >> exchanging generaldelta (and other things, like bookmarked heads) over
> >> the wire. I believe this finally made some decent progress at the last
> >> sprint but still isn't done yet.
> >>
> >
> > So given that we don't use bookmarks, is there any reason I shouldn't
> > convert my central repo to use generaldelta to gain the existing
> > benefits? Is it just that some things are less optimised than they could
> > be, or is there the potential for data loss/corruption?
>
> Sure, you could enable it but as Wojciech's patch showed there was a
> potential for generaldelta to be bigger than the original repo. Since
> bundle 2.0 isn't quite ready to handle generaldelta, you'll only benefit
> from it locally (i.e. it'll do a conversion on-the-fly when someone
> clones it).
>
> For what it's worth, I've had it enabled for a while now and haven't any
> trouble.
>

Thanks Sean,

I actually tried it with Wojciech's patch (just that one patch applied on
top of 2.7.1) and it made zero difference on my repo.

Just to be clear - pulling from a remote generaldelta repo (e.g. over ssh)
to a local generaldelta repo will currently have the same effect as pulling
from a non-generaldelta repo to a generaldelta repo - you get parent deltas
in the local repo, but not the benefit of reordering that you get with a
local clone. Is that correct? If so, it's still a win for me in terms of
manifest size.

There's a code freeze on the SVN repo coming up this week so it sounds like
I should take the opportunity to change my central repo to use generaldelta
(it'll probably take a couple of days to do - sitting on an Amazon micro
instance). I'll see immediate benefits in terms of disk space there and it
sets me up for the future wire protocol improvements.

Do you think it would be worthwhile documenting generaldelta in its
current state (noting the limitations)? Maybe with a release that contains
Wojciech's patch.

Cheers,

Tim Delaney
Antoine Pitrou - Sept. 21, 2013, 8:10 p.m.
On Sun, 22 Sep 2013 06:00:51 +1000
Tim Delaney <timothy.c.delaney@gmail.com> wrote:
> 
> I actually tried it with Wojciech's patch (just that one patch applied on
> top of 2.7.1) and it made zero difference on my repo.
> 
> Just to be clear - pulling from a remote generaldelta repo (e.g. over ssh)
> to a local generaldelta repo will currently have the same effect as pulling
> from a non-generaldelta repo to a generaldelta repo - you get parent deltas
> in the local repo, but not the benefit of reordering that you get with a
> local clone. Is that correct? If so, it's still a win for me in terms of
> manifest size.

For the record, the CPython repo at http://hg.python.org/cpython has
been using generaldelta for two years now without any issues. The
size savings are quite interesting:
https://mail.python.org/pipermail/python-committers/2011-July/001764.html

Also, all remote clones benefit from the savings.  A fresh clone today
is still about 210 MB, without generaldelta it would perhaps be more
than 400 MB.

(by the way, Wojciech's patch doesn't seem to make a difference here)

Regards

Antoine.
Tim Delaney - Sept. 21, 2013, 8:50 p.m.
On 22 September 2013 06:00, Tim Delaney <timothy.c.delaney@gmail.com> wrote:

> Just to be clear - pulling from a remote generaldelta repo (e.g. over ssh)
> to a local generaldelta repo will currently have the same effect as pulling
> from a non-generaldelta repo to a generaldelta repo - you get parent deltas
> in the local repo, but not the benefit of reordering that you get with a
> local clone. Is that correct? If so, it's still a win for me in terms of
> manifest size.
>

Interesting. I've just been able to try this: I cloned a remote generaldelta
repo (running under Mercurial 2.4) over ssh using Mercurial 2.7. The remote
repo had a 600MB manifest. The clone has a 27MB manifest. Cloning over SSH
gained the benefit of reordering, just like a local clone.

So under what circumstances would you *not* get the benefit of reordering?

Tim Delaney
Matt Mackall - Sept. 21, 2013, 10:06 p.m.
On Sun, 2013-09-22 at 05:01 +1000, Tim Delaney wrote:
> On 21 September 2013 17:47, Wojciech Lopata <lopek@fb.com> wrote:
> 
> > # HG changeset patch
> > # User Wojciech Lopata <lopek@fb.com>
> > # Date 1379699151 25200
> > #      Fri Sep 20 10:45:51 2013 -0700
> > # Node ID 68a30bf47fe1af70d698ddb213c95ce7331240d8
> > # Parent  1c62f9487e46a467aace39316459a5be57c55e8a
> > generaldelta: initialize basecache properly
> >
> > Previously basecache was incorrectly initialized before adding the first
> > revision from a changegroup. Basecache value influences when full
> > revisions are stored in revlog (when using generaldelta). As a result it
> > was possible to generate a generaldelta-revlog that could be bigger by
> > arbitrary factor than its non-generaldelta equivalent.
> >
> 
> Is there a reason generaldelta is still undocumented and/or not yet the
> default format? It's been in since Mercurial 1.9 and I only found out about
> it thanks to this patch. My work repo (~55000 changesets, mostly from SVN
> via hgsubversion) has had its 00manifest.d reduced from ~1.4GB to ~28MB by:

The reason is that you could still end up sending that 1.4GB over the
wire =and= taking substantially more CPU than before... because the wire
protocol can only do linear deltas and thus will have to recompute the
deltas for the old format. This will be fixed when we get the new bundle
format figured out.

You might find that a standard clone of your generaldelta repo is
smaller than your original repo.
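
To make the "linear deltas" point above concrete, here is a toy sketch rather than Mercurial's bundling code. The function name and the `delta_parent` mapping are hypothetical. It simply lists the revisions whose stored delta is not against the revision sent immediately before them, so their delta would have to be recomputed for a stream that only allows deltas against the previous revision. Full snapshots are left out of the count for simplicity.

```python
def needs_recompute(send_order, delta_parent):
    """Revisions whose stored delta is not against the previously sent one."""
    out = []
    prev = None
    for rev in send_order:
        base = delta_parent[rev]
        # A stored delta can only be reused if it applies to `prev`.
        if base is not None and base != prev:
            out.append(rev)
        prev = rev
    return out


# rev -> revision its stored delta applies to (None = full snapshot).
# Revisions 2 and 4 are stored against a parent that is not their predecessor.
delta_parent = {0: None, 1: 0, 2: 0, 3: 2, 4: 1}
print(needs_recompute([0, 1, 2, 3, 4], delta_parent))
```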
Matt Mackall - Sept. 21, 2013, 10:30 p.m.
On Sat, 2013-09-21 at 00:47 -0700, Wojciech Lopata wrote:
> # HG changeset patch
> # User Wojciech Lopata <lopek@fb.com>
> # Date 1379699151 25200
> #      Fri Sep 20 10:45:51 2013 -0700
> # Node ID 68a30bf47fe1af70d698ddb213c95ce7331240d8
> # Parent  1c62f9487e46a467aace39316459a5be57c55e8a
> generaldelta: initialize basecache properly

Queued for stable, thanks. Check-code sends its regards.
Tim Delaney - Sept. 21, 2013, 11:47 p.m.
On 22 September 2013 08:06, Matt Mackall <mpm@selenic.com> wrote:

> On Sun, 2013-09-22 at 05:01 +1000, Tim Delaney wrote:
> > Is there a reason generaldelta is still undocumented and/or not yet the
> > default format? It's been in since Mercurial 1.9 and I only found out
> > about it thanks to this patch. My work repo (~55000 changesets, mostly
> > from SVN via hgsubversion) has had its 00manifest.d reduced from ~1.4GB
> > to ~28MB by:
>
> The reason is that you could still end up sending that 1.4GB over the
> wire =and= taking substantially more CPU than before... because the wire
> protocol can only do linear deltas and thus will have to recompute the
> deltas for the old format. This will be fixed when we get the new bundle
> format figured out.
>
> You might find that a standard clone of your generaldelta repo is
> smaller than your original repo.
>

Not quite, but it's close - ~31MB compared to ~27MB.

I think I might have got my understanding backwards before. Background: my
repo is a lot of fairly unrelated branches. The SVN repo is essentially
several unrelated repos implemented as different branches, plus a number of
related feature branches where no merging occurs (just branching off). Any
and all of the branches may be committed to, resulting in a lot of
completely unrelated interleaved commits (hence the 1.4GB manifest).

Based on what I'm seeing, this is what I think is happening when pulling
from a remote generaldelta repo to a local generaldelta repo over ssh.
Please correct me if I've got it wrong. If I'm right, generaldelta will be
a substantial win for my repo even with the existing wire protocol.

1. Remote reorders to produce the longest chains it can such that prev will
be a parent.

2. Remote recomputes the deltas.

There could be substantial savings here if the original order results in
lots of interleaved unrelated branches, but reordering results in long
chains on the same branch. This is what I would expect with my repo.

3. Local receives the deltas and then recomputes generaldelta. The
changesets are already in near-optimal order.

The debugrevlog -m supports this:

original standard repo:

deltas against prev  : 54779 (100.00%)
    where prev = p1  : 46283     (84.49%)
    where prev = p2  :   240     ( 0.44%)
    other            :  8256     (15.07%)

generaldelta cloned from standard locally (numbers slightly lower here -
hadn't pulled all changesets in at this point):

deltas against prev  : 47701 (86.95%)
    where prev = p1  : 46245     (96.95%)
    where prev = p2  :    91     ( 0.19%)
    other            :  1365     ( 2.86%)
deltas against p1    :  7129 (12.99%)
deltas against p2    :    32 ( 0.06%)
deltas against other :     0 ( 0.00%)

generaldelta cloned from generaldelta over ssh:

deltas against prev  : 55390 (99.69%)
    where prev = p1  : 55341     (99.91%)
    where prev = p2  :     3     ( 0.01%)
    other            :    46     ( 0.08%)
deltas against p1    :   167 ( 0.30%)
deltas against p2    :     3 ( 0.01%)
deltas against other :     0 ( 0.00%)

Tim Delaney
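
The effect of reordering on the "where prev = p1" figures above can be illustrated with a toy example. This is only a sketch under simplifying assumptions, not the algorithm Mercurial uses: each revision has at most one parent, and the function names are hypothetical. Two unrelated branches committed alternately never have their parent as the previous revision. Emitting each branch as one contiguous run fixes that for all but the branch roots.

```python
def count_prev_is_parent(order, parent):
    """Count revisions whose immediate predecessor in `order` is their parent."""
    return sum(1 for prev, rev in zip(order, order[1:]) if parent[rev] == prev)


def reorder_depth_first(parent):
    """Return an ordering in which each branch forms one contiguous run."""
    children = {}
    roots = []
    for rev in sorted(parent):
        p = parent[rev]
        if p is None:
            roots.append(rev)
        else:
            children.setdefault(p, []).append(rev)
    order = []
    stack = list(reversed(roots))
    while stack:
        rev = stack.pop()
        order.append(rev)
        # Push children so the lowest-numbered child is visited first.
        stack.extend(reversed(children.get(rev, [])))
    return order


# Two unrelated branches (even and odd revisions) committed alternately.
parent = {0: None, 1: None, 2: 0, 3: 1, 4: 2, 5: 3, 6: 4, 7: 5}
interleaved = sorted(parent)
reordered = reorder_depth_first(parent)

print(interleaved, count_prev_is_parent(interleaved, parent))
print(reordered, count_prev_is_parent(reordered, parent))
```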
Tim Delaney - Sept. 21, 2013, 11:50 p.m.
On 22 September 2013 09:47, Tim Delaney <timothy.c.delaney@gmail.com> wrote:

> On 22 September 2013 08:06, Matt Mackall <mpm@selenic.com> wrote:
>
>> On Sun, 2013-09-22 at 05:01 +1000, Tim Delaney wrote:
>> > Is there a reason generaldelta is still undocumented and/or not yet the
>> > default format? It's been in since Mercurial 1.9 and I only found out
>> > about it thanks to this patch. My work repo (~55000 changesets, mostly
>> > from SVN via hgsubversion) has had its 00manifest.d reduced from ~1.4GB
>> > to ~28MB by:
>>
>> The reason is that you could still end up sending that 1.4GB over the
>> wire =and= taking substantially more CPU than before... because the wire
>> protocol can only do linear deltas and thus will have to recompute the
>> deltas for the old format. This will be fixed when we get the new bundle
>> format figured out.
>>
>> You might find that a standard clone of your generaldelta repo is
>> smaller than your original repo.
>>
>
> Not quite, but it's close - ~31MB compared to ~27MB.
>

Sorry - of course that's compared to the final generaldelta repo. The
standard clone manifest is *much* smaller than the original repo, but
depending on where it received changesets from could then potentially grow
quickly again.

Tim Delaney

Patch

diff --git a/mercurial/revlog.py b/mercurial/revlog.py
--- a/mercurial/revlog.py
+++ b/mercurial/revlog.py
@@ -200,7 +200,7 @@ 
         self.datafile = indexfile[:-2] + ".d"
         self.opener = opener
         self._cache = None
-        self._basecache = (0, 0)
+        self._basecache = None
         self._chunkcache = (0, '')
         self.index = []
         self._pcache = {}
@@ -1131,6 +1131,8 @@ 
         offset = self.end(prev)
         flags = 0
         d = None
+        if self._basecache is None:
+            self._basecache = (prev, self.chainbase(prev))
         basecache = self._basecache
         p1r, p2r = self.rev(p1), self.rev(p2)
 
diff --git a/tests/test-generaldelta.t b/tests/test-generaldelta.t
new file mode 100755
--- /dev/null
+++ b/tests/test-generaldelta.t
@@ -0,0 +1,23 @@ 
+Check whether size of generaldelta revlog is not bigger than its regular
+equivalent. Test would fail if generaldelta was naive implementation of
+parentdelta: third manifest revision would be fully inserted due to big distance
+from its paren revision (zero).
+
+  $ hg init repo
+  $ cd repo
+  $ echo foo > foo
+  $ echo bar > bar
+  $ hg commit -q -Am boo
+  $ hg clone --pull . ../gdrepo -q --config format.generaldelta=yes
+  $ for r in 1 2 3; do
+  >   echo $r > foo
+  >   hg commit -q -m $r
+  >   hg up -q -r 0
+  >   hg pull . -q -r $r -R ../gdrepo
+  > done
+  $ cd ..
+  $ regsize=$(du -s -b repo/.hg/store/00manifest.i | cut -f 1)
+  $ gdsize=$(du -s -b gdrepo/.hg/store/00manifest.i | cut -f 1)
+  $ if ((regsize < gdsize)); then
+  >   echo 'generaldelta increased size of a revlog!'
+  > fi