Patchwork generaldelta: initialize basecache properly

login
register
mail settings
Submitter Wojciech Lopata
Date Sept. 20, 2013, 7:31 p.m.
Message ID <e3bf4ac814f13279287e.1379705461@dev1179.prn1.facebook.com>
Download mbox | patch
Permalink /patch/2577/
State Superseded
Commit e92650e39f1cd8ff7565c583e8bf0fa0bdac364d
Headers show

Comments

Wojciech Lopata - Sept. 20, 2013, 7:31 p.m.
# HG changeset patch
# User Wojciech Lopata <lopek@fb.com>
# Date 1379699151 25200
#      Fri Sep 20 10:45:51 2013 -0700
# Node ID e3bf4ac814f13279287e28742f816856bba178ff
# Parent  1c62f9487e46a467aace39316459a5be57c55e8a
generaldelta: initialize basecache properly

Previously basecache was incorrectly inicialized before add of first revision
from a changegroup. Basecache value infuences when full revisions are stored
in revlog (when using generaldelta). As a result it was possible to generate
a generaldelta-revlog that could be bigger by arbitrary factor than its
non-generaldelta equivalent.
Wojciech Lopata - Sept. 20, 2013, 7:32 p.m.
http://pastebin.com/xV6H7ic3 - this scripts generates a regular repository and then pulls commits from it one by one to a generaldelta repository.

Size of 00manifest.d in regular repo was 156K, while with generaldelta it hit 808K.

In contrast size of 00manifest.d in a generaldelta repository that was created with 'hg clone --pull' was 156K as well, since bug was not revealed when pulling commits in big packets.

> -----Original Message-----
> From: mercurial-devel-bounces@selenic.com [mailto:mercurial-devel-
> bounces@selenic.com] On Behalf Of Wojciech Lopata
> Sent: Friday, September 20, 2013 12:31 PM
> To: mercurial-devel@selenic.com
> Subject: [PATCH] generaldelta: initialize basecache properly
> 
> # HG changeset patch
> # User Wojciech Lopata <lopek@fb.com>
> # Date 1379699151 25200
> #      Fri Sep 20 10:45:51 2013 -0700
> # Node ID e3bf4ac814f13279287e28742f816856bba178ff
> # Parent  1c62f9487e46a467aace39316459a5be57c55e8a
> generaldelta: initialize basecache properly
> 
> Previously basecache was incorrectly inicialized before add of first revision
> from a changegroup. Basecache value infuences when full revisions are
> stored in revlog (when using generaldelta). As a result it was possible to
> generate a generaldelta-revlog that could be bigger by arbitrary factor than
> its non-generaldelta equivalent.
> 
> diff --git a/mercurial/revlog.py b/mercurial/revlog.py
> --- a/mercurial/revlog.py
> +++ b/mercurial/revlog.py
> @@ -200,7 +200,7 @@
>          self.datafile = indexfile[:-2] + ".d"
>          self.opener = opener
>          self._cache = None
> -        self._basecache = (0, 0)
> +        self._basecache = None
>          self._chunkcache = (0, '')
>          self.index = []
>          self._pcache = {}
> @@ -1131,6 +1131,8 @@
>          offset = self.end(prev)
>          flags = 0
>          d = None
> +        if self._basecache is None:
> +            self._basecache = (prev, self.chainbase(prev))
>          basecache = self._basecache
>          p1r, p2r = self.rev(p1), self.rev(p2)
> 
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
Durham Goode - Sept. 20, 2013, 11:16 p.m.
On 9/20/13 12:31 PM, "Wojciech Lopata" <lopek@fb.com> wrote:

># HG changeset patch
># User Wojciech Lopata <lopek@fb.com>
># Date 1379699151 25200
>#      Fri Sep 20 10:45:51 2013 -0700
># Node ID e3bf4ac814f13279287e28742f816856bba178ff
># Parent  1c62f9487e46a467aace39316459a5be57c55e8a
>generaldelta: initialize basecache properly
>
>Previously basecache was incorrectly inicialized before add of first
>revision

Šinitialized before adding the first revisionŠ

>from a changegroup. Basecache value infuences when full revisions are
>stored

influences

>in revlog (when using generaldelta). As a result it was possible to
>generate
>a generaldelta-revlog that could be bigger by arbitrary factor than its
>non-generaldelta equivalent.
>
>diff --git a/mercurial/revlog.py b/mercurial/revlog.py
>--- a/mercurial/revlog.py
>+++ b/mercurial/revlog.py
>@@ -200,7 +200,7 @@
>         self.datafile = indexfile[:-2] + ".d"
>         self.opener = opener
>         self._cache = None
>-        self._basecache = (0, 0)
>+        self._basecache = None
>         self._chunkcache = (0, '')
>         self.index = []
>         self._pcache = {}
>@@ -1131,6 +1131,8 @@
>         offset = self.end(prev)
>         flags = 0
>         d = None
>+        if self._basecache is None:
>+            self._basecache = (prev, self.chainbase(prev))
>         basecache = self._basecache
>         p1r, p2r = self.rev(p1), self.rev(p2)

It's a bit hard to understand the implications of this change without
knowing the general delta code in detail. Is it possible to write a test
so we can see the improvement and to prevent regressions?
Wojciech Lopata - Sept. 21, 2013, 7:49 a.m.
I've resent the patch with a minimal test added.

Btw. I cloned crew repo with generaldelta turn on and results were pretty impressive - manifest size went down from 19M to 4.1M. In this case generaldelta gives a lot more than everything what I've tried on the manifest level.

> -----Original Message-----
> From: Durham Goode
> Sent: Friday, September 20, 2013 4:16 PM
> To: Wojciech Lopata; mercurial-devel@selenic.com
> Subject: Re: [PATCH] generaldelta: initialize basecache properly
> 
> On 9/20/13 12:31 PM, "Wojciech Lopata" <lopek@fb.com> wrote:
> 
> ># HG changeset patch
> ># User Wojciech Lopata <lopek@fb.com>
> ># Date 1379699151 25200
> >#      Fri Sep 20 10:45:51 2013 -0700
> ># Node ID e3bf4ac814f13279287e28742f816856bba178ff
> ># Parent  1c62f9487e46a467aace39316459a5be57c55e8a
> >generaldelta: initialize basecache properly
> >
> >Previously basecache was incorrectly inicialized before add of first
> >revision
> 
> Šinitialized before adding the first revisionŠ
> 
> >from a changegroup. Basecache value infuences when full revisions are
> >stored
> 
> influences
> 
> >in revlog (when using generaldelta). As a result it was possible to
> >generate a generaldelta-revlog that could be bigger by arbitrary factor
> >than its non-generaldelta equivalent.
> >
> >diff --git a/mercurial/revlog.py b/mercurial/revlog.py
> >--- a/mercurial/revlog.py
> >+++ b/mercurial/revlog.py
> >@@ -200,7 +200,7 @@
> >         self.datafile = indexfile[:-2] + ".d"
> >         self.opener = opener
> >         self._cache = None
> >-        self._basecache = (0, 0)
> >+        self._basecache = None
> >         self._chunkcache = (0, '')
> >         self.index = []
> >         self._pcache = {}
> >@@ -1131,6 +1131,8 @@
> >         offset = self.end(prev)
> >         flags = 0
> >         d = None
> >+        if self._basecache is None:
> >+            self._basecache = (prev, self.chainbase(prev))
> >         basecache = self._basecache
> >         p1r, p2r = self.rev(p1), self.rev(p2)
> 
> It's a bit hard to understand the implications of this change without knowing
> the general delta code in detail. Is it possible to write a test so we can see the
> improvement and to prevent regressions?
Sean Farley - Sept. 21, 2013, 7 p.m.
lopek@fb.com writes:

> I've resent the patch with a minimal test added.
>
> Btw. I cloned crew repo with generaldelta turn on and results were pretty impressive - manifest size went down from 19M to 4.1M. In this case generaldelta gives a lot more than everything what I've tried on the manifest level.

Just for reference, without your patch, generaldelta still compresses
the manifest for the crew repo down to 4.8M.

Patch

diff --git a/mercurial/revlog.py b/mercurial/revlog.py
--- a/mercurial/revlog.py
+++ b/mercurial/revlog.py
@@ -200,7 +200,7 @@ 
         self.datafile = indexfile[:-2] + ".d"
         self.opener = opener
         self._cache = None
-        self._basecache = (0, 0)
+        self._basecache = None
         self._chunkcache = (0, '')
         self.index = []
         self._pcache = {}
@@ -1131,6 +1131,8 @@ 
         offset = self.end(prev)
         flags = 0
         d = None
+        if self._basecache is None:
+            self._basecache = (prev, self.chainbase(prev))
         basecache = self._basecache
         p1r, p2r = self.rev(p1), self.rev(p2)