Patchwork [5,of,5,RFC] revlog: increase I/O bound to 4x the amount of data consumed

login
register
mail settings
Submitter Siddharth Agarwal
Date Nov. 12, 2014, 11:09 p.m.
Message ID <14df598f1c13fa7fb04d.1415833752@devbig136.prn2.facebook.com>
Download mbox | patch
Permalink /patch/6695/
State Accepted
Commit 2b9bc796350423eef9abe8db2a637fe0a5eb7af2
Headers show

Comments

Siddharth Agarwal - Nov. 12, 2014, 11:09 p.m.
# HG changeset patch
# User Siddharth Agarwal <sid0@fb.com>
# Date 1415765299 28800
#      Tue Nov 11 20:08:19 2014 -0800
# Node ID 14df598f1c13fa7fb04daf652efb94bd0f55e8c9
# Parent  ec432ebb569fd4537093a3c446ccaefce5298636
revlog: increase I/O bound to 4x the amount of data consumed

This doesn't affect normal clones since they'd be bound by the CPU bound below
anyway -- it does, however, improve generaldelta clones significantly.

This also results in better deltaing for generaldelta clones -- in generaldelta
clones, we calculate deltas with respect to the closest base if it has a higher
revision number than either parent. If the base is on a significantly different
branch, this can result in pointlessly massive deltas. This reduces the number
of bases and hence the number of bad deltas.

Empirically, for a highly branchy repository, this resulted in an improvement
of around 15% to manifest size.

The sole test change is because in that case the compressed text is smaller
than the compressed delta. (We already use fulltexts if the uncompressed text
is smaller than the compressed delta.)
Matt Mackall - Nov. 12, 2014, 11:21 p.m.
On Wed, 2014-11-12 at 15:09 -0800, Siddharth Agarwal wrote:
> # HG changeset patch
> # User Siddharth Agarwal <sid0@fb.com>
> # Date 1415765299 28800
> #      Tue Nov 11 20:08:19 2014 -0800
> # Node ID 14df598f1c13fa7fb04daf652efb94bd0f55e8c9
> # Parent  ec432ebb569fd4537093a3c446ccaefce5298636
> revlog: increase I/O bound to 4x the amount of data consumed

These are shiny enough that I'm just going to queue them, thanks.
Mads Kiilerich - Nov. 13, 2014, 1:18 a.m.
On 11/13/2014 12:09 AM, Siddharth Agarwal wrote:
> # HG changeset patch
> # User Siddharth Agarwal <sid0@fb.com>
> # Date 1415765299 28800
> #      Tue Nov 11 20:08:19 2014 -0800
> # Node ID 14df598f1c13fa7fb04daf652efb94bd0f55e8c9
> # Parent  ec432ebb569fd4537093a3c446ccaefce5298636
> revlog: increase I/O bound to 4x the amount of data consumed
>
> This doesn't affect normal clones since they'd be bound by the CPU bound below
> anyway -- it does, however, improve generaldelta clones significantly.
>
> This also results in better deltaing for generaldelta clones -- in generaldelta
> clones, we calculate deltas with respect to the closest base if it has a higher
> revision number than either parent. If the base is on a significantly different
> branch, this can result in pointlessly massive deltas. This reduces the number
> of bases and hence the number of bad deltas.
>
> Empirically, for a highly branchy repository, this resulted in an improvement
> of around 15% to manifest size.
>
> The sole test change is because in that case the compressed text is smaller
> than the compressed delta. (We already use fulltexts if the uncompressed text
> is smaller than the compressed delta.)

There is no test change in the patch. You ended up addressing it in 
patch 2 instead?

/Mads

>
> diff --git a/mercurial/revlog.py b/mercurial/revlog.py
> --- a/mercurial/revlog.py
> +++ b/mercurial/revlog.py
> @@ -1267,7 +1267,7 @@
>           #   the amount of I/O we need to do.
>           # - 'compresseddeltalen' is the sum of the total size of deltas we need
>           #   to apply -- bounding it limits the amount of CPU we consume.
> -        if (d is None or dist > textlen * 2 or l > textlen or
> +        if (d is None or dist > textlen * 4 or l > textlen or
>               compresseddeltalen > textlen * 2 or
>               (self._maxchainlen and chainlen > self._maxchainlen)):
>               text = buildtext()
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
Siddharth Agarwal - Nov. 13, 2014, 1:24 a.m.
On 11/12/2014 05:18 PM, Mads Kiilerich wrote:
> On 11/13/2014 12:09 AM, Siddharth Agarwal wrote:
>> # HG changeset patch
>> # User Siddharth Agarwal <sid0@fb.com>
>> # Date 1415765299 28800
>> #      Tue Nov 11 20:08:19 2014 -0800
>> # Node ID 14df598f1c13fa7fb04daf652efb94bd0f55e8c9
>> # Parent  ec432ebb569fd4537093a3c446ccaefce5298636
>> revlog: increase I/O bound to 4x the amount of data consumed
>>
>> This doesn't affect normal clones since they'd be bound by the CPU 
>> bound below
>> anyway -- it does, however, improve generaldelta clones significantly.
>>
>> This also results in better deltaing for generaldelta clones -- in 
>> generaldelta
>> clones, we calculate deltas with respect to the closest base if it 
>> has a higher
>> revision number than either parent. If the base is on a significantly 
>> different
>> branch, this can result in pointlessly massive deltas. This reduces 
>> the number
>> of bases and hence the number of bad deltas.
>>
>> Empirically, for a highly branchy repository, this resulted in an 
>> improvement
>> of around 15% to manifest size.
>>
>> The sole test change is because in that case the compressed text is 
>> smaller
>> than the compressed delta. (We already use fulltexts if the 
>> uncompressed text
>> is smaller than the compressed delta.)
>
> There is no test change in the patch. You ended up addressing it in 
> patch 2 instead?

I tweaked a parameter slightly and the test change disappeared. I forgot 
to update the patch description.

- Siddharth

>
> /Mads
>
>>
>> diff --git a/mercurial/revlog.py b/mercurial/revlog.py
>> --- a/mercurial/revlog.py
>> +++ b/mercurial/revlog.py
>> @@ -1267,7 +1267,7 @@
>>           #   the amount of I/O we need to do.
>>           # - 'compresseddeltalen' is the sum of the total size of 
>> deltas we need
>>           #   to apply -- bounding it limits the amount of CPU we 
>> consume.
>> -        if (d is None or dist > textlen * 2 or l > textlen or
>> +        if (d is None or dist > textlen * 4 or l > textlen or
>>               compresseddeltalen > textlen * 2 or
>>               (self._maxchainlen and chainlen > self._maxchainlen)):
>>               text = buildtext()
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel@selenic.com
>> http://selenic.com/mailman/listinfo/mercurial-devel
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
Antoine Pitrou - Nov. 13, 2014, 9:07 a.m.
How should one test the effects (positive or negative) of this change
on a repository?



On Wed, 12 Nov 2014 15:09:12 -0800
Siddharth Agarwal <sid0@fb.com> wrote:
> # HG changeset patch
> # User Siddharth Agarwal <sid0@fb.com>
> # Date 1415765299 28800
> #      Tue Nov 11 20:08:19 2014 -0800
> # Node ID 14df598f1c13fa7fb04daf652efb94bd0f55e8c9
> # Parent  ec432ebb569fd4537093a3c446ccaefce5298636
> revlog: increase I/O bound to 4x the amount of data consumed
> 
> This doesn't affect normal clones since they'd be bound by the CPU bound below
> anyway -- it does, however, improve generaldelta clones significantly.
> 
> This also results in better deltaing for generaldelta clones -- in generaldelta
> clones, we calculate deltas with respect to the closest base if it has a higher
> revision number than either parent. If the base is on a significantly different
> branch, this can result in pointlessly massive deltas. This reduces the number
> of bases and hence the number of bad deltas.
> 
> Empirically, for a highly branchy repository, this resulted in an improvement
> of around 15% to manifest size.
> 
> The sole test change is because in that case the compressed text is smaller
> than the compressed delta. (We already use fulltexts if the uncompressed text
> is smaller than the compressed delta.)
> 
> diff --git a/mercurial/revlog.py b/mercurial/revlog.py
> --- a/mercurial/revlog.py
> +++ b/mercurial/revlog.py
> @@ -1267,7 +1267,7 @@
>          #   the amount of I/O we need to do.
>          # - 'compresseddeltalen' is the sum of the total size of deltas we need
>          #   to apply -- bounding it limits the amount of CPU we consume.
> -        if (d is None or dist > textlen * 2 or l > textlen or
> +        if (d is None or dist > textlen * 4 or l > textlen or
>              compresseddeltalen > textlen * 2 or
>              (self._maxchainlen and chainlen > self._maxchainlen)):
>              text = buildtext()
Matt Mackall - Nov. 13, 2014, 10:30 p.m.
On Wed, 2014-11-12 at 17:24 -0800, Siddharth Agarwal wrote:

> >> The sole test change is because in that case the compressed text is 
> >> smaller
> >> than the compressed delta. (We already use fulltexts if the 
> >> uncompressed text
> >> is smaller than the compressed delta.)
> >
> > There is no test change in the patch. You ended up addressing it in 
> > patch 2 instead?
> 
> I tweaked a parameter slightly and the test change disappeared. I forgot 
> to update the patch description.

Fixed in flight.
Siddharth Agarwal - Nov. 13, 2014, 10:41 p.m.
On 11/13/2014 01:07 AM, Antoine Pitrou wrote:
> How should one test the effects (positive or negative) of this change
> on a repository?

The easiest way I've found is to strip and reapply a bundle.

hg strip -r -2000: --config bundle.reorder=False
hg unbundle <strip backup>

The bundle.reorder config is necessary to make sure that the bundle 
generation code doesn't linearize the bundle. Linearizing the bundle 
makes this optimization redundant.

>
>
>
> On Wed, 12 Nov 2014 15:09:12 -0800
> Siddharth Agarwal <sid0@fb.com> wrote:
>> # HG changeset patch
>> # User Siddharth Agarwal <sid0@fb.com>
>> # Date 1415765299 28800
>> #      Tue Nov 11 20:08:19 2014 -0800
>> # Node ID 14df598f1c13fa7fb04daf652efb94bd0f55e8c9
>> # Parent  ec432ebb569fd4537093a3c446ccaefce5298636
>> revlog: increase I/O bound to 4x the amount of data consumed
>>
>> This doesn't affect normal clones since they'd be bound by the CPU bound below
>> anyway -- it does, however, improve generaldelta clones significantly.
>>
>> This also results in better deltaing for generaldelta clones -- in generaldelta
>> clones, we calculate deltas with respect to the closest base if it has a higher
>> revision number than either parent. If the base is on a significantly different
>> branch, this can result in pointlessly massive deltas. This reduces the number
>> of bases and hence the number of bad deltas.
>>
>> Empirically, for a highly branchy repository, this resulted in an improvement
>> of around 15% to manifest size.
>>
>> The sole test change is because in that case the compressed text is smaller
>> than the compressed delta. (We already use fulltexts if the uncompressed text
>> is smaller than the compressed delta.)
>>
>> diff --git a/mercurial/revlog.py b/mercurial/revlog.py
>> --- a/mercurial/revlog.py
>> +++ b/mercurial/revlog.py
>> @@ -1267,7 +1267,7 @@
>>           #   the amount of I/O we need to do.
>>           # - 'compresseddeltalen' is the sum of the total size of deltas we need
>>           #   to apply -- bounding it limits the amount of CPU we consume.
>> -        if (d is None or dist > textlen * 2 or l > textlen or
>> +        if (d is None or dist > textlen * 4 or l > textlen or
>>               compresseddeltalen > textlen * 2 or
>>               (self._maxchainlen and chainlen > self._maxchainlen)):
>>               text = buildtext()
>
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel

Patch

diff --git a/mercurial/revlog.py b/mercurial/revlog.py
--- a/mercurial/revlog.py
+++ b/mercurial/revlog.py
@@ -1267,7 +1267,7 @@ 
         #   the amount of I/O we need to do.
         # - 'compresseddeltalen' is the sum of the total size of deltas we need
         #   to apply -- bounding it limits the amount of CPU we consume.
-        if (d is None or dist > textlen * 2 or l > textlen or
+        if (d is None or dist > textlen * 4 or l > textlen or
             compresseddeltalen > textlen * 2 or
             (self._maxchainlen and chainlen > self._maxchainlen)):
             text = buildtext()