Patchwork repair: compress bundles with gzip instead of bzip2 (BC)

login
register
mail settings
Submitter Gregory Szorc
Date May 14, 2017, 6:33 a.m.
Message ID <03c14dca1ff875ed9f46.1494743607@ubuntu-vm-main>
Download mbox | patch
Permalink /patch/20618/
State Accepted
Headers show

Comments

Gregory Szorc - May 14, 2017, 6:33 a.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1494743508 25200
#      Sat May 13 23:31:48 2017 -0700
# Node ID 03c14dca1ff875ed9f4650166ada89baa02c5843
# Parent  78496ac300255e9996b3e282086661afc08af37c
repair: compress bundles with gzip instead of bzip2 (BC)

repair.strip() produces up to 2 bundles: a persisted "backup"
bundle (containing stripped data) and a temporary bundle
(containing data that will be re-applied post strip). The
temporary bundle is uncompressed and the backup bundle is
compressed with bzip2.

I imagine bzip2 was chosen as the compression format for backup
bundles because it yields the best compression. That makes sense:
you don't want backup bundles wasting space. But at the same time,
you want operations to complete fast as well.

bzip2 is likely 3-5x slower than zlib/gip. If you are compressing
a large amount of data, that speed difference can translate to
seconds. It can also translate to lost megabytes. However,
storage space gets cheaper every year and single core CPU
performance isn't scaling as well as it was 10 years ago when
this default was likely chosen. I think we should go with the
faster compression algorithm at the expense of larger bundles.

I *would* like to change the default to zstd (if available). However,
that's a significant BC break and I don't feel comfortable changing
it just yet. When we have official zstd revlog support and the
repo is using it, I think it makes sense to use zstd for backup
bundles. But for zlib-based revlogs, it's safer to avoid zstd
for the time being.

This change does make some `hg strip` operations faster if large
amounts of data are being compressed. However, execution time is
often dominated by creating and applying the temporary bundle, so
the benefit likely isn't night and day.
Augie Fackler - May 15, 2017, 5:39 p.m.
On Sat, May 13, 2017 at 11:33:27PM -0700, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1494743508 25200
> #      Sat May 13 23:31:48 2017 -0700
> # Node ID 03c14dca1ff875ed9f4650166ada89baa02c5843
> # Parent  78496ac300255e9996b3e282086661afc08af37c
> repair: compress bundles with gzip instead of bzip2 (BC)

Can you give me an idea of how much bigger the bundles are on disk?
Most users never clean these up.

> repair.strip() produces up to 2 bundles: a persisted "backup"
> bundle (containing stripped data) and a temporary bundle
> (containing data that will be re-applied post strip). The
> temporary bundle is uncompressed and the backup bundle is
> compressed with bzip2.

Should that temporary bundle be hit with zstd if available? It's
scoped to a single execution (more or less), so it ought to be safe
and a nearly-free IO win for minimal CPU.

>
> I imagine bzip2 was chosen as the compression format for backup
> bundles because it yields the best compression. That makes sense:
> you don't want backup bundles wasting space. But at the same time,
> you want operations to complete fast as well.
>
> bzip2 is likely 3-5x slower than zlib/gip. If you are compressing
> a large amount of data, that speed difference can translate to
> seconds. It can also translate to lost megabytes. However,
> storage space gets cheaper every year and single core CPU
> performance isn't scaling as well as it was 10 years ago when
> this default was likely chosen. I think we should go with the
> faster compression algorithm at the expense of larger bundles.
>
> I *would* like to change the default to zstd (if available). However,
> that's a significant BC break and I don't feel comfortable changing
> it just yet. When we have official zstd revlog support and the
> repo is using it, I think it makes sense to use zstd for backup
> bundles. But for zlib-based revlogs, it's safer to avoid zstd
> for the time being.
>
> This change does make some `hg strip` operations faster if large
> amounts of data are being compressed. However, execution time is
> often dominated by creating and applying the temporary bundle, so
> the benefit likely isn't night and day.
>
> diff --git a/mercurial/repair.py b/mercurial/repair.py
> --- a/mercurial/repair.py
> +++ b/mercurial/repair.py
> @@ -43,9 +43,9 @@ def _bundle(repo, bases, heads, node, su
>      if cgversion != '01':
>          bundletype = "HG20"
>          if compress:
> -            comp = 'BZ'
> +            comp = 'GZ'
>      elif compress:
> -        bundletype = "HG10BZ"
> +        bundletype = "HG10GZ"
>      else:
>          bundletype = "HG10UN"
>      return bundle2.writebundle(repo.ui, cg, name, bundletype, vfs,
> diff --git a/tests/test-generaldelta.t b/tests/test-generaldelta.t
> --- a/tests/test-generaldelta.t
> +++ b/tests/test-generaldelta.t
> @@ -154,7 +154,7 @@ Test that strip bundle use bundle2
>    0 files updated, 0 files merged, 5 files removed, 0 files unresolved
>    saved backup bundle to $TESTTMP/aggressive/.hg/strip-backup/1c5d4dc9a8b8-6c68e60c-backup.hg (glob)
>    $ hg debugbundle .hg/strip-backup/*
> -  Stream params: sortdict([('Compression', 'BZ')])
> +  Stream params: sortdict([('Compression', 'GZ')])
>    changegroup -- "sortdict([('version', '02'), ('nbchanges', '1')])"
>        1c5d4dc9a8b8d6e1750966d343e94db665e7a1e9
>
> diff --git a/tests/test-strip.t b/tests/test-strip.t
> --- a/tests/test-strip.t
> +++ b/tests/test-strip.t
> @@ -210,7 +210,7 @@
>    summary:     b
>
>    $ hg debugbundle .hg/strip-backup/*
> -  Stream params: sortdict([('Compression', 'BZ')])
> +  Stream params: sortdict([('Compression', 'GZ')])
>    changegroup -- "sortdict([('version', '02'), ('nbchanges', '1')])"
>        264128213d290d868c54642d13aeaa3675551a78
>    $ hg pull .hg/strip-backup/*
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Siddharth Agarwal - May 15, 2017, 7:32 p.m.
On 5/13/17 23:33, Gregory Szorc wrote:
> I*would*  like to change the default to zstd (if available). However,
> that's a significant BC break and I don't feel comfortable changing
> it just yet. When we have official zstd revlog support and the
> repo is using it, I think it makes sense to use zstd for backup
> bundles. But for zlib-based revlogs, it's safer to avoid zstd
> for the time being.


Even if the backup bundle uses gzip, could the temp bundle use zstd? 
IIRC the temp bundle can potentially be much larger than the backup 
bundle because it contains all of the commits with a rev number higher 
than the lowest one being stripped.

- Siddharth

Patch

diff --git a/mercurial/repair.py b/mercurial/repair.py
--- a/mercurial/repair.py
+++ b/mercurial/repair.py
@@ -43,9 +43,9 @@  def _bundle(repo, bases, heads, node, su
     if cgversion != '01':
         bundletype = "HG20"
         if compress:
-            comp = 'BZ'
+            comp = 'GZ'
     elif compress:
-        bundletype = "HG10BZ"
+        bundletype = "HG10GZ"
     else:
         bundletype = "HG10UN"
     return bundle2.writebundle(repo.ui, cg, name, bundletype, vfs,
diff --git a/tests/test-generaldelta.t b/tests/test-generaldelta.t
--- a/tests/test-generaldelta.t
+++ b/tests/test-generaldelta.t
@@ -154,7 +154,7 @@  Test that strip bundle use bundle2
   0 files updated, 0 files merged, 5 files removed, 0 files unresolved
   saved backup bundle to $TESTTMP/aggressive/.hg/strip-backup/1c5d4dc9a8b8-6c68e60c-backup.hg (glob)
   $ hg debugbundle .hg/strip-backup/*
-  Stream params: sortdict([('Compression', 'BZ')])
+  Stream params: sortdict([('Compression', 'GZ')])
   changegroup -- "sortdict([('version', '02'), ('nbchanges', '1')])"
       1c5d4dc9a8b8d6e1750966d343e94db665e7a1e9
 
diff --git a/tests/test-strip.t b/tests/test-strip.t
--- a/tests/test-strip.t
+++ b/tests/test-strip.t
@@ -210,7 +210,7 @@ 
   summary:     b
   
   $ hg debugbundle .hg/strip-backup/*
-  Stream params: sortdict([('Compression', 'BZ')])
+  Stream params: sortdict([('Compression', 'GZ')])
   changegroup -- "sortdict([('version', '02'), ('nbchanges', '1')])"
       264128213d290d868c54642d13aeaa3675551a78
   $ hg pull .hg/strip-backup/*