Patchwork [4,of,4,RFC,stream,clone,bundles] commands.unbundle: support consuming streaming clone bundles

login
register
mail settings
Submitter Gregory Szorc
Date Oct. 14, 2015, 10:34 p.m.
Message ID <c927f2b34d0a420d0741.1444862043@gps-mbp.local>
Download mbox | patch
Permalink /patch/11079/
State Deferred
Headers show

Comments

Gregory Szorc - Oct. 14, 2015, 10:34 p.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1444861540 25200
#      Wed Oct 14 15:25:40 2015 -0700
# Node ID c927f2b34d0a420d074152464e854901c746527a
# Parent  e577b42c94d917eb3fbbe866bf66e455a251bb1b
commands.unbundle: support consuming streaming clone bundles

Now that we can produce streaming clone bundles `hg bundle`, it makes
sense to support consuming them with `hg unbundle` too.

We don't need to perform any sanity checking in the command because the
application code in clonebundles.py does this. So the integration here
is pretty simple.
Adrian Buehlmann - Oct. 14, 2015, 11:31 p.m.
On 2015-10-15 00:34, Gregory Szorc wrote:
[..]
> +We can unpack packed1 bundles
> +
> +  $ hg init packed
> +  $ hg -R packed unbundle packed.hg
> +  6 files to transfer, 2.55 KB of data
> +  transferred 2.55 KB in *.* seconds (*/sec) (glob)
> +  (run 'hg heads' to see heads, 'hg merge' to merge)
> +

Potentially stupid idea:

Why not implement a special form of the clone command, which reads from
such a full "bundle" file and creates the repo from that bundle?

   $ hg clone packed.hg packed

I don't know if - alternatively - some new option would be needed. Maybe
like this:

   $ hg clone --unbundle packed.hg packed

> +We can't unpack packed1 bundles on non-empty repos
> +
> +  $ hg -R packed unbundle packed.hg
> +  abort: cannot apply stream clone bundle on non-empty repo
> +  [255]
> +

..which then could not happen any more
Gregory Szorc - Oct. 14, 2015, 11:40 p.m.
On Wed, Oct 14, 2015 at 4:31 PM, Adrian Buehlmann <adrian@cadifra.com>
wrote:

> On 2015-10-15 00:34, Gregory Szorc wrote:
> [..]
> > +We can unpack packed1 bundles
> > +
> > +  $ hg init packed
> > +  $ hg -R packed unbundle packed.hg
> > +  6 files to transfer, 2.55 KB of data
> > +  transferred 2.55 KB in *.* seconds (*/sec) (glob)
> > +  (run 'hg heads' to see heads, 'hg merge' to merge)
> > +
>
> Potentially stupid idea:
>
> Why not implement a special form of the clone command, which reads from
> such a full "bundle" file and creates the repo from that bundle?
>
>    $ hg clone packed.hg packed
>

You can already do this! Although, it doesn't work with this new bundle
type because you need to teach bundlerepo.py about different bundle types
(it's also broken for bundle2 currently).


>
> I don't know if - alternatively - some new option would be needed. Maybe
> like this:
>
>    $ hg clone --unbundle packed.hg packed
>
> > +We can't unpack packed1 bundles on non-empty repos
> > +
> > +  $ hg -R packed unbundle packed.hg
> > +  abort: cannot apply stream clone bundle on non-empty repo
> > +  [255]
> > +
>
> ..which then could not happen any more
>

This touches on the subject of whether streaming clone / packed bundles are
or aren't bundles. I could argue both perspectives. But if they aren't
bundles, then we need to invent new commands and figure out a way to
shoehorn them into the new clone bundles feature. I think I'm fine with
either way: I just care most about getting streaming clone support into the
clone bundles feature otherwise the feature is useless for Mozilla's
automation needs.
Adrian Buehlmann - Oct. 15, 2015, 7:42 a.m.
On 2015-10-15 01:40, Gregory Szorc wrote:
> On Wed, Oct 14, 2015 at 4:31 PM, Adrian Buehlmann <adrian@cadifra.com
> <mailto:adrian@cadifra.com>> wrote:
> 
>     On 2015-10-15 00:34, Gregory Szorc wrote:
>     [..]
>     > +We can unpack packed1 bundles
>     > +
>     > +  $ hg init packed
>     > +  $ hg -R packed unbundle packed.hg
>     > +  6 files to transfer, 2.55 KB of data
>     > +  transferred 2.55 KB in *.* seconds (*/sec) (glob)
>     > +  (run 'hg heads' to see heads, 'hg merge' to merge)
>     > +
> 
>     Potentially stupid idea:
> 
>     Why not implement a special form of the clone command, which reads from
>     such a full "bundle" file and creates the repo from that bundle?
> 
>        $ hg clone packed.hg packed
> 
> 
> You can already do this! Although, it doesn't work with this new bundle
> type because you need to teach bundlerepo.py about different bundle
> types (it's also broken for bundle2 currently).

This sounds a bit like a contradiction to me... ("You can already do
this" vs "Although, it doesn't work...")

If it doesn't work, then it should probably be fixed to work with these
kind of bundles?

These kind of bundles can't be applied to a non-empty repo, so it seems
to me that trying to make the unbundle command work with these is a bit
of a misfit.

The error message in the case below just looked a bit silly to me.. The
next question then was: Why add support for these kind of "bundles" to
the unbundle command, if they can only be used on empty repos? If you
start doing that (adding support for these kind of bundles to the
unbundle command), you will have to support it forever...

>     I don't know if - alternatively - some new option would be needed. Maybe
>     like this:
> 
>        $ hg clone --unbundle packed.hg packed
> 
>     > +We can't unpack packed1 bundles on non-empty repos
>     > +
>     > +  $ hg -R packed unbundle packed.hg
>     > +  abort: cannot apply stream clone bundle on non-empty repo
>     > +  [255]
>     > +
> 
>     ..which then could not happen any more
> 
> 
> This touches on the subject of whether streaming clone / packed bundles
> are or aren't bundles. I could argue both perspectives. But if they
> aren't bundles, then we need to invent new commands and figure out a way
> to shoehorn them into the new clone bundles feature. I think I'm fine
> with either way: I just care most about getting streaming clone support
> into the clone bundles feature otherwise the feature is useless for
> Mozilla's automation needs.

As I understand the matter, these new "full (streamclone) bundles"
aren't really bundles to me. They can never be used like a classic
bundle. What can you do with such a full bundle? The only command that
makes sense is clone: Transform it into a new repo.

But perhaps I'm missing your point here, as I'm mostly sitting on the
fence. I got (unwillingly) involved into the business of streamcloning
because streamcloning ('hg clone --uncompressed') was the preexisting
feature which made the fncache file (.hg/store/fncache) necessary. Which
in turn was used as the name of the new repo format that I helped
introduce quite a while a ago. The only purpose of that repo format was
to limit the length of path names under .hg/store, which was biting us
on Windows. I then got bitten again when Bryan and myself tortured each
other with translating "my" horrible store._hybridencode function into C
code, which he found to be needed, because it turned out to be too slow
for his important use case. Yikes.

Now I see you inventing a new kind of bundles out of this streamclone
stuff. Which is quite interesting.
Augie Fackler - Oct. 15, 2015, 5:01 p.m.
On Wed, Oct 14, 2015 at 03:34:03PM -0700, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1444861540 25200
> #      Wed Oct 14 15:25:40 2015 -0700
> # Node ID c927f2b34d0a420d074152464e854901c746527a
> # Parent  e577b42c94d917eb3fbbe866bf66e455a251bb1b
> commands.unbundle: support consuming streaming clone bundles

Series looks mostly reasonable to me. I do wonder if it'd be equally
good to have a streamgroup part in a bundle2 container, but it's
probably not worth the overhead.

>
> Now that we can produce streaming clone bundles `hg bundle`, it makes
> sense to support consuming them with `hg unbundle` too.
>
> We don't need to perform any sanity checking in the command because the
> application code in clonebundles.py does this. So the integration here
> is pretty simple.
>
> diff --git a/mercurial/commands.py b/mercurial/commands.py
> --- a/mercurial/commands.py
> +++ b/mercurial/commands.py
> @@ -6530,8 +6530,11 @@ def unbundle(ui, repo, fname1, *fnames,
>                          tr.release()
>                  changes = [r.get('return', 0)
>                             for r in op.records['changegroup']]
>                  modheads = changegroup.combineresults(changes)
> +            elif isinstance(gen, streamclone.unbundler):
> +                gen.apply(repo)
> +                modheads = len(repo.heads())
>              else:
>                  modheads = changegroup.addchangegroup(repo, gen, 'unbundle',
>                                                        'bundle:' + fname)
>      finally:
> diff --git a/tests/test-bundle.t b/tests/test-bundle.t
> --- a/tests/test-bundle.t
> +++ b/tests/test-bundle.t
> @@ -277,8 +277,22 @@ packed1 is produced properly
>    0010: 00 00 00 00 0a 30 00 09 72 65 76 6c 6f 67 76 31 |.....0..revlogv1|
>    0020: 00 64 61 74 61 2f 61 64 69 66 66 65 72 65 6e 74 |.data/adifferent|
>    0030: 66 69 6c 65 2e 69 00 31 33 39 0a 00 01 00 01 00 |file.i.139......|
>
> +We can unpack packed1 bundles
> +
> +  $ hg init packed
> +  $ hg -R packed unbundle packed.hg
> +  6 files to transfer, 2.55 KB of data
> +  transferred 2.55 KB in *.* seconds (*/sec) (glob)
> +  (run 'hg heads' to see heads, 'hg merge' to merge)
> +
> +We can't unpack packed1 bundles on non-empty repos
> +
> +  $ hg -R packed unbundle packed.hg
> +  abort: cannot apply stream clone bundle on non-empty repo
> +  [255]
> +
>  Create partial clones
>
>    $ rm -r empty
>    $ hg init empty
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> https://selenic.com/mailman/listinfo/mercurial-devel
Gregory Szorc - Oct. 15, 2015, 5:24 p.m.
On Thu, Oct 15, 2015 at 10:01 AM, Augie Fackler <raf@durin42.com> wrote:

> On Wed, Oct 14, 2015 at 03:34:03PM -0700, Gregory Szorc wrote:
> > # HG changeset patch
> > # User Gregory Szorc <gregory.szorc@gmail.com>
> > # Date 1444861540 25200
> > #      Wed Oct 14 15:25:40 2015 -0700
> > # Node ID c927f2b34d0a420d074152464e854901c746527a
> > # Parent  e577b42c94d917eb3fbbe866bf66e455a251bb1b
> > commands.unbundle: support consuming streaming clone bundles
>
> Series looks mostly reasonable to me. I do wonder if it'd be equally
> good to have a streamgroup part in a bundle2 container, but it's
> probably not worth the overhead.
>

This was originally my intent: add a part to bundle2 and be done with it.
However, bundle2 seems to add a 10-20% processing overhead when you are as
low-level as stream clones. Stream clones are all about minimizing
end-to-end clone times and this overhead wasn't acceptable to me.

I /think/ there is room to optimize bundle2 reading. But it's a bit of
work. Maybe for 3.7.


>
> >
> > Now that we can produce streaming clone bundles `hg bundle`, it makes
> > sense to support consuming them with `hg unbundle` too.
> >
> > We don't need to perform any sanity checking in the command because the
> > application code in clonebundles.py does this. So the integration here
> > is pretty simple.
> >
> > diff --git a/mercurial/commands.py b/mercurial/commands.py
> > --- a/mercurial/commands.py
> > +++ b/mercurial/commands.py
> > @@ -6530,8 +6530,11 @@ def unbundle(ui, repo, fname1, *fnames,
> >                          tr.release()
> >                  changes = [r.get('return', 0)
> >                             for r in op.records['changegroup']]
> >                  modheads = changegroup.combineresults(changes)
> > +            elif isinstance(gen, streamclone.unbundler):
> > +                gen.apply(repo)
> > +                modheads = len(repo.heads())
> >              else:
> >                  modheads = changegroup.addchangegroup(repo, gen,
> 'unbundle',
> >                                                        'bundle:' + fname)
> >      finally:
> > diff --git a/tests/test-bundle.t b/tests/test-bundle.t
> > --- a/tests/test-bundle.t
> > +++ b/tests/test-bundle.t
> > @@ -277,8 +277,22 @@ packed1 is produced properly
> >    0010: 00 00 00 00 0a 30 00 09 72 65 76 6c 6f 67 76 31
> |.....0..revlogv1|
> >    0020: 00 64 61 74 61 2f 61 64 69 66 66 65 72 65 6e 74
> |.data/adifferent|
> >    0030: 66 69 6c 65 2e 69 00 31 33 39 0a 00 01 00 01 00
> |file.i.139......|
> >
> > +We can unpack packed1 bundles
> > +
> > +  $ hg init packed
> > +  $ hg -R packed unbundle packed.hg
> > +  6 files to transfer, 2.55 KB of data
> > +  transferred 2.55 KB in *.* seconds (*/sec) (glob)
> > +  (run 'hg heads' to see heads, 'hg merge' to merge)
> > +
> > +We can't unpack packed1 bundles on non-empty repos
> > +
> > +  $ hg -R packed unbundle packed.hg
> > +  abort: cannot apply stream clone bundle on non-empty repo
> > +  [255]
> > +
> >  Create partial clones
> >
> >    $ rm -r empty
> >    $ hg init empty
> > _______________________________________________
> > Mercurial-devel mailing list
> > Mercurial-devel@selenic.com
> > https://selenic.com/mailman/listinfo/mercurial-devel
>
Gregory Szorc - Oct. 15, 2015, 5:46 p.m.
On Thu, Oct 15, 2015 at 12:42 AM, Adrian Buehlmann <adrian@cadifra.com>
wrote:

> On 2015-10-15 01:40, Gregory Szorc wrote:
> > On Wed, Oct 14, 2015 at 4:31 PM, Adrian Buehlmann <adrian@cadifra.com
> > <mailto:adrian@cadifra.com>> wrote:
> >
> >     On 2015-10-15 00:34, Gregory Szorc wrote:
> >     [..]
> >     > +We can unpack packed1 bundles
> >     > +
> >     > +  $ hg init packed
> >     > +  $ hg -R packed unbundle packed.hg
> >     > +  6 files to transfer, 2.55 KB of data
> >     > +  transferred 2.55 KB in *.* seconds (*/sec) (glob)
> >     > +  (run 'hg heads' to see heads, 'hg merge' to merge)
> >     > +
> >
> >     Potentially stupid idea:
> >
> >     Why not implement a special form of the clone command, which reads
> from
> >     such a full "bundle" file and creates the repo from that bundle?
> >
> >        $ hg clone packed.hg packed
> >
> >
> > You can already do this! Although, it doesn't work with this new bundle
> > type because you need to teach bundlerepo.py about different bundle
> > types (it's also broken for bundle2 currently).
>
> This sounds a bit like a contradiction to me... ("You can already do
> this" vs "Although, it doesn't work...")
>

Sorry. You can `hg clone` from a bundle file. However, that only works for
HG10 bundle types (changegroup 1 data) today: the command will crash if you
attempt to clone from a bundle2 file. It will also crash with stream
clones. bundlerepo.py needs some love to make it work with different bundle
types.


>
> If it doesn't work, then it should probably be fixed to work with these
> kind of bundles?
>
> These kind of bundles can't be applied to a non-empty repo, so it seems
> to me that trying to make the unbundle command work with these is a bit
> of a misfit.
>

I'm sympathetic to this argument.


>
> The error message in the case below just looked a bit silly to me.. The
> next question then was: Why add support for these kind of "bundles" to
> the unbundle command, if they can only be used on empty repos? If you
> start doing that (adding support for these kind of bundles to the
> unbundle command), you will have to support it forever...
>
> >     I don't know if - alternatively - some new option would be needed.
> Maybe
> >     like this:
> >
> >        $ hg clone --unbundle packed.hg packed
> >
> >     > +We can't unpack packed1 bundles on non-empty repos
> >     > +
> >     > +  $ hg -R packed unbundle packed.hg
> >     > +  abort: cannot apply stream clone bundle on non-empty repo
> >     > +  [255]
> >     > +
> >
> >     ..which then could not happen any more
> >
> >
> > This touches on the subject of whether streaming clone / packed bundles
> > are or aren't bundles. I could argue both perspectives. But if they
> > aren't bundles, then we need to invent new commands and figure out a way
> > to shoehorn them into the new clone bundles feature. I think I'm fine
> > with either way: I just care most about getting streaming clone support
> > into the clone bundles feature otherwise the feature is useless for
> > Mozilla's automation needs.
>
> As I understand the matter, these new "full (streamclone) bundles"
> aren't really bundles to me. They can never be used like a classic
> bundle. What can you do with such a full bundle? The only command that
> makes sense is clone: Transform it into a new repo.
>
> But perhaps I'm missing your point here, as I'm mostly sitting on the
> fence. I got (unwillingly) involved into the business of streamcloning
> because streamcloning ('hg clone --uncompressed') was the preexisting
> feature which made the fncache file (.hg/store/fncache) necessary. Which
> in turn was used as the name of the new repo format that I helped
> introduce quite a while a ago. The only purpose of that repo format was
> to limit the length of path names under .hg/store, which was biting us
> on Windows. I then got bitten again when Bryan and myself tortured each
> other with translating "my" horrible store._hybridencode function into C
> code, which he found to be needed, because it turned out to be too slow
> for his important use case. Yikes.
>
> Now I see you inventing a new kind of bundles out of this streamclone
> stuff. Which is quite interesting.
>

Stream cloning to me is all about raw clone speed: the client can
effectively stream bits from the network to the filesystem. No relatively
expensive changegroup reading, delta application, zlib decompression, etc:
just the equivalent of `curl | tar`.

Mozilla wrote an extension that can download a pre-built stream clone file
and apply it. We're hosting these files in S3 and in EC2 we can clone at
50+ MB/s. The result is clone times that are 3-5x faster than traditional,
changegroup-based clones.

My end goal is to introduce a "packed revlog" data format that is
essentially an index + raw revlog data. This is *very* similar to streaming
clones bundles of today except there is a file index at the beginning so
you can quickly find revlog content for a specific file. I would like to
implement a "packed revlog repo" class that can read repositories directly
from this single file. When this is implemented, a streaming clone would
write a single, large "packed revlog" to disk (as opposed to 1 file per
revlog). You would be able to clone as fast as your network and filesystem
allowed. (Writing 100,000 files is slow, no matter how small.) You would be
able to clone a 1 GB repo in 10s over a 1 Gbps network.
https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/d6aeb3b8b512
contains the "stream clone v2" part of that work.

My immediate goal is to support stream clones over the new clone bundles
feature so core has feature parity with what Mozilla implemented. Honestly,
I'd be fine with not supporting unbundling of stream clone "bundles"
through any other means. As long as a server admin can produce them and
clone bundles can consume them, I'm happy. Considering the limitations of
these "bundles" with `hg unbundle`, perhaps it makes sense to go that
route...

Patch

diff --git a/mercurial/commands.py b/mercurial/commands.py
--- a/mercurial/commands.py
+++ b/mercurial/commands.py
@@ -6530,8 +6530,11 @@  def unbundle(ui, repo, fname1, *fnames, 
                         tr.release()
                 changes = [r.get('return', 0)
                            for r in op.records['changegroup']]
                 modheads = changegroup.combineresults(changes)
+            elif isinstance(gen, streamclone.unbundler):
+                gen.apply(repo)
+                modheads = len(repo.heads())
             else:
                 modheads = changegroup.addchangegroup(repo, gen, 'unbundle',
                                                       'bundle:' + fname)
     finally:
diff --git a/tests/test-bundle.t b/tests/test-bundle.t
--- a/tests/test-bundle.t
+++ b/tests/test-bundle.t
@@ -277,8 +277,22 @@  packed1 is produced properly
   0010: 00 00 00 00 0a 30 00 09 72 65 76 6c 6f 67 76 31 |.....0..revlogv1|
   0020: 00 64 61 74 61 2f 61 64 69 66 66 65 72 65 6e 74 |.data/adifferent|
   0030: 66 69 6c 65 2e 69 00 31 33 39 0a 00 01 00 01 00 |file.i.139......|
 
+We can unpack packed1 bundles
+
+  $ hg init packed
+  $ hg -R packed unbundle packed.hg
+  6 files to transfer, 2.55 KB of data
+  transferred 2.55 KB in *.* seconds (*/sec) (glob)
+  (run 'hg heads' to see heads, 'hg merge' to merge)
+
+We can't unpack packed1 bundles on non-empty repos
+
+  $ hg -R packed unbundle packed.hg
+  abort: cannot apply stream clone bundle on non-empty repo
+  [255]
+
 Create partial clones
 
   $ rm -r empty
   $ hg init empty