Patchwork D1856: wireproto: server-side support for pullbundles

Submitter phabricator
Date Jan. 12, 2018, 5:32 p.m.
Message ID <differential-rev-PHID-DREV-hfnrwtu7rzlydfjp7wcr-req@phab.mercurial-scm.org>
Permalink /patch/26706/
State Superseded

Comments

phabricator - Jan. 12, 2018, 5:32 p.m.
joerg.sonnenberger created this revision.
Herald added a subscriber: mercurial-devel.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  Pullbundles are similar to clonebundles, but served as normal inline
  bundle streams. They are almost transparent to the client -- the only
  visible effect is that the client might get fewer changes than it
  asked for, i.e. not all requested head revisions are provided. The
  missing client-side logic is to retry the pull in that case.
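
The retry logic described in the summary can be sketched as follows. This is a hypothetical illustration, not actual Mercurial client code: `pull_until_complete` and `remote_getbundle` are stand-ins for the real wire protocol machinery, and heads are modeled as plain set elements.

```python
# Hypothetical sketch of the missing client-side retry logic: keep pulling
# until the server has provided every requested head. `remote_getbundle`
# is a stand-in callable returning the set of heads covered by one reply.

def pull_until_complete(remote_getbundle, wanted_heads):
    """Repeat partial pulls until all wanted heads have been received."""
    have = set()
    rounds = 0
    while not set(wanted_heads) <= have:
        # Each round the server may answer with a pre-generated bundle
        # covering only a subset of the requested heads.
        received = remote_getbundle(heads=wanted_heads, common=have)
        if not received - have:
            raise RuntimeError('server made no progress')
        have |= received
        rounds += 1
    return have, rounds
```

The progress check guards against a misconfigured server that keeps replying with already-known heads, which would otherwise loop forever.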

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1856

AFFECTED FILES
  mercurial/configitems.py
  mercurial/help/config.txt
  mercurial/wireproto.py
  tests/test-pull-r.t

To: joerg.sonnenberger, #hg-reviewers
Cc: mercurial-devel
phabricator - Jan. 14, 2018, 4:06 p.m.
joerg.sonnenberger added a comment.


  For my test case (a bundle of all changes in the NetBSD repo before 2014, yearly bundles afterwards until 2018/1/1, and a normal pull for the rest), find_pullbundle needs less than 0.5s of CPU time in this iteration when it matches.
  After the initial clone and with additional available changes, pull time is:
  
  With pullbundles.manifest: 0.42s
  Without pullbundles.manifest: 0.41s
  
  i.e. the difference is within the noise level. Further benchmarks by others would be appreciated.
  
  The biggest remaining question for me is whether we want to introduce a capability for the client to mark that it is willing to do multiple pull rounds.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1856

To: joerg.sonnenberger, #hg-reviewers
Cc: mercurial-devel
phabricator - Jan. 14, 2018, 10:43 p.m.
indygreg added subscribers: durin42, indygreg.
indygreg requested changes to this revision.
indygreg added a comment.
This revision now requires changes to proceed.


  This patch needs a lot of work. But I'm very supportive of the feature and the preliminary implementation!
  
  I wish we could find a way to send multiple, inline, pre-generated bundles in one response. However, the existing design of compression within the bundle2 format doesn't easily facilitate this. We should think long and hard about whether to implement this feature as //partial pull// or extend bundle2 to allow us to do nice things and avoid the extra client-server roundtrips.
  
  Speaking of compression, we probably need some more code around e.g. the mismatch between on-disk compression and wire compression. There is a potential for local bundles to be stored uncompressed and for the server to compress this data as part of streaming it out. That involves CPU. We still come out ahead because it avoids having to generate the bundle content from revlogs. But it isn't as ideal as streaming an already compressed bundle over the wire. We may want to open the bundle2 files and look at their compression info and make sure we don't do stupid things. It should be fast to parse the bundle2 header to obtain this knowledge.
  
  My biggest concern with the architecture of this feature is the multiple roundtrips. I **really** wish we could stream multiple bundles off disk to the wire with no decompression/compression involved. That would mean storing compressed bundles on disk. But this would require some additional bundle2 magic. The existing solution is simple and elegant. I do like that. I'd very much like to get the opinion of someone like @durin42 (who also likes designing protocols).

INLINE COMMENTS

> wireproto.py:834
>  
> +def find_pullbundle(repo, opts, clheads, heads, common):
> +    def decodehexstring(s):

This function needs a docstring. And the expected behavior of this function needs to be documented in the docstring and/or inline as comments.

> wireproto.py:836
> +    def decodehexstring(s):
> +        return set([h.decode('hex') for h in s.split(':')])
> +

`:` as a delimiter for nodes doesn't feel right because `:` has meaning for revsets. I would use `,` or `;`.

That being said, it may make the manifest format slightly less human friendly because URI encoding may come into play. That should be fine though: we already escape values for `BUNDLESPEC` when using `packed1` bundles.
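
For illustration, parsing a manifest line with `,` as the node delimiter might look like this. The helper name and the returned dict layout are hypothetical; only the `heads=`/`base=` attribute names and the path-first line shape come from the patch, and `binascii.unhexlify` replaces the Python 2-only `str.decode('hex')`.

```python
# Hypothetical parser for one pullbundle manifest line using `,` as the
# node delimiter, e.g. "0.hg heads=aabb,ccdd base=eeff".
import binascii

def parse_manifest_line(line):
    """Split '<path> key=val ...' and decode comma-separated hex nodes."""
    fields = line.split()
    entry = {'URL': fields[0]}  # first field is the bundle path
    for field in fields[1:]:
        key, _, value = field.partition('=')
        if key in ('heads', 'base'):
            entry[key] = {binascii.unhexlify(h) for h in value.split(',')}
        else:
            entry[key] = value
    return entry
```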

> wireproto.py:838
> +
> +    manifest = repo.vfs.tryread('pullbundles.manifest')
> +    res = exchange.parseclonebundlesmanifest(repo, manifest)

I think there should be an `if not manifest: return` here to handle the common case.

> wireproto.py:848
> +    for entry in res:
> +        if 'heads' in entry:
> +            try:

We'll need documentation of the fields in the manifest somewhere. If we combine this feature with the `clonebundles.py` extension, that seems like the logical place to document things.

> wireproto.py:854-862
> +            if len(bundle_heads) > len(heads):
> +                # Client wants less heads than the bundle contains
> +                continue
> +            if bundle_heads.issubset(common):
> +                continue # Nothing new
> +            if all(cl.rev(rev) in common_anc for rev in bundle_heads):
> +                continue # Still nothing new

I worry about performance issues here. You were making some noise about this on IRC. So it sounds like it is on your radar.

I'm just not sure how to make things more efficient. I just know that doing so many DAG tests feels like it will lead to performance problems for repos with hundreds of thousands or millions of changesets.

But we shouldn't let performance issues derail this feature. Pull bundles have the potential to offload tens of seconds of CPU time from the server. So even if DAG operations consume a few seconds of CPU, we come out ahead. It would be nice to get that down to 100's of milliseconds at most though. But this feels like the territory of follow-up work, possibly involving caching or more efficient stores of which revisions are in which bundles.
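
The skip conditions under discussion can be shown in miniature with plain sets over a toy parents-map DAG. This is only meant to make the logic concrete, not the performance characteristics: real Mercurial uses revlog ancestor computations, and the helper names here are made up.

```python
# Toy illustration of the three "skip this bundle" tests from the patch,
# over a parents-map DAG. Not representative of revlog performance.

def ancestors(parents, nodes):
    """All ancestors of `nodes` (inclusive) in a parents-map DAG."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, ()))
    return seen

def bundle_is_useful(parents, bundle_heads, req_heads, common):
    if len(bundle_heads) > len(req_heads):
        return False  # client wants fewer heads than the bundle contains
    if bundle_heads <= common:
        return False  # nothing new
    if bundle_heads <= ancestors(parents, common):
        return False  # still nothing new
    return True
```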

> wireproto.py:874-877
> +        path = entry['URL']
> +        repo.ui.debug('sending pullbundle "%s"\n' % path)
> +        try:
> +            return repo.vfs.open(path)

Using a `URL` field for a vfs path (which doesn't have nor handle protocol schemes, e.g. `file://`, IIRC) feels wrong. I think the entry in the manifest should be `PATH`.

> wireproto.py:918
> +
> +        if repo.ui.configbool('server', 'pullbundle') and
> +           self.capable('partial-pull'):

I'd consider enabling this feature by default. Checking for the presence of a `pullbundles.manifest` should be a cheap operation. Especially when you consider that serving a bundle will open likely dozens of revlog files.

Another idea to consider is to tie this feature into the `clonebundles.py` extension. We already have extensive documentation in that extension for offloading server load via pre-generated bundles. I view this feature as a close sibling. I think they could live under the same umbrella. But because "pull bundles" are part of the `getbundle` wire protocol command, the bulk of the server code needs to remain in core.

> wireproto.py:919
> +        if repo.ui.configbool('server', 'pullbundle') and
> +           self.capable('partial-pull'):
> +            # Check if a pre-built bundle covers this request.

AFAICT this capability isn't defined anywhere. Is there a missing patch in this series?

I do agree with the use of a `partial-pull` capability rather than exposing //pull bundles// functionality in the capabilities layer. Pull bundles should be purely a server-side implementation detail. It's //partial pull// that is the generic feature that should be exposed as part of the wire protocol.

> wireproto.py:924
> +                return streamres(gen=util.filechunkiter(bundle),
> +                                 prefer_uncompressed=True)
> +

`prefer_uncompressed` doesn't exist.

I know you were talking about compression behavior on IRC. Please do invent whatever APIs you need to communicate to the wire protocol layer that you want to stream data with whatever compression behavior that you need.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1856

To: joerg.sonnenberger, #hg-reviewers, indygreg
Cc: indygreg, durin42, mercurial-devel
phabricator - Jan. 14, 2018, 10:55 p.m.
indygreg added a comment.


  Another idea to consider is storing changegroup data or bundle2 parts on disk instead of full bundles. Then, we could stream multiple changegroup parts into a larger bundle2 payload.
  
  That does mean we'd have to compress outgoing data over the wire. So not ideal if minimizing CPU usage is your goal. But we still come out ahead by not having to generate the changegroup data.
  
  This would require a mechanism to generate files of that format, however. Not a huge amount of work. But still work.
  
  I'm just not sure what the sweet spot for this feature is. Is it enough to eliminate changegroup generation overhead? Or are we striving for ~0 CPU usage to service `hg pull` operations (like what clone bundles achieve)?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1856

To: joerg.sonnenberger, #hg-reviewers, indygreg
Cc: indygreg, durin42, mercurial-devel
phabricator - Jan. 14, 2018, 11:29 p.m.
joerg.sonnenberger added a comment.


  > I wish we could find a way to send multiple, inline, pre-generated bundles in one response. However, the existing design of compression within the bundle2 format doesn't easily facilitate this. We should think long and hard about whether to implement this feature as partial pull or extend bundle2 to allow us to do nice things and avoid the extra client-server roundtrips.
  
  I was looking at both parts initially. The existing bundle2 compression doesn't work very well for this purpose as it doesn't allow switching compression in the middle. It might be possible to hack something together with a flag to `hg bundle` to skip the final zero record and depend on the ability of all currently supported compression engines to decode concatenated streams, but that feels like a crude hack. Doing extra client-server roundtrips is a bit annoying and costly, but it also makes the functionality a bit more forgiving of mistakes in the bundle specification. Unlike clonebundles, it requires the records to be much more precise, and we certainly don't want to parse the bundles themselves over and over again. It might be nice for some future bundle version to allow easy access to the roots and leaves of the DAG.
  
  > Speaking of compression, we probably need some more code around e.g. the mismatch between on-disk compression and wire compression. There is a potential for local bundles to be stored uncompressed and for the server to compress this data as part of streaming it out. That involves CPU. We still come out ahead because it avoids having to generate the bundle content from revlogs. But it isn't as ideal as streaming an already compressed bundle over the wire. We may want to open the bundle2 files and look at their compression info and make sure we don't do stupid things. It should be fast to parse the bundle2 header to obtain this knowledge.
  
  The bundle will be sent without wire compression if supported. A v1 HTTP client will still get the basic zlib framing, but that's not really avoidable. Otherwise, BUNDLESPEC is supposed to make sure that the client knows how to deal with the bundle. The idea is explicitly that the administrator has a free choice over the compression format and the data is sent as-is. Only crypto overhead should apply.
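
As an aside on the concatenated-stream behavior mentioned above: zlib, at least, does support this, though one-shot `zlib.decompress()` stops at the first stream. Recovering every stream requires looping a `decompressobj` over its `unused_data`, as this small sketch shows (the helper name is made up; the stdlib behavior is real).

```python
# zlib.decompress() only reads the first stream in a concatenation;
# looping a decompressobj over `unused_data` recovers all of them.
import zlib

def decompress_concat(data):
    """Decompress back-to-back zlib streams into one byte string."""
    out = []
    while data:
        d = zlib.decompressobj()
        out.append(d.decompress(data))
        out.append(d.flush())
        data = d.unused_data  # bytes after the end of this stream
    return b''.join(out)
```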

INLINE COMMENTS

> indygreg wrote in wireproto.py:854-862
> I worry about performance issues here. You were making some noise about this on IRC. So it sounds like it is on your radar.
> 
> I'm just not sure how to make things more efficient. I just know that doing so many DAG tests feels like it will lead to performance problems for repos with hundreds of thousands or millions of changesets.
> 
> But we shouldn't let performance issues derail this feature. Pull bundles have the potential to offload tens of seconds of CPU time from the server. So even if DAG operations consume a few seconds of CPU, we come out ahead. It would be nice to get that down to 100's of milliseconds at most though. But this feels like the territory of follow-up work, possibly involving caching or more efficient stores of which revisions are in which bundles.

As I wrote in the comments on the client side, it seems to be OK now. The heavy lifting is the initial computation of the ancestor sets outside the loop, which is in the worst case an enumeration of all local revisions. That by itself doesn't seem to result in a big performance difference. The rest of the loop depends mostly on the number of roots and leaves of the bundle DAGs. My test case had a total of ~800 and the impact on normal pull operations was in the noise.

> indygreg wrote in wireproto.py:874-877
> Using a `URL` field for a vfs path (which doesn't have nor handle protocol schemes - e.g. `file://`` IIRC) feels wrong. I think the entry in the manifest should be `PATH`.

The URL field is synthesized by `parseclonebundlesmanifest` for the first entry. It isn't tagged in the actual file. As long as the format of the files is otherwise the same, I'd just keep it identical.

> indygreg wrote in wireproto.py:918
> I'd consider enabling this feature by default. Checking for the presence of a `pullbundles.manifest` should be a cheap operation. Especially when you consider that serving a bundle will open likely dozens of revlog files.
> 
> Another idea to consider is to tie this feature into the `clonebundles.py` extension. We already have extensive documentation in that extension for offloading server load via pre-generated bundles. I view this feature as a close sibling. I think they could live under the same umbrella. But because "pull bundles" are part of the `getbundle` wire protocol command, the bulk of the server code needs to remain in core.

The bigger question is whether it should have an associated bundle or protocol capability. The client needs to be able to deal with the partial replies. If we go that way, I agree that it can and should be enabled by default.

The `clonebundles` extension is mostly for show nowadays, the interesting stuff is all in core.

> indygreg wrote in wireproto.py:924
> `prefer_uncompressed` doesn't exist.
> 
> I know you were talking about compression behavior on IRC. Please do invent whatever APIs you need to communicate to the wire protocol layer that you want to stream data with whatever compression behavior that you need.

Sorry, this was a rebase leak.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1856

To: joerg.sonnenberger, #hg-reviewers, indygreg
Cc: indygreg, durin42, mercurial-devel
phabricator - Jan. 15, 2018, 4:02 p.m.
durin42 added a comment.


  In https://phab.mercurial-scm.org/D1856#31459, @indygreg wrote:
  
  > My biggest concern with the architecture of this feature is the multiple roundtrips. I **really** wish we could stream multiple bundles off disk to the wire with no decompression/compression involved. That would mean storing compressed bundles on disk. But this would require some additional bundle2 magic. The existing solution is simple and elegant. I do like that. I'd very much like to get the opinion of someone like @durin42 (who also likes designing protocols).
  
  
  I don't disagree with this goal (fewer roundtrips), but I think doing it with multiple roundtrips is probably better than nothing.
  
  I'm too rusty on bundle2 at the moment to grok what magic would be required to pre-compress payloads.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1856

To: joerg.sonnenberger, #hg-reviewers, indygreg
Cc: indygreg, durin42, mercurial-devel
phabricator - Jan. 15, 2018, 10:33 p.m.
indygreg added a comment.


  In https://phab.mercurial-scm.org/D1856#31522, @durin42 wrote:
  
  > I'm too rusty on bundle2 at the moment to grok what magic would be required to pre-compress payloads.
  
  
  The ideal solution would be a way to //reset// the context for the byte stream. Essentially we'd add a marker telling consumers they've reached EOF of either a bundle2 stream or a compression context. The next byte should be interpreted as a new bundle2 stream or a new compression context.
  
  In the case of zstandard, we can force the end of a zstandard frame and start sending a new frame. As long as the consumer reads all frames as one coherent stream, the output of the decompressor will look as if nothing special happened. I'm not actually sure if python-zstandard supports this though. But it could. For other compression formats, the answer isn't as clear cut.
  
  It's probably best to have an explicit marker in the bundle2 or protocol stream identifying the end of a compression context. Either a "reset at end of part" flag in the part header. Or an explicit bundle2 part that says "reset after me." Either way, we'd need to invent an API on the compression engine that allows the compression context to be //reset//. This code could get gnarly, as there are various buffers in play. It's possible. But a bit of work. Since we're at code freeze and it looks like this feature will have to wait for 4.6, it looks like we'll have time to implement this feature if we want to go that route. I don't like scope bloating though.
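
The reset-marker idea can be sketched, independent of bundle2, with simple length-prefixed framing: each payload is compressed in its own context, so the consumer knows exactly where one compression context ends and the next begins. This is NOT the bundle2 format, just an illustration of the framing concept; all names here are hypothetical.

```python
# Sketch of "reset compression context" via length-prefixed frames.
# Each payload gets its own zlib context; the 4-byte big-endian length
# prefix acts as the explicit reset marker for the consumer.
import struct
import zlib

def frame_payloads(payloads):
    """Compress each payload independently and length-prefix it."""
    out = b''
    for p in payloads:
        c = zlib.compress(p)
        out += struct.pack('>I', len(c)) + c
    return out

def read_frames(stream):
    """Yield decompressed payloads, resetting context at each frame."""
    pos = 0
    while pos < len(stream):
        (size,) = struct.unpack_from('>I', stream, pos)
        pos += 4
        yield zlib.decompress(stream[pos:pos + size])
        pos += size
```

A server could store frames pre-compressed and copy them to the wire untouched, only spinning up a compressor for dynamically generated trailing data.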

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1856

To: joerg.sonnenberger, #hg-reviewers, indygreg
Cc: indygreg, durin42, mercurial-devel
phabricator - Jan. 15, 2018, 10:39 p.m.
indygreg added a comment.


  In https://phab.mercurial-scm.org/D1856#31536, @indygreg wrote:
  
  > In https://phab.mercurial-scm.org/D1856#31522, @durin42 wrote:
  >
  > > I'm too rusty on bundle2 at the moment to grok what magic would be required to pre-compress payloads.
  >
  >
  > The ideal solution would be a way to //reset// the context for the byte stream. Essentially we'd add a marker telling consumers they've reached EOF of either a bundle2 stream or a compression context. The next byte should be interpreted as a new bundle2 stream or a new compression context.
  
  
  I should add that if clients automatically consumed all bundle2 payloads sent by the server and the compression context was automatically reset after end of bundle2 payload, then the easiest path forward is likely having the client advertise that it can receive multiple bundle2 payloads and then we implement this feature in terms of sending multiple bundle2 payloads. As long as the compression context is flushed, we should be OK. But we model the response stream as a single stream and feed that whole thing into a compressor/decompressor. We'd need to hack up the compression code a bit no matter what path we use.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1856

To: joerg.sonnenberger, #hg-reviewers, indygreg
Cc: indygreg, durin42, mercurial-devel
phabricator - Jan. 17, 2018, 7:23 a.m.
indygreg requested changes to this revision.
indygreg added a comment.
This revision now requires changes to proceed.


  This is looking **much** better!
  
  There are some places where the code doesn't abide by our (IMO lousy) mergeallthewords naming convention.
  
  More importantly, I think there are some bugs around hidden changesets. We definitely need some test coverage involving hidden changesets. We should also have some basic tests for phases and bookmarks transfer. We want to be sure that all that data makes it across the wire and mirrors the remote's state. Specifically, I //think// there are still some bundle2 parts we only emit over the wire and not as part of local bundles generated via `hg bundle`. If so, then a pull serviced completely by pullbundles may not transfer all metadata. I'm not 100% sure about this. That's what tests will uncover. One thing we could do is ensure the client always sends a non-partial `getbundle` request to ensure the Mercurial server itself sends over any non-changegroup data. That shouldn't be necessary. And it complicates the wire protocol to communicate if the reply came from a pre-generated bundle. I'm not a huge fan of this... mostly thinking aloud here.
  
  Something else to think about (which is an outstanding problem with clone bundles) is generating the manifests. It is very much an exercise left to the reader. That kinda sucks. At some point we need to make `hg bundle` emit manifest lines or something. And the //bundlespec// format needs to be taught about individual bundle2 parts. We're still missing a way to define which parts/data to put in the generated bundle via `hg bundle` invocation (e.g. should we include obsolescence or phases parts) and how to express the presence of these parts in the //bundlespec//. I think we can implement this feature without solving those problems. But if `hg bundle` is producing bundles with required part processing, we won't be. (I'm not sure what `hg bundle` is doing these days.)
  
  Writing backwards and future compatible data storage and exchange protocols is hard...

INLINE COMMENTS

> clonebundles.py:61
> +4. If there is at least one matching bundle, the server sends it to the client.
> +5. The client applies and bundle and notices that the server reply was
> +   incomplete. It initiates another pull.

Nit: "the" instead of "and"

> exchange.py:1383-1385
> +                new_heads = headsofdiff(unficl.heads(), old_heads)
> +                pullop.common = headsofunion(new_heads, pullop.common)
> +                pullop.rheads = set(pullop.rheads) - pullop.common

I think there are some subtle bugs here around the use of unfiltered heads creeping into the wire protocol because `old_heads` can contain hidden changesets. The intent behind this code seems sound. But I think it needs to respect node visibility more strictly.

> wireproto.py:856
> +    res = exchange.parseclonebundlesmanifest(repo, manifest)
> +    res = exchange.filterclonebundleentries(repo, res)
> +    if not res:

This filtering takes the current process's capabilities into consideration. Since this code runs on the server, it is filtering based on the server's abilities, not the client's. That's bad because it means the server can send a bundle that the client can't read.

There will need to be a bit of code written to pass the client's capabilities into this filtering. That will likely require a significant amount of plumbing in `exchange.py`.

It /might/ be easier to take a different approach completely: choosing the same architecture as clonebundles and having the client download the manifest and fetching the relevant bundles itself. If you squint hard enough, this seems possible. The hard part is discovery. The server could potentially order the bundles based on DAG relationship / define dependencies between bundles based on DAG relationship. I //think// you've thought about this design already. But just want to check.

> test-pull-r.t:149
> +
> +Test pullbundle functionality
> +

I'd separate this into its own test file, since it is highly domain specific.

> test-pull-r.t:167-168
> +  > EOF
> +  $ hg serve --debug -p $HGPORT2 --pid-file=../repo.pid > ../repo-server.txt 2>&1 &
> +  $ while ! grep listening ../repo-server.txt > /dev/null; do sleep 1; done
> +  $ cat ../repo.pid >> $DAEMON_PIDS

I don't think I've seen this pattern in our .t tests before. I think you want to use `hg serve -d --accesslog <path> --errorlog <path>` like other tests do. There are no race conditions in daemon mode AFAIK, otherwise many tests would fall into this trap.

> test-pull-r.t:266
> +  added 0 changesets with 0 changes to 1 files
> +  abort: 00changelog.i@ed1b79f46b9a: no node!
> +  [255]

That's a pretty bad error message for something the user has no control over (a misconfigured server).

First, the error message should be better. Something like "the server didn't send expected data."

Second, the error message should offer the user a way to disable the feature and fall back to a traditional clone/pull. The clonebundles error messages do this. Read through `test-clonebundles.t`.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1856

To: joerg.sonnenberger, #hg-reviewers, indygreg
Cc: indygreg, durin42, mercurial-devel
phabricator - Jan. 17, 2018, 2:15 p.m.
joerg.sonnenberger marked an inline comment as done.
joerg.sonnenberger added a comment.


  In https://phab.mercurial-scm.org/D1856#31624, @indygreg wrote:
  
  > This is looking **much** better!
  >
  > There are some places where the code doesn't abide by our (IMO lousy) mergeallthewords naming convention.
  
  
  I can rename the intermediate variables, if that is desirable.
  
  > More importantly, I think there are some bugs around hidden changesets. We definitely need some test coverage involving hidden changesets.
  
  There is some test coverage from the existing obsolete test cases.
  
  > We should also have some basic tests for phases and bookmarks transfer. We want to be sure that all that data makes it across the wire and mirrors the remote's state. Specifically, I //think// there are still some bundle2 parts we only emit over the wire and not as part of local bundles generated via `hg bundle`. If so, then a pull serviced completely by pullbundles may not transfer all metadata. I'm not 100% sure about this. That's what tests will uncover. One thing we could do is ensure the client always sends a non-partial `getbundle` request to ensure the Mercurial server itself sends over any non-changegroup data. That shouldn't be necessary. And it complicates the wire protocol to communicate if the reply came from a pre-generated bundle. I'm not a huge fan of this... mostly thinking aloud here.
  
  This would seem to be a bug in bundles and quite unrelated to this functionality. It would affect clonebundles as well. I'm somewhat reluctant to increase this patch set even more.
  
  > Something else to think about (which is an outstanding problem with clone bundles) is generating the manifests. It is very much an exercise left to the reader. That kinda sucks. At some point we need to make `hg bundle` emit manifest lines or something. And the //bundlespec// format needs to be taught about individual bundle2 parts. We're still missing a way to define which parts/data to put in the generated bundle via `hg bundle` invocation (e.g. should we include obsolescence or phases parts) and how to express the presence of these parts in the //bundlespec//. I think we can implement this feature without solving those problems. But if `hg bundle` is producing bundles with required part processing, we won't be. (I'm not sure what `hg bundle` is doing these days.)
  
  I'm preparing a set of helper scripts separately, and there were some discussions about stable history partitioning that could also help. The goal still is to make this fully transparent to the client beyond the need for fetching multiple bundles.

INLINE COMMENTS

> indygreg wrote in exchange.py:1383-1385
> I think there are some subtle bugs here around the use of unfiltered heads creeping into the wire protocol because `old_heads` can contained hidden changesets. The intent behind this code seems sound. But I think it needs to respect node visibility more strictly.

I expect the original discovery to handle the visibility question. The updates here deal strictly with the results of the exchange, i.e. any new head is a result of the pulled bundle and as such visible from remote. Same for the reduction of rheads. The unfiltered access is necessary, as seen from the obs test cases.

> indygreg wrote in wireproto.py:856
> This filtering takes the current process's capabilities into consideration. Since this code runs on the server, it is filtering based on the server's abilities, not the client's. That's bad because it means the server can send a bundle that the client can't read.
> 
> There will need to be a bit of code written to pass the client's capabilities into this filtering. That will likely require a significant amount of plumbing in `exchange.py`.
> 
> It /might/ be easier to take a different approach completely: choosing the same architecture as clonebundles and having the client download the manifest and fetching the relevant bundles itself. If you squint hard enough, this seems possible. The hard part is discovery. The server could potentially order the bundles based on DAG relationship / define dependencies between bundles based on DAG relationship. I //think// you've thought about this design already. But just want to check.

I have to double-check the feature checks. The client has notified the server of the supported bundle formats at this point, so it shouldn't be a problem. I'll look into a test case.

I don't think the client is in a reasonable position to make this decision as it doesn't have the actual DAG at hand. In principle, the server could at a later iteration also estimate the bundle size or similar aspects, and that's something the client fundamentally can't do.

> indygreg wrote in test-pull-r.t:266
> That's a pretty bad error message for something the user has no control over (a misconfigured server).
> 
> First, the error message should be better. Something like "the server didn't send expected data."
> 
> Second, the error message should offer the user a way to disable the feature and fall back to a traditional clone/pull. The clonebundles error messages do this. Read through `test-clonebundles.t`.

I didn't add the error message; it existed already. It's also specific to the update case. I'm somewhat reluctant to expose pullbundles any further to the user, as it is intended to be even more transparent than clonebundles.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1856

To: joerg.sonnenberger, #hg-reviewers, indygreg
Cc: indygreg, durin42, mercurial-devel
phabricator - Jan. 19, 2018, 9 p.m.
durin42 added a comment.


  I like where this is headed, but we'll have to plan to land this early in the next cycle, as I'm too chicken to land a feature with protocol implications on the last day of work before we freeze.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1856

To: joerg.sonnenberger, #hg-reviewers, indygreg
Cc: indygreg, durin42, mercurial-devel
phabricator - Feb. 4, 2018, 8:42 p.m.
indygreg added a comment.


  @joerg.sonnenberger: if you are willing to wait just a week or two more, I think the next version of the wire protocol that I'm writing will make the "pullbundles" feature significantly better. Specifically, servers will be able to send multiple bundle2 bundles or parts with independent compression settings. In other words, a pullbundles-aware server-side handler could "stream" pre-generated bundles from disk and then dynamically emit the data not found in any pre-generated bundle, applying compression for just the dynamic bit.
  
  I don't wish to create stop energy and discourage you from working on this feature. But at the same time, I'm reluctant to add new features to the existing wire protocol because the new mechanism will be much, much better and will make features like pullbundles even better.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1856

To: joerg.sonnenberger, #hg-reviewers, indygreg
Cc: indygreg, durin42, mercurial-devel
phabricator - April 7, 2018, 11:18 p.m.
joerg.sonnenberger added a comment.


  Updated to the current tree. One open question is the interaction with the narrow extension. Running the narrow tests shows an additional round trip.
  
  The discovery phase currently doesn't know about the narrowspec, so all heads are discovered by the client, but the subsequent narrowbundle only covers the matching heads. This in turn triggers the partial-pull logic from this change, i.e. the client assumes that the server sent a partial reply and asks again. IMO this is a design flaw in the narrow extension and should be addressed properly on that side.


Patch

diff --git a/tests/test-pull-r.t b/tests/test-pull-r.t
--- a/tests/test-pull-r.t
+++ b/tests/test-pull-r.t
@@ -145,3 +145,59 @@ 
 
   $ cd ..
   $ killdaemons.py
+
+Test pullbundle functionality
+
+  $ cd repo
+  $ cat <<EOF > .hg/hgrc
+  > [server]
+  > pullbundle = True
+  > EOF
+  $ hg bundle --base null -r 0 .hg/0.hg
+  1 changesets found
+  $ hg bundle --base 0 -r 1 .hg/1.hg
+  1 changesets found
+  $ hg bundle --base 1 -r 2 .hg/2.hg
+  1 changesets found
+  $ cat <<EOF > .hg/pullbundles.manifest
+  > 2.hg heads=effea6de0384e684f44435651cb7bd70b8735bd4 base=bbd179dfa0a71671c253b3ae0aa1513b60d199fa
+  > 1.hg heads=ed1b79f46b9a29f5a6efa59cf12fcfca43bead5a base=bbd179dfa0a71671c253b3ae0aa1513b60d199fa
+  > 0.hg heads=bbd179dfa0a71671c253b3ae0aa1513b60d199fa
+  > EOF
+  $ hg serve --debug -p $HGPORT2 --pid-file=../repo.pid > ../repo-server.txt 2>&1 &
+  $ while ! grep listening ../repo-server.txt > /dev/null; do sleep 1; done
+  $ cat ../repo.pid >> $DAEMON_PIDS
+  $ cd ..
+  $ hg clone -r 0 http://localhost:$HGPORT2/ repo.pullbundle
+  adding changesets
+  adding manifests
+  adding file changes
+  added 1 changesets with 1 changes to 1 files
+  new changesets bbd179dfa0a7
+  updating to branch default
+  1 files updated, 0 files merged, 0 files removed, 0 files unresolved
+  $ cd repo.pullbundle
+  $ hg pull -r 1
+  pulling from http://localhost:$HGPORT2/
+  searching for changes
+  adding changesets
+  adding manifests
+  adding file changes
+  added 1 changesets with 1 changes to 1 files
+  new changesets ed1b79f46b9a
+  (run 'hg update' to get a working copy)
+  $ hg pull -r 2
+  pulling from http://localhost:$HGPORT2/
+  searching for changes
+  adding changesets
+  adding manifests
+  adding file changes
+  added 1 changesets with 1 changes to 1 files (+1 heads)
+  new changesets effea6de0384
+  (run 'hg heads' to see heads, 'hg merge' to merge)
+  $ cd ..
+  $ killdaemons.py
+  $ grep 'sending pullbundle ' repo-server.txt
+  sending pullbundle "0.hg"
+  sending pullbundle "1.hg"
+  sending pullbundle "2.hg"
diff --git a/mercurial/wireproto.py b/mercurial/wireproto.py
--- a/mercurial/wireproto.py
+++ b/mercurial/wireproto.py
@@ -831,6 +831,65 @@ 
     opts = options('debugwireargs', ['three', 'four'], others)
     return repo.debugwireargs(one, two, **pycompat.strkwargs(opts))
 
+def find_pullbundle(repo, opts, clheads, heads, common):
+    def decodehexstring(s):
+        return set([h.decode('hex') for h in s.split(':')])
+
+    def hasdescendant(ancestor, revs):
+        # Quick path for exact match
+        if ancestor in revs:
+            return True
+        for rev in revs:
+            if repo.changelog.isancestor(ancestor, rev):
+                return True
+        return False
+    def hasancestor(descendant, revs):
+        if descendant in revs:
+            return True
+        for rev in revs:
+            if repo.changelog.isancestor(rev, descendant):
+                return True
+        return False
+
+    manifest = repo.vfs.tryread('pullbundles.manifest')
+    res = exchange.parseclonebundlesmanifest(repo, manifest)
+    res = exchange.filterclonebundleentries(repo, res)
+    for entry in res:
+        if 'heads' in entry:
+            try:
+                bundle_heads = decodehexstring(entry['heads'])
+            except TypeError:
+                # Bad heads entry
+                continue
+            if len(bundle_heads) > len(heads):
+                # Client wants fewer heads than the bundle contains
+                continue
+            if bundle_heads.issubset(common):
+                continue # Nothing new
+            if all(hasdescendant(rev, common) for rev in bundle_heads):
+                continue # Still nothing new
+            if any(not hasdescendant(rev, heads) for rev in bundle_heads):
+                continue
+        if 'bases' in entry:
+            try:
+                bundle_bases = decodehexstring(entry['bases'])
+            except TypeError:
+                # Bad bases entry
+                continue
+            if len(bundle_bases) < len(common):
+                # Client is missing a revision the bundle requires
+                continue
+            if all(hasdescendant(rev, common) for rev in bundle_bases):
+                continue
+        path = entry['URL']
+        repo.ui.debug('sending pullbundle "%s"\n' % path)
+        try:
+            return repo.vfs.open(path)
+        except IOError:
+            repo.ui.debug('pullbundle "%s" not accessible\n' % path)
+            continue
+    return None
+
 @wireprotocommand('getbundle', '*')
 def getbundle(repo, proto, others):
     opts = options('getbundle', gboptsmap.keys(), others)
@@ -861,12 +920,20 @@ 
                               hint=bundle2requiredhint)
 
     try:
+        clheads = set(repo.changelog.heads())
+        heads = set(opts.get('heads', set()))
+        common = set(opts.get('common', set()))
+        common.discard(nullid)
+
+        if repo.ui.configbool('server', 'pullbundle'):
+            # Check if a pre-built bundle covers this request.
+            bundle = find_pullbundle(repo, opts, clheads, heads, common)
+            if bundle:
+                return streamres(gen=util.filechunkiter(bundle),
+                                 prefer_uncompressed=True)
+
         if repo.ui.configbool('server', 'disablefullbundle'):
             # Check to see if this is a full clone.
-            clheads = set(repo.changelog.heads())
-            heads = set(opts.get('heads', set()))
-            common = set(opts.get('common', set()))
-            common.discard(nullid)
             if not common and clheads == heads:
                 raise error.Abort(
                     _('server has pull-based clones disabled'),
diff --git a/mercurial/help/config.txt b/mercurial/help/config.txt
--- a/mercurial/help/config.txt
+++ b/mercurial/help/config.txt
@@ -1772,6 +1772,14 @@ 
     are highly recommended. Partial clones will still be allowed.
     (default: False)
 
+``pullbundle``
+    When set, the server will check pullbundles.manifest for bundles
+    covering the requested heads and common nodes. The first matching
+    entry will be streamed to the client.
+
+    For HTTP transport, the stream will still use zlib compression
+    for older clients.
+
 ``concurrent-push-mode``
     Level of allowed race condition between two pushing clients.
 
diff --git a/mercurial/configitems.py b/mercurial/configitems.py
--- a/mercurial/configitems.py
+++ b/mercurial/configitems.py
@@ -877,6 +877,9 @@ 
 coreconfigitem('server', 'disablefullbundle',
     default=False,
 )
+coreconfigitem('server', 'pullbundle',
+    default=False,
+)
 coreconfigitem('server', 'maxhttpheaderlen',
     default=1024,
 )