Patchwork [5,of,5] exchange: refactor APIs to obtain bundle data (API)

login
register
mail settings
Submitter Gregory Szorc
Date Sept. 25, 2016, 8:42 p.m.
Message ID <f6bb4ff0a8c47b099157.1474836137@ubuntu-vm-main>
Download mbox | patch
Permalink /patch/16787/
State Superseded
Delegated to: Pierre-Yves David
Headers show

Comments

Gregory Szorc - Sept. 25, 2016, 8:42 p.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1474830761 25200
#      Sun Sep 25 12:12:41 2016 -0700
# Node ID f6bb4ff0a8c47b099157e615a2874c193ac77512
# Parent  2dc93677a8836d2365b561d1ba79f62ff68a4f23
exchange: refactor APIs to obtain bundle data (API)

Currently, exchange.getbundle() returns either a cg1unpacker or a
util.chunkbuffer (in the case of bundle2). This is kinda OK, as
both expose a .read() to consumers. However, localpeer.getbundle()
has code inferring what the response type is based on arguments and
converts the util.chunkbuffer returned in the bundle2 case to a
bundle2.unbundle20 instance. This is a sign that the API for
exchange.getbundle() is not ideal because it doesn't consistently
return an "unbundler" instance.

In addition, unbundlers mask the fact that there is an underlying
generator of changegroup data. In both cg1 and bundle2, this generator
is being fed into a util.chunkbuffer so it can be re-exposed as a
file object.

util.chunkbuffer is a nice abstraction. However, it should only be
used "at the edges." This is because keeping data as a generator is
more efficient than converting it to a chunkbuffer, especially if we
convert that chunkbuffer back to a generator (as is the case in some
code paths currently).

This patch splits exchange.getbundle() into 2 functions. 1 returns
an iterator over raw chunks (along with a flag saying whether it is
bundle2). The other returns an "unbundler" instance.

Callers of exchange.getbundle() have been updated to use the
appropriate new API.

There is a minor change of behavior in test-getbundle.t. This is
because `hg debuggetbundle` isn't defining bundlecaps. As a result,
a cg1 data stream and unpacker is being produced. This is getting fed
into a new bundle20 instance via bundle2.writebundle(), which uses
a backchannel mechanism between changegroup generation to add the
"nbchanges" part parameter. I never liked this backchannel mechanism
and I plan to remove it someday. `hg bundle` still produces the
"nbchanges" part parameter, so there should be no user-visible
change of behavior. I consider this "regression" a bug in
`hg debuggetbundle`. And that bug is captured by an existing
"TODO" in the code to use bundle2 capabilities.
Pierre-Yves David - Sept. 27, 2016, 1:45 p.m.
On 09/25/2016 10:42 PM, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1474830761 25200
> #      Sun Sep 25 12:12:41 2016 -0700
> # Node ID f6bb4ff0a8c47b099157e615a2874c193ac77512
> # Parent  2dc93677a8836d2365b561d1ba79f62ff68a4f23
> exchange: refactor APIs to obtain bundle data (API)
>
> Currently, exchange.getbundle() returns either a cg1unpacker or a
> util.chunkbuffer (in the case of bundle2). This is kinda OK, as
> both expose a .read() to consumers. However, localpeer.getbundle()
> has code inferring what the response type is based on arguments and
> converts the util.chunkbuffer returned in the bundle2 case to a
> bundle2.unbundle20 instance. This is a sign that the API for
> exchange.getbundle() is not ideal because it doesn't consistently
> return an "unbundler" instance.

I wonder how much this is used in extension. Should we keep the old API 
with a deprecation warning for a version ?

> In addition, unbundlers mask the fact that there is an underlying
> generator of changegroup data. In both cg1 and bundle2, this generator
> is being fed into a util.chunkbuffer so it can be re-exposed as a
> file object.
>
> util.chunkbuffer is a nice abstraction. However, it should only be
> used "at the edges." This is because keeping data as a generator is
> more efficient than converting it to a chunkbuffer, especially if we
> convert that chunkbuffer back to a generator (as is the case in some
> code paths currently).

I know that the chunkbuffer+groupchunk is not removed yet. But can you 
remind up of the expected performance gain when the refactoring is done 
doing?

> This patch splits exchange.getbundle() into 2 functions. 1 returns
> an iterator over raw chunks (along with a flag saying whether it is
> bundle2). The other returns an "unbundler" instance.

Given how few call site we have, I wonder if we really need 2 functions. 
Could we simply move to getbundlechunks? And have the unbundler logic in 
the one place where we needs it.

There seems to already be "I'm requesting" a bundle2 logic here so 
inlining this would be fine. We need to perform such logic in the client 
in all case because of the remote (ssh/http) anyway.

This would allow us to remove the 'isbundle2' flag from getbundlechunks 
too. That flags seems a bit awkward to me.

> Callers of exchange.getbundle() have been updated to use the
> appropriate new API.
>
> There is a minor change of behavior in test-getbundle.t. This is
> because `hg debuggetbundle` isn't defining bundlecaps. As a result,
> a cg1 data stream and unpacker is being produced. This is getting fed
> into a new bundle20 instance via bundle2.writebundle(), which uses
> a backchannel mechanism between changegroup generation to add the
> "nbchanges" part parameter. I never liked this backchannel mechanism
> and I plan to remove it someday. `hg bundle` still produces the
> "nbchanges" part parameter, so there should be no user-visible
> change of behavior. I consider this "regression" a bug in
> `hg debuggetbundle`. And that bug is captured by an existing
> "TODO" in the code to use bundle2 capabilities.

How hard would it be to fix this 'debuggetbundle' thingy beforehand?

>
> diff --git a/mercurial/exchange.py b/mercurial/exchange.py
> --- a/mercurial/exchange.py
> +++ b/mercurial/exchange.py
> @@ -1526,43 +1526,38 @@ def getbundle2partsgenerator(stepname, i
>          return func
>      return dec
>
>  def bundle2requested(bundlecaps):
>      if bundlecaps is not None:
>          return any(cap.startswith('HG2') for cap in bundlecaps)
>      return False
>
> -def getbundle(repo, source, heads=None, common=None, bundlecaps=None,
> -              **kwargs):
> -    """return a full bundle (with potentially multiple kind of parts)
> +def getbundlechunks(repo, source, heads=None, common=None, bundlecaps=None,
> +                    **kwargs):
> +    """Return chunks constituting a bundle's raw data.
>
>      Could be a bundle HG10 or a bundle HG20 depending on bundlecaps
> -    passed. For now, the bundle can contain only changegroup, but this will
> -    changes when more part type will be available for bundle2.
> +    passed.
>
> -    This is different from changegroup.getchangegroup that only returns an HG10
> -    changegroup bundle. They may eventually get reunited in the future when we
> -    have a clearer idea of the API we what to query different data.
> -
> -    The implementation is at a very early stage and will get massive rework
> -    when the API of bundle is refined.
> +    Returns a 2-tuple of (isbundle2, chunks) with the 2nd item being an
> +    iterator over raw chunks (of varying sizes).
>      """
>      usebundle2 = bundle2requested(bundlecaps)
>      # bundle10 case
>      if not usebundle2:
>          if bundlecaps and not kwargs.get('cg', True):
>              raise ValueError(_('request for bundle10 must include changegroup'))
>
>          if kwargs:
>              raise ValueError(_('unsupported getbundle arguments: %s')
>                               % ', '.join(sorted(kwargs.keys())))
>          outgoing = _computeoutgoing(repo, heads, common)
> -        return changegroup.getchangegroup(repo, source, outgoing,
> -                                          bundlecaps=bundlecaps)
> +        bundler = changegroup.getbundler('01', repo, bundlecaps)
> +        return False, changegroup.getsubsetraw(repo, outgoing, bundler, source)
>
>      # bundle20 case
>      b2caps = {}
>      for bcaps in bundlecaps:
>          if bcaps.startswith('bundle2='):
>              blob = urlreq.unquote(bcaps[len('bundle2='):])
>              b2caps.update(bundle2.decodecaps(blob))
>      bundler = bundle2.bundle20(repo.ui, b2caps)
> @@ -1570,17 +1565,32 @@ def getbundle(repo, source, heads=None,
>      kwargs['heads'] = heads
>      kwargs['common'] = common
>
>      for name in getbundle2partsorder:
>          func = getbundle2partsmapping[name]
>          func(bundler, repo, source, bundlecaps=bundlecaps, b2caps=b2caps,
>               **kwargs)
>
> -    return util.chunkbuffer(bundler.getchunks())
> +    return usebundle2, bundler.getchunks()
> +
> +def getunbundler(repo, *args, **kwargs):
> +    """Obtain an unbundler from arguments.
> +
> +    This is essentially a wrapper around ``getbundlechunks()`` that converts
> +    the result to an unbundler instance.
> +
> +    Where performance is critical, ``getbundlechunks()`` should be used.
> +    """
> +    isbundle2, chunks = getbundlechunks(repo, *args, **kwargs)
> +
> +    if isbundle2:
> +        return bundle2.getunbundler(repo.ui, util.chunkbuffer(chunks))
> +    else:
> +        return changegroup.getunbundler('01', util.chunkbuffer(chunks), None)
>
>  @getbundle2partsgenerator('changegroup')
>  def _getbundlechangegrouppart(bundler, repo, source, bundlecaps=None,
>                                b2caps=None, heads=None, common=None, **kwargs):
>      """add a changegroup part to the requested bundle"""
>      cg = None
>      if kwargs.get('cg', True):
>          # build changegroup bundle here.
> diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
> --- a/mercurial/localrepo.py
> +++ b/mercurial/localrepo.py
> @@ -144,24 +144,19 @@ class localpeer(peer.peerrepository):
>      def heads(self):
>          return self._repo.heads()
>
>      def known(self, nodes):
>          return self._repo.known(nodes)
>
>      def getbundle(self, source, heads=None, common=None, bundlecaps=None,
>                    **kwargs):
> -        cg = exchange.getbundle(self._repo, source, heads=heads,
> -                                common=common, bundlecaps=bundlecaps, **kwargs)
> -        if bundlecaps is not None and 'HG20' in bundlecaps:
> -            # When requesting a bundle2, getbundle returns a stream to make the
> -            # wire level function happier. We need to build a proper object
> -            # from it in local peer.
> -            cg = bundle2.getunbundler(self.ui, cg)
> -        return cg
> +        return exchange.getunbundler(self._repo, source, heads=heads,
> +                                     common=common, bundlecaps=bundlecaps,
> +                                     **kwargs)
>
>      # TODO We might want to move the next two calls into legacypeer and add
>      # unbundle instead.
>
>      def unbundle(self, cg, heads, url):
>          """apply a bundle on a repo
>
>          This function handles the repo locking itself."""
> diff --git a/mercurial/wireproto.py b/mercurial/wireproto.py
> --- a/mercurial/wireproto.py
> +++ b/mercurial/wireproto.py
> @@ -767,18 +767,20 @@ def getbundle(repo, proto, others):
>          elif keytype != 'plain':
>              raise KeyError('unknown getbundle option type %s'
>                             % keytype)
>
>      if not bundle1allowed(repo, 'pull'):
>          if not exchange.bundle2requested(opts.get('bundlecaps')):
>              return ooberror(bundle2required)
>
> -    cg = exchange.getbundle(repo, 'serve', **opts)
> -    return streamres(proto.groupchunks(cg))
> +    isbundle2, chunks = exchange.getbundlechunks(repo, 'serve', **opts)
> +    # TODO avoid util.chunkbuffer() here since it is adding overhead to
> +    # what is fundamentally a generator proxying operation.
> +    return streamres(proto.groupchunks(util.chunkbuffer(chunks)))
>
>  @wireprotocommand('heads')
>  def heads(repo, proto):
>      h = repo.heads()
>      return encodelist(h) + "\n"
>
>  @wireprotocommand('hello')
>  def hello(repo, proto):
> diff --git a/tests/test-getbundle.t b/tests/test-getbundle.t
> --- a/tests/test-getbundle.t
> +++ b/tests/test-getbundle.t
> @@ -165,17 +165,17 @@ Get branch and merge:
>    8365676dbab05860ce0d9110f2af51368b961bbd
>    0b2f73f04880d9cb6a5cd8a757f0db0ad01e32c3
>
>  = Test bundle2 =
>
>    $ hg debuggetbundle repo bundle -t bundle2
>    $ hg debugbundle bundle
>    Stream params: {}
> -  changegroup -- "sortdict([('version', '01'), ('nbchanges', '18')])"
> +  changegroup -- "sortdict([('version', '01')])"
>        7704483d56b2a7b5db54dcee7c62378ac629b348
>        29a4d1f17bd3f0779ca0525bebb1cfb51067c738
>        713346a995c363120712aed1aee7e04afd867638
>        d5f6e1ea452285324836a49d7d3c2a63cfed1d31
>        ff42371d57168345fdf1a3aac66a51f6a45d41d2
>        bac16991d12ff45f9dc43c52da1946dfadb83e80
>        6621d79f61b23ec74cf4b69464343d9e0980ec8b
>        8931463777131cd73923e560b760061f2aa8a4bc
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
Augie Fackler - Sept. 27, 2016, 2:20 p.m.
On Tue, Sep 27, 2016 at 03:45:34PM +0200, Pierre-Yves David wrote:
>
>
> On 09/25/2016 10:42 PM, Gregory Szorc wrote:
> ># HG changeset patch
> ># User Gregory Szorc <gregory.szorc@gmail.com>
> ># Date 1474830761 25200
> >#      Sun Sep 25 12:12:41 2016 -0700
> ># Node ID f6bb4ff0a8c47b099157e615a2874c193ac77512
> ># Parent  2dc93677a8836d2365b561d1ba79f62ff68a4f23
> >exchange: refactor APIs to obtain bundle data (API)
> >
> >Currently, exchange.getbundle() returns either a cg1unpacker or a
> >util.chunkbuffer (in the case of bundle2). This is kinda OK, as
> >both expose a .read() to consumers. However, localpeer.getbundle()
> >has code inferring what the response type is based on arguments and
> >converts the util.chunkbuffer returned in the bundle2 case to a
> >bundle2.unbundle20 instance. This is a sign that the API for
> >exchange.getbundle() is not ideal because it doesn't consistently
> >return an "unbundler" instance.
>
> I wonder how much this is used in extension. Should we keep the old API with
> a deprecation warning for a version ?

Spot-checking, hgsubversion, hg-git, and narrowhg all don't need to
touch this. Given the, um, involved jiggery pokery that narrowhg is
doing on bundle streams, I think it's *probably* safe to ditch this
without a deprecation warning period.
Gregory Szorc - Sept. 27, 2016, 4:12 p.m.
On Tue, Sep 27, 2016 at 6:45 AM, Pierre-Yves David <
pierre-yves.david@ens-lyon.org> wrote:

>
>
> On 09/25/2016 10:42 PM, Gregory Szorc wrote:
>
>> # HG changeset patch
>> # User Gregory Szorc <gregory.szorc@gmail.com>
>> # Date 1474830761 25200
>> #      Sun Sep 25 12:12:41 2016 -0700
>> # Node ID f6bb4ff0a8c47b099157e615a2874c193ac77512
>> # Parent  2dc93677a8836d2365b561d1ba79f62ff68a4f23
>> exchange: refactor APIs to obtain bundle data (API)
>>
>> Currently, exchange.getbundle() returns either a cg1unpacker or a
>> util.chunkbuffer (in the case of bundle2). This is kinda OK, as
>> both expose a .read() to consumers. However, localpeer.getbundle()
>> has code inferring what the response type is based on arguments and
>> converts the util.chunkbuffer returned in the bundle2 case to a
>> bundle2.unbundle20 instance. This is a sign that the API for
>> exchange.getbundle() is not ideal because it doesn't consistently
>> return an "unbundler" instance.
>>
>
> I wonder how much this is used in extension. Should we keep the old API
> with a deprecation warning for a version ?
>
> In addition, unbundlers mask the fact that there is an underlying
>> generator of changegroup data. In both cg1 and bundle2, this generator
>> is being fed into a util.chunkbuffer so it can be re-exposed as a
>> file object.
>>
>> util.chunkbuffer is a nice abstraction. However, it should only be
>> used "at the edges." This is because keeping data as a generator is
>> more efficient than converting it to a chunkbuffer, especially if we
>> convert that chunkbuffer back to a generator (as is the case in some
>> code paths currently).
>>
>
> I know that the chunkbuffer+groupchunk is not removed yet. But can you
> remind up of the expected performance gain when the refactoring is done
> doing?
>

I think we could see a 5-10% CPU reduction on the server when everything is
done.


>
> This patch splits exchange.getbundle() into 2 functions. 1 returns
>> an iterator over raw chunks (along with a flag saying whether it is
>> bundle2). The other returns an "unbundler" instance.
>>
>
> Given how few call site we have, I wonder if we really need 2 functions.
> Could we simply move to getbundlechunks? And have the unbundler logic in
> the one place where we needs it.
>
> There seems to already be "I'm requesting" a bundle2 logic here so
> inlining this would be fine. We need to perform such logic in the client in
> all case because of the remote (ssh/http) anyway.
>
> This would allow us to remove the 'isbundle2' flag from getbundlechunks
> too. That flags seems a bit awkward to me.
>

The APIs around cg1 vs bundle2 are wonky and often require the client to
know what is requesting before the fact. In the ideal world callers would
pass "bundlecaps" or some such and get a generic type/interface back. There
is a lot of follow-up improvement that could be made. I didn't want to
scope bloat to fix all the APIs. I think this refactor is a strict
improvement because it provides a facility for accessing the low-level
generator, which we can now use in performance sensitive applications.


>
> Callers of exchange.getbundle() have been updated to use the
>> appropriate new API.
>>
>> There is a minor change of behavior in test-getbundle.t. This is
>> because `hg debuggetbundle` isn't defining bundlecaps. As a result,
>> a cg1 data stream and unpacker is being produced. This is getting fed
>> into a new bundle20 instance via bundle2.writebundle(), which uses
>> a backchannel mechanism between changegroup generation to add the
>> "nbchanges" part parameter. I never liked this backchannel mechanism
>> and I plan to remove it someday. `hg bundle` still produces the
>> "nbchanges" part parameter, so there should be no user-visible
>> change of behavior. I consider this "regression" a bug in
>> `hg debuggetbundle`. And that bug is captured by an existing
>> "TODO" in the code to use bundle2 capabilities.
>>
>
> How hard would it be to fix this 'debuggetbundle' thingy beforehand?
>

debuggetbundle is already incomplete. I didn't want to scope bloat this
work to fix an existing, unrelated deficiency. I'm not sure how much work
fixing debuggetbundle would be.


>
>
>> diff --git a/mercurial/exchange.py b/mercurial/exchange.py
>> --- a/mercurial/exchange.py
>> +++ b/mercurial/exchange.py
>> @@ -1526,43 +1526,38 @@ def getbundle2partsgenerator(stepname, i
>>          return func
>>      return dec
>>
>>  def bundle2requested(bundlecaps):
>>      if bundlecaps is not None:
>>          return any(cap.startswith('HG2') for cap in bundlecaps)
>>      return False
>>
>> -def getbundle(repo, source, heads=None, common=None, bundlecaps=None,
>> -              **kwargs):
>> -    """return a full bundle (with potentially multiple kind of parts)
>> +def getbundlechunks(repo, source, heads=None, common=None,
>> bundlecaps=None,
>> +                    **kwargs):
>> +    """Return chunks constituting a bundle's raw data.
>>
>>      Could be a bundle HG10 or a bundle HG20 depending on bundlecaps
>> -    passed. For now, the bundle can contain only changegroup, but this
>> will
>> -    changes when more part type will be available for bundle2.
>> +    passed.
>>
>> -    This is different from changegroup.getchangegroup that only returns
>> an HG10
>> -    changegroup bundle. They may eventually get reunited in the future
>> when we
>> -    have a clearer idea of the API we what to query different data.
>> -
>> -    The implementation is at a very early stage and will get massive
>> rework
>> -    when the API of bundle is refined.
>> +    Returns a 2-tuple of (isbundle2, chunks) with the 2nd item being an
>> +    iterator over raw chunks (of varying sizes).
>>      """
>>      usebundle2 = bundle2requested(bundlecaps)
>>      # bundle10 case
>>      if not usebundle2:
>>          if bundlecaps and not kwargs.get('cg', True):
>>              raise ValueError(_('request for bundle10 must include
>> changegroup'))
>>
>>          if kwargs:
>>              raise ValueError(_('unsupported getbundle arguments: %s')
>>                               % ', '.join(sorted(kwargs.keys())))
>>          outgoing = _computeoutgoing(repo, heads, common)
>> -        return changegroup.getchangegroup(repo, source, outgoing,
>> -                                          bundlecaps=bundlecaps)
>> +        bundler = changegroup.getbundler('01', repo, bundlecaps)
>> +        return False, changegroup.getsubsetraw(repo, outgoing, bundler,
>> source)
>>
>>      # bundle20 case
>>      b2caps = {}
>>      for bcaps in bundlecaps:
>>          if bcaps.startswith('bundle2='):
>>              blob = urlreq.unquote(bcaps[len('bundle2='):])
>>              b2caps.update(bundle2.decodecaps(blob))
>>      bundler = bundle2.bundle20(repo.ui, b2caps)
>> @@ -1570,17 +1565,32 @@ def getbundle(repo, source, heads=None,
>>      kwargs['heads'] = heads
>>      kwargs['common'] = common
>>
>>      for name in getbundle2partsorder:
>>          func = getbundle2partsmapping[name]
>>          func(bundler, repo, source, bundlecaps=bundlecaps, b2caps=b2caps,
>>               **kwargs)
>>
>> -    return util.chunkbuffer(bundler.getchunks())
>> +    return usebundle2, bundler.getchunks()
>> +
>> +def getunbundler(repo, *args, **kwargs):
>> +    """Obtain an unbundler from arguments.
>> +
>> +    This is essentially a wrapper around ``getbundlechunks()`` that
>> converts
>> +    the result to an unbundler instance.
>> +
>> +    Where performance is critical, ``getbundlechunks()`` should be used.
>> +    """
>> +    isbundle2, chunks = getbundlechunks(repo, *args, **kwargs)
>> +
>> +    if isbundle2:
>> +        return bundle2.getunbundler(repo.ui, util.chunkbuffer(chunks))
>> +    else:
>> +        return changegroup.getunbundler('01', util.chunkbuffer(chunks),
>> None)
>>
>>  @getbundle2partsgenerator('changegroup')
>>  def _getbundlechangegrouppart(bundler, repo, source, bundlecaps=None,
>>                                b2caps=None, heads=None, common=None,
>> **kwargs):
>>      """add a changegroup part to the requested bundle"""
>>      cg = None
>>      if kwargs.get('cg', True):
>>          # build changegroup bundle here.
>> diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
>> --- a/mercurial/localrepo.py
>> +++ b/mercurial/localrepo.py
>> @@ -144,24 +144,19 @@ class localpeer(peer.peerrepository):
>>      def heads(self):
>>          return self._repo.heads()
>>
>>      def known(self, nodes):
>>          return self._repo.known(nodes)
>>
>>      def getbundle(self, source, heads=None, common=None, bundlecaps=None,
>>                    **kwargs):
>> -        cg = exchange.getbundle(self._repo, source, heads=heads,
>> -                                common=common, bundlecaps=bundlecaps,
>> **kwargs)
>> -        if bundlecaps is not None and 'HG20' in bundlecaps:
>> -            # When requesting a bundle2, getbundle returns a stream to
>> make the
>> -            # wire level function happier. We need to build a proper
>> object
>> -            # from it in local peer.
>> -            cg = bundle2.getunbundler(self.ui, cg)
>> -        return cg
>> +        return exchange.getunbundler(self._repo, source, heads=heads,
>> +                                     common=common,
>> bundlecaps=bundlecaps,
>> +                                     **kwargs)
>>
>>      # TODO We might want to move the next two calls into legacypeer and
>> add
>>      # unbundle instead.
>>
>>      def unbundle(self, cg, heads, url):
>>          """apply a bundle on a repo
>>
>>          This function handles the repo locking itself."""
>> diff --git a/mercurial/wireproto.py b/mercurial/wireproto.py
>> --- a/mercurial/wireproto.py
>> +++ b/mercurial/wireproto.py
>> @@ -767,18 +767,20 @@ def getbundle(repo, proto, others):
>>          elif keytype != 'plain':
>>              raise KeyError('unknown getbundle option type %s'
>>                             % keytype)
>>
>>      if not bundle1allowed(repo, 'pull'):
>>          if not exchange.bundle2requested(opts.get('bundlecaps')):
>>              return ooberror(bundle2required)
>>
>> -    cg = exchange.getbundle(repo, 'serve', **opts)
>> -    return streamres(proto.groupchunks(cg))
>> +    isbundle2, chunks = exchange.getbundlechunks(repo, 'serve', **opts)
>> +    # TODO avoid util.chunkbuffer() here since it is adding overhead to
>> +    # what is fundamentally a generator proxying operation.
>> +    return streamres(proto.groupchunks(util.chunkbuffer(chunks)))
>>
>>  @wireprotocommand('heads')
>>  def heads(repo, proto):
>>      h = repo.heads()
>>      return encodelist(h) + "\n"
>>
>>  @wireprotocommand('hello')
>>  def hello(repo, proto):
>> diff --git a/tests/test-getbundle.t b/tests/test-getbundle.t
>> --- a/tests/test-getbundle.t
>> +++ b/tests/test-getbundle.t
>> @@ -165,17 +165,17 @@ Get branch and merge:
>>    8365676dbab05860ce0d9110f2af51368b961bbd
>>    0b2f73f04880d9cb6a5cd8a757f0db0ad01e32c3
>>
>>  = Test bundle2 =
>>
>>    $ hg debuggetbundle repo bundle -t bundle2
>>    $ hg debugbundle bundle
>>    Stream params: {}
>> -  changegroup -- "sortdict([('version', '01'), ('nbchanges', '18')])"
>> +  changegroup -- "sortdict([('version', '01')])"
>>        7704483d56b2a7b5db54dcee7c62378ac629b348
>>        29a4d1f17bd3f0779ca0525bebb1cfb51067c738
>>        713346a995c363120712aed1aee7e04afd867638
>>        d5f6e1ea452285324836a49d7d3c2a63cfed1d31
>>        ff42371d57168345fdf1a3aac66a51f6a45d41d2
>>        bac16991d12ff45f9dc43c52da1946dfadb83e80
>>        6621d79f61b23ec74cf4b69464343d9e0980ec8b
>>        8931463777131cd73923e560b760061f2aa8a4bc
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel@mercurial-scm.org
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>>
>>
> --
> Pierre-Yves David
>
Pierre-Yves David - Sept. 28, 2016, 12:11 p.m.
On 09/27/2016 06:12 PM, Gregory Szorc wrote:
> On Tue, Sep 27, 2016 at 6:45 AM, Pierre-Yves David
> <pierre-yves.david@ens-lyon.org <mailto:pierre-yves.david@ens-lyon.org>>
> wrote:
>
>
>
>     On 09/25/2016 10:42 PM, Gregory Szorc wrote:
>
>         # HG changeset patch
>         # User Gregory Szorc <gregory.szorc@gmail.com
>         <mailto:gregory.szorc@gmail.com>>
>         # Date 1474830761 25200
>         #      Sun Sep 25 12:12:41 2016 -0700
>         # Node ID f6bb4ff0a8c47b099157e615a2874c193ac77512
>         # Parent  2dc93677a8836d2365b561d1ba79f62ff68a4f23
>         exchange: refactor APIs to obtain bundle data (API)
>
>         Currently, exchange.getbundle() returns either a cg1unpacker or a
>         util.chunkbuffer (in the case of bundle2). This is kinda OK, as
>         both expose a .read() to consumers. However, localpeer.getbundle()
>         has code inferring what the response type is based on arguments and
>         converts the util.chunkbuffer returned in the bundle2 case to a
>         bundle2.unbundle20 instance. This is a sign that the API for
>         exchange.getbundle() is not ideal because it doesn't consistently
>         return an "unbundler" instance.
>
>
>     I wonder how much this is used in extension. Should we keep the old
>     API with a deprecation warning for a version ?
>
>         In addition, unbundlers mask the fact that there is an underlying
>         generator of changegroup data. In both cg1 and bundle2, this
>         generator
>         is being fed into a util.chunkbuffer so it can be re-exposed as a
>         file object.
>
>         util.chunkbuffer is a nice abstraction. However, it should only be
>         used "at the edges." This is because keeping data as a generator is
>         more efficient than converting it to a chunkbuffer, especially if we
>         convert that chunkbuffer back to a generator (as is the case in some
>         code paths currently).
>
>
>     I know that the chunkbuffer+groupchunk is not removed yet. But can
>     you remind up of the expected performance gain when the refactoring
>     is done doing?
>
>
> I think we could see a 5-10% CPU reduction on the server when everything
> is done.

That's great, can you includes this projection into the changeset 
description?

>         This patch splits exchange.getbundle() into 2 functions. 1 returns
>         an iterator over raw chunks (along with a flag saying whether it is
>         bundle2). The other returns an "unbundler" instance.
>
>
>     Given how few call site we have, I wonder if we really need 2
>     functions. Could we simply move to getbundlechunks? And have the
>     unbundler logic in the one place where we needs it.
>
>     There seems to already be "I'm requesting" a bundle2 logic here so
>     inlining this would be fine. We need to perform such logic in the
>     client in all case because of the remote (ssh/http) anyway.
>
>     This would allow us to remove the 'isbundle2' flag from
>     getbundlechunks too. That flags seems a bit awkward to me.
>
>
> The APIs around cg1 vs bundle2 are wonky and often require the client to
> know what is requesting before the fact. In the ideal world callers
> would pass "bundlecaps" or some such and get a generic type/interface
> back. There is a lot of follow-up improvement that could be made.

 From this patch, there is only on spot that actually use this 
information. The 'localpeer.getbundle' method. This method is the 
equivalent of the 'wirepeer.getbundle' method. The 'wirepeer' version 
will never be able to access this 'isbundle2' boolean. Because it 
communicate over the wire to a frozen API. As a result, this 'wirepeer' 
version already have logic to handle 'cg1' vs 'bundle2' and will have 
too keep it forever. With this in mind, I think is make sense to 
simplify the main API and keep the "complexity" in the one caller that 
needs, it. We could even factorize the detection of bundle2 between 
'localpeer' and 'wirepeer' and get similar benefit.

All the other caller just care about getting chunk and forwarding them. 
Not using this boolean. I think having a simpler return type (chunks 
instead of tuple) would be better.

> I didn't want to scope bloat to fix all the APIs. I think this refactor is
> a strict improvement because it provides a facility for accessing the
> low-level generator, which we can now use in performance sensitive
> applications.

This refactor is a good small step. Dropping that flag would make it 
even smaller as you are removing existing logic about bundle2 detection.

>         Callers of exchange.getbundle() have been updated to use the
>         appropriate new API.
>
>         There is a minor change of behavior in test-getbundle.t. This is
>         because `hg debuggetbundle` isn't defining bundlecaps. As a result,
>         a cg1 data stream and unpacker is being produced. This is
>         getting fed
>         into a new bundle20 instance via bundle2.writebundle(), which uses
>         a backchannel mechanism between changegroup generation to add the
>         "nbchanges" part parameter. I never liked this backchannel mechanism
>         and I plan to remove it someday. `hg bundle` still produces the
>         "nbchanges" part parameter, so there should be no user-visible
>         change of behavior. I consider this "regression" a bug in
>         `hg debuggetbundle`. And that bug is captured by an existing
>         "TODO" in the code to use bundle2 capabilities.
>
>
>     How hard would it be to fix this 'debuggetbundle' thingy beforehand?
>
>
> debuggetbundle is already incomplete. I didn't want to scope bloat this
> work to fix an existing, unrelated deficiency. I'm not sure how much
> work fixing debuggetbundle would be.

Okay.
At some point one should probably look into debuggetbundle and either 
makes is closer to reality or get ride of it. But I agree this is 
outside of this series scope.

Cheers,

Patch

diff --git a/mercurial/exchange.py b/mercurial/exchange.py
--- a/mercurial/exchange.py
+++ b/mercurial/exchange.py
@@ -1526,43 +1526,38 @@  def getbundle2partsgenerator(stepname, i
         return func
     return dec
 
 def bundle2requested(bundlecaps):
     if bundlecaps is not None:
         return any(cap.startswith('HG2') for cap in bundlecaps)
     return False
 
-def getbundle(repo, source, heads=None, common=None, bundlecaps=None,
-              **kwargs):
-    """return a full bundle (with potentially multiple kind of parts)
+def getbundlechunks(repo, source, heads=None, common=None, bundlecaps=None,
+                    **kwargs):
+    """Return chunks constituting a bundle's raw data.
 
     Could be a bundle HG10 or a bundle HG20 depending on bundlecaps
-    passed. For now, the bundle can contain only changegroup, but this will
-    changes when more part type will be available for bundle2.
+    passed.
 
-    This is different from changegroup.getchangegroup that only returns an HG10
-    changegroup bundle. They may eventually get reunited in the future when we
-    have a clearer idea of the API we what to query different data.
-
-    The implementation is at a very early stage and will get massive rework
-    when the API of bundle is refined.
+    Returns a 2-tuple of (isbundle2, chunks) with the 2nd item being an
+    iterator over raw chunks (of varying sizes).
     """
     usebundle2 = bundle2requested(bundlecaps)
     # bundle10 case
     if not usebundle2:
         if bundlecaps and not kwargs.get('cg', True):
             raise ValueError(_('request for bundle10 must include changegroup'))
 
         if kwargs:
             raise ValueError(_('unsupported getbundle arguments: %s')
                              % ', '.join(sorted(kwargs.keys())))
         outgoing = _computeoutgoing(repo, heads, common)
-        return changegroup.getchangegroup(repo, source, outgoing,
-                                          bundlecaps=bundlecaps)
+        bundler = changegroup.getbundler('01', repo, bundlecaps)
+        return False, changegroup.getsubsetraw(repo, outgoing, bundler, source)
 
     # bundle20 case
     b2caps = {}
     for bcaps in bundlecaps:
         if bcaps.startswith('bundle2='):
             blob = urlreq.unquote(bcaps[len('bundle2='):])
             b2caps.update(bundle2.decodecaps(blob))
     bundler = bundle2.bundle20(repo.ui, b2caps)
@@ -1570,17 +1565,32 @@  def getbundle(repo, source, heads=None, 
     kwargs['heads'] = heads
     kwargs['common'] = common
 
     for name in getbundle2partsorder:
         func = getbundle2partsmapping[name]
         func(bundler, repo, source, bundlecaps=bundlecaps, b2caps=b2caps,
              **kwargs)
 
-    return util.chunkbuffer(bundler.getchunks())
+    return usebundle2, bundler.getchunks()
+
+def getunbundler(repo, *args, **kwargs):
+    """Obtain an unbundler from arguments.
+
+    This is essentially a wrapper around ``getbundlechunks()`` that converts
+    the result to an unbundler instance.
+
+    Where performance is critical, ``getbundlechunks()`` should be used.
+    """
+    isbundle2, chunks = getbundlechunks(repo, *args, **kwargs)
+
+    if isbundle2:
+        return bundle2.getunbundler(repo.ui, util.chunkbuffer(chunks))
+    else:
+        return changegroup.getunbundler('01', util.chunkbuffer(chunks), None)
 
 @getbundle2partsgenerator('changegroup')
 def _getbundlechangegrouppart(bundler, repo, source, bundlecaps=None,
                               b2caps=None, heads=None, common=None, **kwargs):
     """add a changegroup part to the requested bundle"""
     cg = None
     if kwargs.get('cg', True):
         # build changegroup bundle here.
diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
--- a/mercurial/localrepo.py
+++ b/mercurial/localrepo.py
@@ -144,24 +144,19 @@  class localpeer(peer.peerrepository):
     def heads(self):
         return self._repo.heads()
 
     def known(self, nodes):
         return self._repo.known(nodes)
 
     def getbundle(self, source, heads=None, common=None, bundlecaps=None,
                   **kwargs):
-        cg = exchange.getbundle(self._repo, source, heads=heads,
-                                common=common, bundlecaps=bundlecaps, **kwargs)
-        if bundlecaps is not None and 'HG20' in bundlecaps:
-            # When requesting a bundle2, getbundle returns a stream to make the
-            # wire level function happier. We need to build a proper object
-            # from it in local peer.
-            cg = bundle2.getunbundler(self.ui, cg)
-        return cg
+        return exchange.getunbundler(self._repo, source, heads=heads,
+                                     common=common, bundlecaps=bundlecaps,
+                                     **kwargs)
 
     # TODO We might want to move the next two calls into legacypeer and add
     # unbundle instead.
 
     def unbundle(self, cg, heads, url):
         """apply a bundle on a repo
 
         This function handles the repo locking itself."""
diff --git a/mercurial/wireproto.py b/mercurial/wireproto.py
--- a/mercurial/wireproto.py
+++ b/mercurial/wireproto.py
@@ -767,18 +767,20 @@  def getbundle(repo, proto, others):
         elif keytype != 'plain':
             raise KeyError('unknown getbundle option type %s'
                            % keytype)
 
     if not bundle1allowed(repo, 'pull'):
         if not exchange.bundle2requested(opts.get('bundlecaps')):
             return ooberror(bundle2required)
 
-    cg = exchange.getbundle(repo, 'serve', **opts)
-    return streamres(proto.groupchunks(cg))
+    isbundle2, chunks = exchange.getbundlechunks(repo, 'serve', **opts)
+    # TODO avoid util.chunkbuffer() here since it is adding overhead to
+    # what is fundamentally a generator proxying operation.
+    return streamres(proto.groupchunks(util.chunkbuffer(chunks)))
 
 @wireprotocommand('heads')
 def heads(repo, proto):
     h = repo.heads()
     return encodelist(h) + "\n"
 
 @wireprotocommand('hello')
 def hello(repo, proto):
diff --git a/tests/test-getbundle.t b/tests/test-getbundle.t
--- a/tests/test-getbundle.t
+++ b/tests/test-getbundle.t
@@ -165,17 +165,17 @@  Get branch and merge:
   8365676dbab05860ce0d9110f2af51368b961bbd
   0b2f73f04880d9cb6a5cd8a757f0db0ad01e32c3
 
 = Test bundle2 =
 
   $ hg debuggetbundle repo bundle -t bundle2
   $ hg debugbundle bundle
   Stream params: {}
-  changegroup -- "sortdict([('version', '01'), ('nbchanges', '18')])"
+  changegroup -- "sortdict([('version', '01')])"
       7704483d56b2a7b5db54dcee7c62378ac629b348
       29a4d1f17bd3f0779ca0525bebb1cfb51067c738
       713346a995c363120712aed1aee7e04afd867638
       d5f6e1ea452285324836a49d7d3c2a63cfed1d31
       ff42371d57168345fdf1a3aac66a51f6a45d41d2
       bac16991d12ff45f9dc43c52da1946dfadb83e80
       6621d79f61b23ec74cf4b69464343d9e0980ec8b
       8931463777131cd73923e560b760061f2aa8a4bc