Patchwork [1,of,9,changegroup-apis] exchange: use early return

login
register
mail settings
Submitter Gregory Szorc
Date Aug. 1, 2016, 6:18 p.m.
Message ID <18275f2b6d2b8fcb50ec.1470075498@ubuntu-vm-main>
Download mbox | patch
Permalink /patch/16015/
State Changes Requested
Delegated to: Pierre-Yves David
Headers show

Comments

Gregory Szorc - Aug. 1, 2016, 6:18 p.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1469900681 25200
#      Sat Jul 30 10:44:41 2016 -0700
# Node ID 18275f2b6d2b8fcb50ecaf331011e3d233ca1cfa
# Parent  73ff159923c1f05899c27238409ca398342d9ae0
exchange: use early return

Avoid excessive indentation and confusing control flow by returning
early if there is no work to be done.
Gregory Szorc - Aug. 1, 2016, 6:20 p.m.
This is the first half of the series. The second half can be found at
https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/26eeb1e66e88 if
someone wants to review the whole thing.

Word of warning: if you pull that repo, you'll get tons of obsolescence
markers. You may want to disable obsolescence when pulling.

On Mon, Aug 1, 2016 at 11:18 AM, Gregory Szorc <gregory.szorc@gmail.com>
wrote:

> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1469900681 25200
> #      Sat Jul 30 10:44:41 2016 -0700
> # Node ID 18275f2b6d2b8fcb50ecaf331011e3d233ca1cfa
> # Parent  73ff159923c1f05899c27238409ca398342d9ae0
> exchange: use early return
>
> Avoid excessive indentation and confusing control flow by returning
> early if there is no work to be done.
>
> diff --git a/mercurial/exchange.py b/mercurial/exchange.py
> --- a/mercurial/exchange.py
> +++ b/mercurial/exchange.py
> @@ -1547,39 +1547,42 @@ def getbundle(repo, source, heads=None,
>               **kwargs)
>
>      return util.chunkbuffer(bundler.getchunks())
>
>  @getbundle2partsgenerator('changegroup')
>  def _getbundlechangegrouppart(bundler, repo, source, bundlecaps=None,
>                                b2caps=None, heads=None, common=None,
> **kwargs):
>      """add a changegroup part to the requested bundle"""
> -    cg = None
> -    if kwargs.get('cg', True):
> -        # build changegroup bundle here.
> -        version = '01'
> -        cgversions = b2caps.get('changegroup')
> -        if cgversions:  # 3.1 and 3.2 ship with an empty value
> -            cgversions = [v for v in cgversions
> -                          if v in
> changegroup.supportedoutgoingversions(repo)]
> -            if not cgversions:
> -                raise ValueError(_('no common changegroup version'))
> -            version = max(cgversions)
> -        outgoing = changegroup.computeoutgoing(repo, heads, common)
> -        cg = changegroup.getlocalchangegroupraw(repo, source, outgoing,
> -                                                bundlecaps=bundlecaps,
> -                                                version=version)
> +    if not kwargs.get('cg', True):
> +        return
>
> -    if cg:
> -        part = bundler.newpart('changegroup', data=cg)
> -        if cgversions:
> -            part.addparam('version', version)
> -        part.addparam('nbchanges', str(len(outgoing.missing)),
> mandatory=False)
> -        if 'treemanifest' in repo.requirements:
> -            part.addparam('treemanifest', '1')
> +    # build changegroup bundle here.
> +    version = '01'
> +    cgversions = b2caps.get('changegroup')
> +    if cgversions:  # 3.1 and 3.2 ship with an empty value
> +        cgversions = [v for v in cgversions
> +                      if v in changegroup.supportedoutgoingversions(repo)]
> +        if not cgversions:
> +            raise ValueError(_('no common changegroup version'))
> +        version = max(cgversions)
> +    outgoing = changegroup.computeoutgoing(repo, heads, common)
> +    cg = changegroup.getlocalchangegroupraw(repo, source, outgoing,
> +                                            bundlecaps=bundlecaps,
> +                                            version=version)
> +
> +    if not cg:
> +        return
> +
> +    part = bundler.newpart('changegroup', data=cg)
> +    if cgversions:
> +        part.addparam('version', version)
> +    part.addparam('nbchanges', str(len(outgoing.missing)),
> mandatory=False)
> +    if 'treemanifest' in repo.requirements:
> +        part.addparam('treemanifest', '1')
>
>  @getbundle2partsgenerator('listkeys')
>  def _getbundlelistkeysparts(bundler, repo, source, bundlecaps=None,
>                              b2caps=None, **kwargs):
>      """add parts containing listkeys namespaces to the requested bundle"""
>      listkeys = kwargs.get('listkeys', ())
>      for namespace in listkeys:
>          part = bundler.newpart('listkeys')
>
Pierre-Yves David - Aug. 2, 2016, 1:44 a.m.
On 08/01/2016 08:20 PM, Gregory Szorc wrote:
> This is the first half of the series. The second half can be found at
> https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/26eeb1e66e88 if
> someone wants to review the whole thing.

I've made a first review pass on this series. As much as I like the idea
of cleaning this up, the proposed implementation makes me uncomfortable.
The main points that troubles me are:

* the extra new class is a bit of a mix of (1) outgoing object (2) cg1
bundler object. This makes a the boundary of these object more fuzzy and
I not convinced yet your cleanup could not happen with the existing objects,

* many @class method usage while our coding style usually avoid them
along other "class as namespace" usage (as per Matt taste),

* the
superlongnameforeverything.thatcombinewithclassnamspacing.foroverninethousand
charline we could probably me more compact using a abbreviation for
common parts as we do elsewhere in the code.

Let me do a full scan of the second part of the series tomorrow and then
I either understand the purity of your way, or find some super precise
feed back, or we might discuss it over voice.

> Word of warning: if you pull that repo, you'll get tons of obsolescence
> markers. You may want to disable obsolescence when pulling.

These are legit markers that do not exist in the repository because we
use email and we somewhat managed to not pull from you in the past. We
should be okay pushing them to the main server (the marker/changeset
ratio is a bit high but fine).

Cheers,
Gregory Szorc - Aug. 2, 2016, 2:50 a.m.
On Mon, Aug 1, 2016 at 6:44 PM, Pierre-Yves David <
pierre-yves.david@ens-lyon.org> wrote:

>
>
> On 08/01/2016 08:20 PM, Gregory Szorc wrote:
> > This is the first half of the series. The second half can be found at
> > https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/26eeb1e66e88 if
> > someone wants to review the whole thing.
>
> I've made a first review pass on this series. As much as I like the idea
> of cleaning this up, the proposed implementation makes me uncomfortable.
> The main points that troubles me are:
>
> * the extra new class is a bit of a mix of (1) outgoing object (2) cg1
> bundler object. This makes a the boundary of these object more fuzzy


Not sure what you mean here. An outgoing object is basically a fancy tuple
holding the set of common and missing nodes. I'm not sure what a "cg1
bundler" is. A non-bundle2 is just a v1 changegroup with a 6 byte header.

If you are referring to returning a cg<N>unpacker instance, I'm not a huge
fan of that. I only did that for parity with existing consumers because I
didn't want to scope bloat to create a 40+ patch series. I do intend to do
some follow-up work here because our handling of changegroup generation is
just bonkers. For example, sending a bundle 2 to the wire protocol with the
current code involves:

1. Construct a cg<N>packer and call generate() to obtain a generator of
changegroup chunks
2. Store a reference to the generator on a bundle2 part
3. Convert generator to a util.chunkbuffer (a file-like object) as part of
bundlepart._payloadchunks() when serializing the bundle2 part
4. Emit a generator of 4k chunks because that's what the
bundlepart.getchunks() does
5. Feed this generator of chunks into a util.chunkbuffer holding the entire
bundle2 data (in exchange.getbundle())
6. Construct a zlib compressor
7. read(4096) from the util.chunkbuffer into the zlib compressor
8. Emit a generator of zlib compressed chunks to be sent to WSGI

util.chunkbuffer overhead has appeared in profiling results when cloning
Firefox. The thing is that none of these util.chunkbuffer are really
necessary for the wire protocol case: we could feed a generator of chunks
of various sizes into zlib then emit whatever chunks it spits out, possibly
normalizing them to equal sizes if we really wanted.

I think the end state of this series involves a "changegroup emitter" being
fed into some kind of "bundler," preferably as a generator. Using the
cg1unpacker for this purpose is kind of silly, as it necessitates a
util.chunkbuffer. The good news is the "changegroupemitter" as it stands
can easily be represented as both a generator of chunks or a file handle.
So by introducing this object and requiring callers to use it, we can start
transitioning callers to not use util.chunkbuffer without having to change
APIs on changegroup.py. It's much more flexible.



> and I not convinced yet your cleanup could not happen with the existing
> objects,
>

The existing functions are horrible and not very useful. As this series
demonstrates, we had something like 6 functions producing either:

* a generator of changegroup chunks
* a cg<N>unpacker instance

from

* an outgoing instance
* a set of bases and heads
* a set common nodes and heads

sometimes with knobs to control the changegroup version.

The API was unclear and far too low-level (e.g. sometimes you wanted to
access the number of changesets or the nodes in the changegroup afterwards
and the API hid these from you). To be honest, it wasn't clear to me that
this was how things worked until I wrote the series. That's not a good sign.

The new API solves makes thing simpler by allowing you to construct a
stateful object to representing a generated changegroup and then get a
changegroup representation various ways. It is much more explicit.


> * many @class method usage while our coding style usually avoid them
> along other "class as namespace" usage (as per Matt taste),
>

That's fine. These can be pulled out into module-level functions. e.g.
emitterfromrootsandheads(repo, roots, heads) -> changegroupemitter


>
> * the
>
> superlongnameforeverything.thatcombinewithclassnamspacing.foroverninethousand
> charline we could probably me more compact using a abbreviation for
> common parts as we do elsewhere in the code.
>

I agree the names are a bit long. But I'm not a fan of abbreviations
because they make readability hard, especially for those who don't speak
English natively.

I think remaining "changegroupemitter" to "emitter" (because "changegroup"
is redundant with the module name) and moving the @classmethods to module
functions will make things much better. e.g.

emitter = changegroup.changegroupemitter.fromheadsandcommon(repo, heads, common)
emitter = changegroup.emitterfromheadsandcommon(repo, heads, common)

Or if wanted to write a complex __init__ function, we could have it accept
all arguments. But I don't like complexity and prefer separate
"constructor" functions.


> Let me do a full scan of the second part of the series tomorrow and then
> I either understand the purity of your way, or find some super precise
> feed back, or we might discuss it over voice.
>
> > Word of warning: if you pull that repo, you'll get tons of obsolescence
> > markers. You may want to disable obsolescence when pulling.
>
> These are legit markers that do not exist in the repository because we
> use email and we somewhat managed to not pull from you in the past. We
> should be okay pushing them to the main server (the marker/changeset
> ratio is a bit high but fine).
>
> Cheers,
>
> --
> Pierre-Yves David
>
Pierre-Yves David - Aug. 3, 2016, 8:56 a.m.
On 08/02/2016 04:50 AM, Gregory Szorc wrote:
> On Mon, Aug 1, 2016 at 6:44 PM, Pierre-Yves David
> <pierre-yves.david@ens-lyon.org <mailto:pierre-yves.david@ens-lyon.org>>
> wrote:
> 
>     On 08/01/2016 08:20 PM, Gregory Szorc wrote:
>     > This is the first half of the series. The second half can be found at
>     > https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/26eeb1e66e88 if
>     > someone wants to review the whole thing.
> 
>     I've made a first review pass on this series. As much as I like the idea
>     of cleaning this up, the proposed implementation makes me uncomfortable.
>     The main points that troubles me are:
> 
>     * the extra new class is a bit of a mix of (1) outgoing object (2) cg1
>     bundler object. This makes a the boundary of these object more fuzzy
> 
> 
> Not sure what you mean here. An outgoing object is basically a fancy
> tuple holding the set of common and missing nodes.

My point here is that Outgoing is a class meant to represent this
information, across various user. We should aim at increasing it usage
and visibility instead of hiding it under custom API exposing part of
it. For example, the 'emitter.changesetcount' should probably just be
`emiter.outgoing` with outgoing exposed elsewhere (eg: pushoperation)
programmer will eventually just go "ho, an outgoing object, I know that".

> I'm not sure what a "cg1 bundler" is. A non-bundle2 is just a v1 changegroup with a 6 byte
> header.
> 
> If you are referring to returning a cg<N>unpacker instance, I'm not a
> huge fan of that. I only did that for parity with existing consumers
> because I didn't want to scope bloat to create a 40+ patch series.

My point is that we already have class emiting changegroup data. I'm
aware this area is not great. And I trust you in you will to clean up
all this. However, it is currently unclear to me why we cannot improve
them to fit your needs instead of introducing a new class.

> I do intend to do some follow-up work here because our handling of
> changegroup generation is just bonkers. For example, sending a bundle 2
> to the wire protocol with the current code involves:
> 
> 1. Construct a cg<N>packer and call generate() to obtain a generator of
> changegroup chunks
> 2. Store a reference to the generator on a bundle2 part
> 3. Convert generator to a util.chunkbuffer (a file-like object) as part
> of bundlepart._payloadchunks() when serializing the bundle2 part
> 4. Emit a generator of 4k chunks because that's what the
> bundlepart.getchunks() does
> 5. Feed this generator of chunks into a util.chunkbuffer holding the
> entire bundle2 data (in exchange.getbundle())
> 6. Construct a zlib compressor
> 7. read(4096) from the util.chunkbuffer into the zlib compressor
> 8. Emit a generator of zlib compressed chunks to be sent to WSGI

To make sure I'm clear, I'm not challenging you diagnostic that there is
improvement to be made. In your example. there is multiple things
unrelated to the changegroup generation. I'm usually a proponent of
improving existing layers instead of introducing a new ones (does not
mean you cannot convince me that you approach here is right).

> util.chunkbuffer overhead has appeared in profiling results when cloning
> Firefox. The thing is that none of these util.chunkbuffer are really
> necessary for the wire protocol case: we could feed a generator of
> chunks of various sizes into zlib then emit whatever chunks it spits
> out, possibly normalizing them to equal sizes if we really wanted.

Yep, there is a lot of small improvement we could do. As you point out
the step (5) (6) could probably be easily simplified. But this seems
unrelated to the current rework of changegroup emition. Is it?

> I think the end state of this series involves a "changegroup emitter"
> being fed into some kind of "bundler," preferably as a generator. Using
> the cg1unpacker for this purpose is kind of silly, as it necessitates a
> util.chunkbuffer. The good news is the "changegroupemitter" as it stands
> can easily be represented as both a generator of chunks or a file
> handle. So by introducing this object and requiring callers to use it,
> we can start transitioning callers to not use util.chunkbuffer without
> having to change APIs on changegroup.py. It's much more flexible.

This paragraph highlight what seems to be the most important thing to
address in this discussion. You seems to try to avoid touching the
existing changegroup API as much as possible and I do not understand
why. My general feeling is that it should be possible to upgrade the
existing class to fit your needs, but your seems to think otherwise. I'm
curious to find out why as you usually have pretty good reason for doing
things?

>     and I not convinced yet your cleanup could not happen with the
>     existing objects,
> 
> 
> The existing functions are horrible and not very useful. As this series
> demonstrates, we had something like 6 functions producing either:
> 
> * a generator of changegroup chunks
> * a cg<N>unpacker instance
> 
> from
> 
> * an outgoing instance
> * a set of bases and heads
> * a set common nodes and heads
> 
> sometimes with knobs to control the changegroup version.
> 
> The API was unclear and far too low-level (e.g. sometimes you wanted to
> access the number of changesets or the nodes in the changegroup
> afterwards and the API hid these from you). To be honest, it wasn't
> clear to me that this was how things worked until I wrote the series.
> That's not a good sign.

(Reminder: I'm aware the current code could use improvement)

You new code seems to keep the 3 options for specifying the outgoing
set. So I assume you focussed on getting a simpler return type. Can you
elaborate on why do we need multiple return type in the first place?

Also, could we simplify the other side of this. For example we could
just keep the outgoing instance option to keep this layer simpler.

> The new API solves makes thing simpler by allowing you to construct a
> stateful object to representing a generated changegroup and then get a
> changegroup representation various ways. It is much more explicit.

"Stateful" seems to be an important key to lift my confusion. Can you
elaborate?

>     * many @class method usage while our coding style usually avoid them
>     along other "class as namespace" usage (as per Matt taste),
> 
> 
> That's fine. These can be pulled out into module-level functions. e.g.
> emitterfromrootsandheads(repo, roots, heads) -> changegroupemitter
>  
> 
> 
>     * the
>     superlongnameforeverything.thatcombinewithclassnamspacing.foroverninethousand
>     charline we could probably me more compact using a abbreviation for
>     common parts as we do elsewhere in the code.
> 
> 
> I agree the names are a bit long. But I'm not a fan of abbreviations
> because they make readability hard, especially for those who don't speak
> English natively.

We use common abbreviation in the Mercurial codebase in many place: rev,
ctx, cg, obs, etc… I'm non-native speaker and code well with that. We
also have a wiki page with some details on those. We could update it and
point people to it more.

https://www.mercurial-scm.org/wiki/CodingStyle#Naming_conventions

> I think remaining "changegroupemitter" to "emitter" (because
> "changegroup" is redundant with the module name) and moving the
> @classmethods to module functions will make things much better. e.g.
> 
> emitter = changegroup.changegroupemitter.fromheadsandcommon(repo, heads, common)
> 
> emitter = changegroup.emitterfromheadsandcommon(repo, heads, common)

I agree, that would be a good step forward

> 
> Or if wanted to write a complex __init__ function, we could have it
> accept all arguments. But I don't like complexity and prefer separate
> "constructor" functions.

It seems like we have 4 possible arguments: outgoing, common, base,
heads. The complexity seems manageable. We could maybe have a factoring
function for outgoing working that way and just delegate to it in the
constructor.

Cheers,
Gregory Szorc - Aug. 4, 2016, 3:53 a.m.
On Wed, Aug 3, 2016 at 1:56 AM, Pierre-Yves David <
pierre-yves.david@ens-lyon.org> wrote:

> On 08/02/2016 04:50 AM, Gregory Szorc wrote:
> > On Mon, Aug 1, 2016 at 6:44 PM, Pierre-Yves David
> > <pierre-yves.david@ens-lyon.org <mailto:pierre-yves.david@ens-lyon.org>>
> > wrote:
> >
> >     On 08/01/2016 08:20 PM, Gregory Szorc wrote:
> >     > This is the first half of the series. The second half can be found
> at
> >     >
> https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/26eeb1e66e88 if
> >     > someone wants to review the whole thing.
> >
> >     I've made a first review pass on this series. As much as I like the
> idea
> >     of cleaning this up, the proposed implementation makes me
> uncomfortable.
> >     The main points that troubles me are:
> >
> >     * the extra new class is a bit of a mix of (1) outgoing object (2)
> cg1
> >     bundler object. This makes a the boundary of these object more fuzzy
> >
> >
> > Not sure what you mean here. An outgoing object is basically a fancy
> > tuple holding the set of common and missing nodes.
>
> My point here is that Outgoing is a class meant to represent this
> information, across various user. We should aim at increasing it usage
> and visibility instead of hiding it under custom API exposing part of
> it. For example, the 'emitter.changesetcount' should probably just be
> `emiter.outgoing` with outgoing exposed elsewhere (eg: pushoperation)
> programmer will eventually just go "ho, an outgoing object, I know that".
>

I agree that discovery.outgoing is a decent type to represent changesets
that will be "sent" to another location or will represent the contents of a
changegroup. I think most of the code currently in
changegroup.changegroupsubset() should move to the discovery module. I
*don't* think discovery.outgoing instances should gain APIs to emit
changegroups, however. Instead, outgoing instances are things to be passed
to a changegroup generator instructing it how to behave.


>
> > I'm not sure what a "cg1 bundler" is. A non-bundle2 is just a v1
> changegroup with a 6 byte
> > header.
> >
> > If you are referring to returning a cg<N>unpacker instance, I'm not a
> > huge fan of that. I only did that for parity with existing consumers
> > because I didn't want to scope bloat to create a 40+ patch series.
>
> My point is that we already have class emiting changegroup data. I'm
> aware this area is not great. And I trust you in you will to clean up
> all this. However, it is currently unclear to me why we cannot improve
> them to fit your needs instead of introducing a new class.
>

I wanted to establish a common class to represent "pending changegroup
data" so that consumers all operated on a common type and could access
changegroup data any way they saw fit *and* in an explicit manner.. I view
this as simpler than requiring clients to hold onto a reference to sets of
nodes (or an outgoing instance) along with a generator or chunkbuffer. The
explicit part is important: the changegroup functions today have very
opaque return values. e.g. sometimes a "cg" variable refers to a generator
of chunks. Sometimes it is a chunkbuffer. By having a class where you call
"getX()" it is much easier to understand what the code is doing.


>
> > I do intend to do some follow-up work here because our handling of
> > changegroup generation is just bonkers. For example, sending a bundle 2
> > to the wire protocol with the current code involves:
> >
> > 1. Construct a cg<N>packer and call generate() to obtain a generator of
> > changegroup chunks
> > 2. Store a reference to the generator on a bundle2 part
> > 3. Convert generator to a util.chunkbuffer (a file-like object) as part
> > of bundlepart._payloadchunks() when serializing the bundle2 part
> > 4. Emit a generator of 4k chunks because that's what the
> > bundlepart.getchunks() does
> > 5. Feed this generator of chunks into a util.chunkbuffer holding the
> > entire bundle2 data (in exchange.getbundle())
> > 6. Construct a zlib compressor
> > 7. read(4096) from the util.chunkbuffer into the zlib compressor
> > 8. Emit a generator of zlib compressed chunks to be sent to WSGI
>
> To make sure I'm clear, I'm not challenging you diagnostic that there is
> improvement to be made. In your example. there is multiple things
> unrelated to the changegroup generation. I'm usually a proponent of
> improving existing layers instead of introducing a new ones (does not
> mean you cannot convince me that you approach here is right).
>

I argue I improved the changegroup API layer by consolidating 8 highly
redundant functions into a single new class that does everything they did
and more :)


>
> > util.chunkbuffer overhead has appeared in profiling results when cloning
> > Firefox. The thing is that none of these util.chunkbuffer are really
> > necessary for the wire protocol case: we could feed a generator of
> > chunks of various sizes into zlib then emit whatever chunks it spits
> > out, possibly normalizing them to equal sizes if we really wanted.
>
> Yep, there is a lot of small improvement we could do. As you point out
> the step (5) (6) could probably be easily simplified. But this seems
> unrelated to the current rework of changegroup emition. Is it?
>

It's difficult to do now because it isn't clear what each changegroup
function does or when it is appropriate to call.

>
>
> > I think the end state of this series involves a "changegroup emitter"
> > being fed into some kind of "bundler," preferably as a generator. Using
> > the cg1unpacker for this purpose is kind of silly, as it necessitates a
> > util.chunkbuffer. The good news is the "changegroupemitter" as it stands
> > can easily be represented as both a generator of chunks or a file
> > handle. So by introducing this object and requiring callers to use it,
> > we can start transitioning callers to not use util.chunkbuffer without
> > having to change APIs on changegroup.py. It's much more flexible.
>
> This paragraph highlight what seems to be the most important thing to
> address in this discussion. You seems to try to avoid touching the
> existing changegroup API as much as possible and I do not understand
> why. My general feeling is that it should be possible to upgrade the
> existing class to fit your needs, but your seems to think otherwise. I'm
> curious to find out why as you usually have pretty good reason for doing
> things?
>

I avoided changing the cg<N>packer and cg<N>unpacker classes because I
didn't want to write a 40 part series :)

The cg<N>unpacker class in particular is a bit gnarly because it performs
multiple tasks:

1) Exposes a file API from a possibly compressed input file object (read,
seek, tell, close)
2) Defines the composition of the changegroup framing protocol
3) Converts a file object to a generator of changegroup chunks
4) Applies a changegroup to a repo

This class is doing too much and needs to be split. To be honest, I'm not
sure how. I do know refactoring it will be a long series :)



>
> >     and I not convinced yet your cleanup could not happen with the
> >     existing objects,
> >
> >
> > The existing functions are horrible and not very useful. As this series
> > demonstrates, we had something like 6 functions producing either:
> >
> > * a generator of changegroup chunks
> > * a cg<N>unpacker instance
> >
> > from
> >
> > * an outgoing instance
> > * a set of bases and heads
> > * a set common nodes and heads
> >
> > sometimes with knobs to control the changegroup version.
> >
> > The API was unclear and far too low-level (e.g. sometimes you wanted to
> > access the number of changesets or the nodes in the changegroup
> > afterwards and the API hid these from you). To be honest, it wasn't
> > clear to me that this was how things worked until I wrote the series.
> > That's not a good sign.
>
> (Reminder: I'm aware the current code could use improvement)
>
> You new code seems to keep the 3 options for specifying the outgoing
> set. So I assume you focussed on getting a simpler return type. Can you
> elaborate on why do we need multiple return type in the first place?
>

Flexibility for callers. Now they can do things like obtain multiple
changegroup version representations from the same type. Perhaps you want to
write a server extension that stashes both a v1 and v2 bundle from pushed
data. It's now much simpler to do that.


>
> Also, could we simplify the other side of this. For example we could
> just keep the outgoing instance option to keep this layer simpler.
>
> > The new API solves makes thing simpler by allowing you to construct a
> > stateful object to representing a generated changegroup and then get a
> > changegroup representation various ways. It is much more explicit.
>
> "Stateful" seems to be an important key to lift my confusion. Can you
> elaborate?
>

Basically an entity that lets you obtain a representation of a changegroup
N ways without having to muck about low-level changegroup.py functions.


>
> >     * many @class method usage while our coding style usually avoid them
> >     along other "class as namespace" usage (as per Matt taste),
> >
> >
> > That's fine. These can be pulled out into module-level functions. e.g.
> > emitterfromrootsandheads(repo, roots, heads) -> changegroupemitter
> >
> >
> >
> >     * the
> >
>  superlongnameforeverything.thatcombinewithclassnamspacing.foroverninethousand
> >     charline we could probably me more compact using a abbreviation for
> >     common parts as we do elsewhere in the code.
> >
> >
> > I agree the names are a bit long. But I'm not a fan of abbreviations
> > because they make readability hard, especially for those who don't speak
> > English natively.
>
> We use common abbreviation in the Mercurial codebase in many place: rev,
> ctx, cg, obs, etc… I'm non-native speaker and code well with that. We
> also have a wiki page with some details on those. We could update it and
> point people to it more.
>
> https://www.mercurial-scm.org/wiki/CodingStyle#Naming_conventions
>
> > I think remaining "changegroupemitter" to "emitter" (because
> > "changegroup" is redundant with the module name) and moving the
> > @classmethods to module functions will make things much better. e.g.
> >
> > emitter = changegroup.changegroupemitter.fromheadsandcommon(repo, heads,
> common)
> >
> > emitter = changegroup.emitterfromheadsandcommon(repo, heads, common)
>
> I agree, that would be a good step forward
>
> >
> > Or if wanted to write a complex __init__ function, we could have it
> > accept all arguments. But I don't like complexity and prefer separate
> > "constructor" functions.
>
> It seems like we have 4 possible arguments: outgoing, common, base,
> heads. The complexity seems manageable. We could maybe have a factoring
> function for outgoing working that way and just delegate to it in the
> constructor.
>
> Cheers,
>
> --
> Pierre-Yves David
>
Gregory Szorc - Aug. 4, 2016, 2:44 p.m.
On Aug 4, 2016, at 06:58, Kevin Bullock <kbullock+mercurial@ringworld.org> wrote:

>> On Aug 3, 2016, at 22:53, Gregory Szorc <gregory.szorc@gmail.com> wrote:
>> 
>>>> On Wed, Aug 3, 2016 at 1:56 AM, Pierre-Yves David <pierre-yves.david@ens-lyon.org> wrote:
>>>> On 08/02/2016 04:50 AM, Gregory Szorc wrote:
>>>> 
>>>> I think the end state of this series involves a "changegroup emitter"
>>>> being fed into some kind of "bundler," preferably as a generator. Using
>>>> the cg1unpacker for this purpose is kind of silly, as it necessitates a
>>>> util.chunkbuffer. The good news is the "changegroupemitter" as it stands
>>>> can easily be represented as both a generator of chunks or a file
>>>> handle. So by introducing this object and requiring callers to use it,
>>>> we can start transitioning callers to not use util.chunkbuffer without
>>>> having to change APIs on changegroup.py. It's much more flexible.
>>> This paragraph highlight what seems to be the most important thing to
>>> address in this discussion. You seems to try to avoid touching the
>>> existing changegroup API as much as possible and I do not understand
>>> why. My general feeling is that it should be possible to upgrade the
>>> existing class to fit your needs, but your seems to think otherwise. I'm
>>> curious to find out why as you usually have pretty good reason for doing
>>> things?
>> I avoided changing the cg<N>packer and cg<N>unpacker classes because I didn't want to write a 40 part series :)
>> 
>> The cg<N>unpacker class in particular is a bit gnarly because it performs multiple tasks:
>> 
>> 1) Exposes a file API from a possibly compressed input file object (read, seek, tell, close)
>> 2) Defines the composition of the changegroup framing protocol
>> 3) Converts a file object to a generator of changegroup chunks
>> 4) Applies a changegroup to a repo
>> 
>> This class is doing too much and needs to be split. To be honest, I'm not sure how. I do know refactoring it will be a long series :)
> 
> Sounds like we should come up with a strategy for this at the sprint, if it's not resolved by then. I've added it to the wiki page.

I want to work this out prior, as it's related to some performance improvements I'd like to get into the next release. I think Pierre-Yves and I can find time to agree before October :)
Pierre-Yves David - Aug. 6, 2016, 10:02 p.m.
On 08/04/2016 05:53 AM, Gregory Szorc wrote:
> On Wed, Aug 3, 2016 at 1:56 AM, Pierre-Yves David
> <pierre-yves.david@ens-lyon.org <mailto:pierre-yves.david@ens-lyon.org>>
> wrote:
>
>     On 08/02/2016 04:50 AM, Gregory Szorc wrote:
>     > On Mon, Aug 1, 2016 at 6:44 PM, Pierre-Yves David
>     > <pierre-yves.david@ens-lyon.org
>     <mailto:pierre-yves.david@ens-lyon.org>
>     <mailto:pierre-yves.david@ens-lyon.org
>     <mailto:pierre-yves.david@ens-lyon.org>>>
>     > wrote:
>     >
>     >     On 08/01/2016 08:20 PM, Gregory Szorc wrote:
>     >     > This is the first half of the series. The second half can be found at
>     >     > https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/26eeb1e66e88 if
>     >     > someone wants to review the whole thing.
>     >
>     >     I've made a first review pass on this series. As much as I like the idea
>     >     of cleaning this up, the proposed implementation makes me uncomfortable.
>     >     The main points that troubles me are:
>     >
>     >     * the extra new class is a bit of a mix of (1) outgoing object (2) cg1
>     >     bundler object. This makes a the boundary of these object more fuzzy
>     >
>     >
>     > Not sure what you mean here. An outgoing object is basically a fancy
>     > tuple holding the set of common and missing nodes.
>
>     My point here is that Outgoing is a class meant to represent this
>     information, across various user. We should aim at increasing it usage
>     and visibility instead of hiding it under custom API exposing part of
>     it. For example, the 'emitter.changesetcount' should probably just be
>     `emiter.outgoing` with outgoing exposed elsewhere (eg: pushoperation)
>     programmer will eventually just go "ho, an outgoing object, I know
>     that".
>
>
> I agree that discovery.outgoing is a decent type to represent changesets
> that will be "sent" to another location or will represent the contents
> of a changegroup. I think most of the code currently in
> changegroup.changegroupsubset() should move to the discovery module. I
> *don't* think discovery.outgoing instances should gain APIs to emit
> changegroups, however. Instead, outgoing instances are things to be
> passed to a changegroup generator instructing it how to behave.

We are on the same page here. the outgoing object is meant to be 
descriptive only. If might get more data to get more descriptive, but 
adding any generation to it should be avoided.

>     > I'm not sure what a "cg1 bundler" is. A non-bundle2 is just a v1 changegroup with a 6 byte
>     > header.
>     >
>     > If you are referring to returning a cg<N>unpacker instance, I'm not a
>     > huge fan of that. I only did that for parity with existing consumers
>     > because I didn't want to scope bloat to create a 40+ patch series.
>
>     My point is that we already have class emiting changegroup data. I'm
>     aware this area is not great. And I trust you in you will to clean up
>     all this. However, it is currently unclear to me why we cannot improve
>     them to fit your needs instead of introducing a new class.
>
>
> I wanted to establish a common class to represent "pending changegroup
> data" so that consumers all operated on a common type and could access
> changegroup data any way they saw fit *and* in an explicit manner.. I
> view this as simpler than requiring clients to hold onto a reference to
> sets of
> nodes (or an outgoing instance) along with a generator or chunkbuffer.
> The explicit part is important: the changegroup functions today have
> very opaque return values. e.g. sometimes a "cg" variable refers to a
> generator of chunks. Sometimes it is a chunkbuffer. By having a class
> where you call "getX()" it is much easier to understand what the code is
> doing.

My main objection here is about adding yet another type of object. The 
changegroup generation stack is already quite complicated with at least 
three thing I can think on the top of my head (packer, unpacker, 
bundler). I do not see the addition of a new one as an improvement 
(unless we manage to kill another one).

In the case you are mentionning, these iterators/chunkbuffer come from 
somewhere. We could probably improve the object that provide what their 
consumer needs.

>     > I do intend to do some follow-up work here because our handling of
>     > changegroup generation is just bonkers. For example, sending a bundle 2
>     > to the wire protocol with the current code involves:
>     >
>     > 1. Construct a cg<N>packer and call generate() to obtain a generator of
>     > changegroup chunks
>     > 2. Store a reference to the generator on a bundle2 part
>     > 3. Convert generator to a util.chunkbuffer (a file-like object) as part
>     > of bundlepart._payloadchunks() when serializing the bundle2 part
>     > 4. Emit a generator of 4k chunks because that's what the
>     > bundlepart.getchunks() does
>     > 5. Feed this generator of chunks into a util.chunkbuffer holding the
>     > entire bundle2 data (in exchange.getbundle())
>     > 6. Construct a zlib compressor
>     > 7. read(4096) from the util.chunkbuffer into the zlib compressor
>     > 8. Emit a generator of zlib compressed chunks to be sent to WSGI
>
>     To make sure I'm clear, I'm not challenging you diagnostic that there is
>     improvement to be made. In your example. there is multiple things
>     unrelated to the changegroup generation. I'm usually a proponent of
>     improving existing layers instead of introducing a new ones (does not
>     mean you cannot convince me that you approach here is right).
>
> I argue I improved the changegroup API layer by consolidating 8 highly
> redundant functions into a single new class that does everything they
> did and more :)


To me it sounded a bit like new class with 8 redundant methods and a 
task of existing object. I would like to be optimistic a believe we can 
consolidate the API without making more layers.

>     > util.chunkbuffer overhead has appeared in profiling results when cloning
>     > Firefox. The thing is that none of these util.chunkbuffer are really
>     > necessary for the wire protocol case: we could feed a generator of
>     > chunks of various sizes into zlib then emit whatever chunks it spits
>     > out, possibly normalizing them to equal sizes if we really wanted.
>
>     Yep, there is a lot of small improvement we could do. As you point out
>     the step (5) (6) could probably be easily simplified. But this seems
>     unrelated to the current rework of changegroup emition. Is it?
>
>
> It's difficult to do now because it isn't clear what each changegroup
> function does or when it is appropriate to call.

(reminder, I'm not challenging that the current API is confusing.)

>     > I think the end state of this series involves a "changegroup emitter"
>     > being fed into some kind of "bundler," preferably as a generator. Using
>     > the cg1unpacker for this purpose is kind of silly, as it necessitates a
>     > util.chunkbuffer. The good news is the "changegroupemitter" as it stands
>     > can easily be represented as both a generator of chunks or a file
>     > handle. So by introducing this object and requiring callers to use it,
>     > we can start transitioning callers to not use util.chunkbuffer without
>     > having to change APIs on changegroup.py. It's much more flexible.
>
>     This paragraph highlight what seems to be the most important thing to
>     address in this discussion. You seems to try to avoid touching the
>     existing changegroup API as much as possible and I do not understand
>     why. My general feeling is that it should be possible to upgrade the
>     existing class to fit your needs, but your seems to think otherwise. I'm
>     curious to find out why as you usually have pretty good reason for doing
>     things?
>
>
> I avoided changing the cg<N>packer and cg<N>unpacker classes because I
> didn't want to write a 40 part series :)

You are talking about 40 patches series as if it was something bad. I 
know you can do patch series that are x5 longer ;-)

> The cg<N>unpacker class in particular is a bit gnarly because it
> performs multiple tasks:
>
> 1) Exposes a file API from a possibly compressed input file object
> (read, seek, tell, close)
> 2) Defines the composition of the changegroup framing protocol
> 3) Converts a file object to a generator of changegroup chunks
> 4) Applies a changegroup to a repo
>
> This class is doing too much and needs to be split. To be honest, I'm
> not sure how. I do know refactoring it will be a long series :)

I think I'm starting to understant your issue here. I'll have to have 
closer look at what all these class are doing. (or we could discuss it 
if you have it all loaded in your head).

>     >     and I not convinced yet your cleanup could not happen with the
>     >     existing objects,
>     >
>     >
>     > The existing functions are horrible and not very useful. As this series
>     > demonstrates, we had something like 6 functions producing either:
>     >
>     > * a generator of changegroup chunks
>     > * a cg<N>unpacker instance
>     >
>     > from
>     >
>     > * an outgoing instance
>     > * a set of bases and heads
>     > * a set common nodes and heads
>     >
>     > sometimes with knobs to control the changegroup version.
>     >
>     > The API was unclear and far too low-level (e.g. sometimes you wanted to
>     > access the number of changesets or the nodes in the changegroup
>     > afterwards and the API hid these from you). To be honest, it wasn't
>     > clear to me that this was how things worked until I wrote the series.
>     > That's not a good sign.
>
>     (Reminder: I'm aware the current code could use improvement)
>
>     You new code seems to keep the 3 options for specifying the outgoing
>     set. So I assume you focussed on getting a simpler return type. Can you
>     elaborate on why do we need multiple return type in the first place?
>
>
> Flexibility for callers. Now they can do things like obtain multiple
> changegroup version representations from the same type. Perhaps you want
> to write a server extension that stashes both a v1 and v2 bundle from
> pushed data. It's now much simpler to do that.

I believe it would be better to clean up the current code before adding 
more advance feature to it. We might end not not needing this in core if 
the API is nice enough

Cheers,

Patch

diff --git a/mercurial/exchange.py b/mercurial/exchange.py
--- a/mercurial/exchange.py
+++ b/mercurial/exchange.py
@@ -1547,39 +1547,42 @@  def getbundle(repo, source, heads=None, 
              **kwargs)
 
     return util.chunkbuffer(bundler.getchunks())
 
 @getbundle2partsgenerator('changegroup')
 def _getbundlechangegrouppart(bundler, repo, source, bundlecaps=None,
                               b2caps=None, heads=None, common=None, **kwargs):
     """add a changegroup part to the requested bundle"""
-    cg = None
-    if kwargs.get('cg', True):
-        # build changegroup bundle here.
-        version = '01'
-        cgversions = b2caps.get('changegroup')
-        if cgversions:  # 3.1 and 3.2 ship with an empty value
-            cgversions = [v for v in cgversions
-                          if v in changegroup.supportedoutgoingversions(repo)]
-            if not cgversions:
-                raise ValueError(_('no common changegroup version'))
-            version = max(cgversions)
-        outgoing = changegroup.computeoutgoing(repo, heads, common)
-        cg = changegroup.getlocalchangegroupraw(repo, source, outgoing,
-                                                bundlecaps=bundlecaps,
-                                                version=version)
+    if not kwargs.get('cg', True):
+        return
 
-    if cg:
-        part = bundler.newpart('changegroup', data=cg)
-        if cgversions:
-            part.addparam('version', version)
-        part.addparam('nbchanges', str(len(outgoing.missing)), mandatory=False)
-        if 'treemanifest' in repo.requirements:
-            part.addparam('treemanifest', '1')
+    # build changegroup bundle here.
+    version = '01'
+    cgversions = b2caps.get('changegroup')
+    if cgversions:  # 3.1 and 3.2 ship with an empty value
+        cgversions = [v for v in cgversions
+                      if v in changegroup.supportedoutgoingversions(repo)]
+        if not cgversions:
+            raise ValueError(_('no common changegroup version'))
+        version = max(cgversions)
+    outgoing = changegroup.computeoutgoing(repo, heads, common)
+    cg = changegroup.getlocalchangegroupraw(repo, source, outgoing,
+                                            bundlecaps=bundlecaps,
+                                            version=version)
+
+    if not cg:
+        return
+
+    part = bundler.newpart('changegroup', data=cg)
+    if cgversions:
+        part.addparam('version', version)
+    part.addparam('nbchanges', str(len(outgoing.missing)), mandatory=False)
+    if 'treemanifest' in repo.requirements:
+        part.addparam('treemanifest', '1')
 
 @getbundle2partsgenerator('listkeys')
 def _getbundlelistkeysparts(bundler, repo, source, bundlecaps=None,
                             b2caps=None, **kwargs):
     """add parts containing listkeys namespaces to the requested bundle"""
     listkeys = kwargs.get('listkeys', ())
     for namespace in listkeys:
         part = bundler.newpart('listkeys')