| Submitter | Gregory Szorc |
| --- | --- |
| Date | Aug. 1, 2016, 6:18 p.m. |
| Message ID | <18275f2b6d2b8fcb50ec.1470075498@ubuntu-vm-main> |
| Download | mbox, patch |
| Permalink | /patch/16015/ |
| State | Changes Requested |
| Delegated to | Pierre-Yves David |
Comments
This is the first half of the series. The second half can be found at https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/26eeb1e66e88 if someone wants to review the whole thing.

Word of warning: if you pull that repo, you'll get tons of obsolescence markers. You may want to disable obsolescence when pulling.

On Mon, Aug 1, 2016 at 11:18 AM, Gregory Szorc <gregory.szorc@gmail.com> wrote:

> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1469900681 25200
> # Sat Jul 30 10:44:41 2016 -0700
> # Node ID 18275f2b6d2b8fcb50ecaf331011e3d233ca1cfa
> # Parent 73ff159923c1f05899c27238409ca398342d9ae0
> exchange: use early return
>
> Avoid excessive indentation and confusing control flow by returning
> early if there is no work to be done.
>
> diff --git a/mercurial/exchange.py b/mercurial/exchange.py
> --- a/mercurial/exchange.py
> +++ b/mercurial/exchange.py
> @@ -1547,39 +1547,42 @@ def getbundle(repo, source, heads=None,
>                                       **kwargs)
>
>      return util.chunkbuffer(bundler.getchunks())
>
>  @getbundle2partsgenerator('changegroup')
>  def _getbundlechangegrouppart(bundler, repo, source, bundlecaps=None,
>                                b2caps=None, heads=None, common=None,
>                                **kwargs):
>      """add a changegroup part to the requested bundle"""
> -    cg = None
> -    if kwargs.get('cg', True):
> -        # build changegroup bundle here.
> -        version = '01'
> -        cgversions = b2caps.get('changegroup')
> -        if cgversions:  # 3.1 and 3.2 ship with an empty value
> -            cgversions = [v for v in cgversions
> -                          if v in changegroup.supportedoutgoingversions(repo)]
> -            if not cgversions:
> -                raise ValueError(_('no common changegroup version'))
> -            version = max(cgversions)
> -        outgoing = changegroup.computeoutgoing(repo, heads, common)
> -        cg = changegroup.getlocalchangegroupraw(repo, source, outgoing,
> -                                                bundlecaps=bundlecaps,
> -                                                version=version)
> +    if not kwargs.get('cg', True):
> +        return
>
> -    if cg:
> -        part = bundler.newpart('changegroup', data=cg)
> -        if cgversions:
> -            part.addparam('version', version)
> -        part.addparam('nbchanges', str(len(outgoing.missing)), mandatory=False)
> -        if 'treemanifest' in repo.requirements:
> -            part.addparam('treemanifest', '1')
> +    # build changegroup bundle here.
> +    version = '01'
> +    cgversions = b2caps.get('changegroup')
> +    if cgversions:  # 3.1 and 3.2 ship with an empty value
> +        cgversions = [v for v in cgversions
> +                      if v in changegroup.supportedoutgoingversions(repo)]
> +        if not cgversions:
> +            raise ValueError(_('no common changegroup version'))
> +        version = max(cgversions)
> +    outgoing = changegroup.computeoutgoing(repo, heads, common)
> +    cg = changegroup.getlocalchangegroupraw(repo, source, outgoing,
> +                                            bundlecaps=bundlecaps,
> +                                            version=version)
> +
> +    if not cg:
> +        return
> +
> +    part = bundler.newpart('changegroup', data=cg)
> +    if cgversions:
> +        part.addparam('version', version)
> +    part.addparam('nbchanges', str(len(outgoing.missing)), mandatory=False)
> +    if 'treemanifest' in repo.requirements:
> +        part.addparam('treemanifest', '1')
>
>  @getbundle2partsgenerator('listkeys')
>  def _getbundlelistkeysparts(bundler, repo, source, bundlecaps=None,
>                              b2caps=None, **kwargs):
>      """add parts containing listkeys namespaces to the requested bundle"""
>      listkeys = kwargs.get('listkeys', ())
>      for namespace in listkeys:
>          part = bundler.newpart('listkeys')
On 08/01/2016 08:20 PM, Gregory Szorc wrote:

> This is the first half of the series. The second half can be found at https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/26eeb1e66e88 if someone wants to review the whole thing.

I've made a first review pass on this series. As much as I like the idea of cleaning this up, the proposed implementation makes me uncomfortable. The main points that trouble me are:

* the extra new class is a bit of a mix of (1) an outgoing object and (2) a cg1 bundler object. This makes the boundary of these objects fuzzier, and I am not convinced yet that your cleanup could not happen with the existing objects,
* many @classmethod usages, while our coding style usually avoids them along with other "class as namespace" usage (per Matt's taste),
* the superlongnameforeverything.thatcombinewithclassnamspacing.foroverninethousand char lines; we could probably be more compact by using abbreviations for common parts, as we do elsewhere in the code.

Let me do a full scan of the second part of the series tomorrow, and then I'll either understand the purity of your way, or find some super precise feedback, or we might discuss it over voice.

> Word of warning: if you pull that repo, you'll get tons of obsolescence markers. You may want to disable obsolescence when pulling.

These are legit markers that do not exist in the repository because we use email and have somewhat managed not to pull from you in the past. We should be okay pushing them to the main server (the marker/changeset ratio is a bit high but fine).

Cheers,
On Mon, Aug 1, 2016 at 6:44 PM, Pierre-Yves David <pierre-yves.david@ens-lyon.org> wrote:

> On 08/01/2016 08:20 PM, Gregory Szorc wrote:
> > This is the first half of the series. The second half can be found at https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/26eeb1e66e88 if someone wants to review the whole thing.
>
> I've made a first review pass on this series. As much as I like the idea of cleaning this up, the proposed implementation makes me uncomfortable. The main points that trouble me are:
>
> * the extra new class is a bit of a mix of (1) an outgoing object and (2) a cg1 bundler object. This makes the boundary of these objects fuzzier

Not sure what you mean here. An outgoing object is basically a fancy tuple holding the set of common and missing nodes.

I'm not sure what a "cg1 bundler" is. A non-bundle2 bundle is just a v1 changegroup with a 6 byte header.

If you are referring to returning a cg<N>unpacker instance, I'm not a huge fan of that. I only did that for parity with existing consumers because I didn't want to scope bloat to create a 40+ patch series.

I do intend to do some follow-up work here because our handling of changegroup generation is just bonkers. For example, sending a bundle 2 to the wire protocol with the current code involves:

1. Construct a cg<N>packer and call generate() to obtain a generator of changegroup chunks
2. Store a reference to the generator on a bundle2 part
3. Convert the generator to a util.chunkbuffer (a file-like object) as part of bundlepart._payloadchunks() when serializing the bundle2 part
4. Emit a generator of 4k chunks because that's what bundlepart.getchunks() does
5. Feed this generator of chunks into a util.chunkbuffer holding the entire bundle2 data (in exchange.getbundle())
6. Construct a zlib compressor
7. read(4096) from the util.chunkbuffer into the zlib compressor
8. Emit a generator of zlib compressed chunks to be sent to WSGI

util.chunkbuffer overhead has appeared in profiling results when cloning Firefox. The thing is that none of these util.chunkbuffer are really necessary for the wire protocol case: we could feed a generator of chunks of various sizes into zlib then emit whatever chunks it spits out, possibly normalizing them to equal sizes if we really wanted.

I think the end state of this series involves a "changegroup emitter" being fed into some kind of "bundler," preferably as a generator. Using the cg1unpacker for this purpose is kind of silly, as it necessitates a util.chunkbuffer. The good news is the "changegroupemitter" as it stands can easily be represented as both a generator of chunks or a file handle. So by introducing this object and requiring callers to use it, we can start transitioning callers to not use util.chunkbuffer without having to change APIs on changegroup.py. It's much more flexible.

> and I am not convinced yet that your cleanup could not happen with the existing objects,

The existing functions are horrible and not very useful. As this series demonstrates, we had something like 6 functions producing either:

* a generator of changegroup chunks
* a cg<N>unpacker instance

from

* an outgoing instance
* a set of bases and heads
* a set of common nodes and heads

sometimes with knobs to control the changegroup version.

The API was unclear and far too low-level (e.g. sometimes you wanted to access the number of changesets or the nodes in the changegroup afterwards and the API hid these from you). To be honest, it wasn't clear to me that this was how things worked until I wrote the series.
That's not a good sign.

The new API makes things simpler by allowing you to construct a stateful object representing a generated changegroup and then get a changegroup representation in various ways. It is much more explicit.

> * many @classmethod usages, while our coding style usually avoids them along with other "class as namespace" usage (per Matt's taste),

That's fine. These can be pulled out into module-level functions. e.g. emitterfromrootsandheads(repo, roots, heads) -> changegroupemitter

> * the superlongnameforeverything.thatcombinewithclassnamspacing.foroverninethousand char lines; we could probably be more compact by using abbreviations for common parts, as we do elsewhere in the code.

I agree the names are a bit long. But I'm not a fan of abbreviations because they make readability hard, especially for those who don't speak English natively.

I think renaming "changegroupemitter" to "emitter" (because "changegroup" is redundant with the module name) and moving the @classmethods to module functions will make things much better. e.g.

emitter = changegroup.changegroupemitter.fromheadsandcommon(repo, heads, common)

emitter = changegroup.emitterfromheadsandcommon(repo, heads, common)

Or if we wanted to write a complex __init__ function, we could have it accept all arguments. But I don't like complexity and prefer separate "constructor" functions.

> Let me do a full scan of the second part of the series tomorrow, and then I'll either understand the purity of your way, or find some super precise feedback, or we might discuss it over voice.
>
> > Word of warning: if you pull that repo, you'll get tons of obsolescence markers. You may want to disable obsolescence when pulling.
>
> These are legit markers that do not exist in the repository because we use email and have somewhat managed not to pull from you in the past. We should be okay pushing them to the main server (the marker/changeset ratio is a bit high but fine).
>
> Cheers,
>
> --
> Pierre-Yves David
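The module-level constructor functions sketched in the message above might look roughly like this. This is a minimal, hypothetical sketch, not code from the series: only changegroup.computeoutgoing() and outgoing.missing come from the patch shown in this thread; the emitter class itself and the constructor names are assumptions for illustration.

```python
# Minimal sketch only: the `emitter` class and these constructor names are
# hypothetical; changegroup.computeoutgoing() and outgoing.missing come from
# the patch shown earlier in this thread.
from mercurial import changegroup

class emitter(object):
    """Stateful holder for a pending changegroup (illustrative only)."""

    def __init__(self, repo, outgoing):
        self._repo = repo
        # Keep the descriptive discovery.outgoing object visible to callers.
        self.outgoing = outgoing

    @property
    def changesetcount(self):
        return len(self.outgoing.missing)

def emitterfromoutgoing(repo, outgoing):
    """Module-level constructor taking an existing outgoing instance."""
    return emitter(repo, outgoing)

def emitterfromheadsandcommon(repo, heads, common):
    """Module-level constructor normalizing heads/common into an outgoing."""
    outgoing = changegroup.computeoutgoing(repo, heads, common)
    return emitterfromoutgoing(repo, outgoing)
```

A caller would then write `emitter = emitterfromheadsandcommon(repo, heads, common)`, matching the second spelling proposed above.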
On 08/02/2016 04:50 AM, Gregory Szorc wrote:

> On Mon, Aug 1, 2016 at 6:44 PM, Pierre-Yves David <pierre-yves.david@ens-lyon.org> wrote:
> > On 08/01/2016 08:20 PM, Gregory Szorc wrote:
> > > This is the first half of the series. The second half can be found at https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/26eeb1e66e88 if someone wants to review the whole thing.
> >
> > I've made a first review pass on this series. As much as I like the idea of cleaning this up, the proposed implementation makes me uncomfortable. The main points that trouble me are:
> >
> > * the extra new class is a bit of a mix of (1) an outgoing object and (2) a cg1 bundler object. This makes the boundary of these objects fuzzier
>
> Not sure what you mean here. An outgoing object is basically a fancy tuple holding the set of common and missing nodes.

My point here is that outgoing is a class meant to represent this information across various users. We should aim at increasing its usage and visibility instead of hiding it under a custom API exposing part of it. For example, 'emitter.changesetcount' should probably just be `emitter.outgoing`; with outgoing exposed elsewhere (e.g. pushoperation), programmers will eventually just go "oh, an outgoing object, I know that".

> I'm not sure what a "cg1 bundler" is. A non-bundle2 bundle is just a v1 changegroup with a 6 byte header.
>
> If you are referring to returning a cg<N>unpacker instance, I'm not a huge fan of that. I only did that for parity with existing consumers because I didn't want to scope bloat to create a 40+ patch series.

My point is that we already have classes emitting changegroup data. I'm aware this area is not great, and I trust your will to clean all this up. However, it is currently unclear to me why we cannot improve them to fit your needs instead of introducing a new class.

> I do intend to do some follow-up work here because our handling of changegroup generation is just bonkers. For example, sending a bundle 2 to the wire protocol with the current code involves:
>
> 1. Construct a cg<N>packer and call generate() to obtain a generator of changegroup chunks
> 2. Store a reference to the generator on a bundle2 part
> 3. Convert the generator to a util.chunkbuffer (a file-like object) as part of bundlepart._payloadchunks() when serializing the bundle2 part
> 4. Emit a generator of 4k chunks because that's what bundlepart.getchunks() does
> 5. Feed this generator of chunks into a util.chunkbuffer holding the entire bundle2 data (in exchange.getbundle())
> 6. Construct a zlib compressor
> 7. read(4096) from the util.chunkbuffer into the zlib compressor
> 8. Emit a generator of zlib compressed chunks to be sent to WSGI

To make sure I'm clear, I'm not challenging your diagnosis that there is improvement to be made. In your example, there are multiple things unrelated to the changegroup generation. I'm usually a proponent of improving existing layers instead of introducing new ones (which does not mean you cannot convince me that your approach here is right).

> util.chunkbuffer overhead has appeared in profiling results when cloning Firefox. The thing is that none of these util.chunkbuffer are really necessary for the wire protocol case: we could feed a generator of chunks of various sizes into zlib then emit whatever chunks it spits out, possibly normalizing them to equal sizes if we really wanted.

Yep, there are a lot of small improvements we could do.
As you point out, steps (5) and (6) could probably be easily simplified. But this seems unrelated to the current rework of changegroup emission. Is it?

> I think the end state of this series involves a "changegroup emitter" being fed into some kind of "bundler," preferably as a generator. Using the cg1unpacker for this purpose is kind of silly, as it necessitates a util.chunkbuffer. The good news is the "changegroupemitter" as it stands can easily be represented as both a generator of chunks or a file handle. So by introducing this object and requiring callers to use it, we can start transitioning callers to not use util.chunkbuffer without having to change APIs on changegroup.py. It's much more flexible.

This paragraph highlights what seems to be the most important thing to address in this discussion. You seem to be trying to avoid touching the existing changegroup API as much as possible, and I do not understand why. My general feeling is that it should be possible to upgrade the existing classes to fit your needs, but you seem to think otherwise. I'm curious to find out why, as you usually have pretty good reasons for doing things.

> > and I am not convinced yet that your cleanup could not happen with the existing objects,
>
> The existing functions are horrible and not very useful. As this series demonstrates, we had something like 6 functions producing either:
>
> * a generator of changegroup chunks
> * a cg<N>unpacker instance
>
> from
>
> * an outgoing instance
> * a set of bases and heads
> * a set of common nodes and heads
>
> sometimes with knobs to control the changegroup version.
>
> The API was unclear and far too low-level (e.g. sometimes you wanted to access the number of changesets or the nodes in the changegroup afterwards and the API hid these from you). To be honest, it wasn't clear to me that this was how things worked until I wrote the series. That's not a good sign.

(Reminder: I'm aware the current code could use improvement.)

Your new code seems to keep the 3 options for specifying the outgoing set, so I assume you focused on getting a simpler return type. Can you elaborate on why we need multiple return types in the first place?

Also, could we simplify the other side of this? For example, we could just keep the outgoing instance option to keep this layer simpler.

> The new API makes things simpler by allowing you to construct a stateful object representing a generated changegroup and then get a changegroup representation in various ways. It is much more explicit.

"Stateful" seems to be an important key to lifting my confusion. Can you elaborate?

> > * many @classmethod usages, while our coding style usually avoids them along with other "class as namespace" usage (per Matt's taste),
>
> That's fine. These can be pulled out into module-level functions. e.g. emitterfromrootsandheads(repo, roots, heads) -> changegroupemitter
>
> > * the superlongnameforeverything.thatcombinewithclassnamspacing.foroverninethousand char lines; we could probably be more compact by using abbreviations for common parts, as we do elsewhere in the code.
>
> I agree the names are a bit long. But I'm not a fan of abbreviations because they make readability hard, especially for those who don't speak English natively.

We use common abbreviations in the Mercurial codebase in many places: rev, ctx, cg, obs, etc. I'm a non-native speaker and code well with that. We also have a wiki page with some details on those. We could update it and point people to it more:
https://www.mercurial-scm.org/wiki/CodingStyle#Naming_conventions

> I think renaming "changegroupemitter" to "emitter" (because "changegroup" is redundant with the module name) and moving the @classmethods to module functions will make things much better. e.g.
>
> emitter = changegroup.changegroupemitter.fromheadsandcommon(repo, heads, common)
>
> emitter = changegroup.emitterfromheadsandcommon(repo, heads, common)

I agree, that would be a good step forward.

> Or if we wanted to write a complex __init__ function, we could have it accept all arguments. But I don't like complexity and prefer separate "constructor" functions.

It seems like we have 4 possible arguments: outgoing, common, base, heads. The complexity seems manageable. We could maybe have a factory function building the outgoing that way and just delegate to it in the constructor.

Cheers,
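For contrast with the module-level constructors above, the single-__init__ variant suggested here (accept outgoing, common, base, or heads and delegate normalization to one small factory helper) could look roughly like the following hypothetical sketch. Only changegroup.computeoutgoing() comes from the patch in this thread; the helper and class names are assumptions, and the roots/heads form is deliberately left unimplemented.

```python
# Sketch of a constructor that normalizes its inputs through one factory
# helper. Only changegroup.computeoutgoing() is taken from the patch above;
# everything else is hypothetical illustration of the design being discussed.
from mercurial import changegroup

def _resolveoutgoing(repo, outgoing=None, common=None, base=None, heads=None):
    """Turn the various ways of specifying the outgoing set into one object."""
    if outgoing is not None:
        return outgoing
    if base is not None:
        # The roots/heads form would reuse the discovery logic currently
        # living in changegroup.changegroupsubset(); omitted to keep the
        # sketch short.
        raise NotImplementedError('roots/heads form not sketched here')
    # The common/heads form, as used by the getbundle code path above.
    return changegroup.computeoutgoing(repo, heads, common)

class emitter(object):
    def __init__(self, repo, outgoing=None, common=None, base=None,
                 heads=None):
        self._repo = repo
        self.outgoing = _resolveoutgoing(repo, outgoing, common, base, heads)
```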
On Wed, Aug 3, 2016 at 1:56 AM, Pierre-Yves David <pierre-yves.david@ens-lyon.org> wrote:

> On 08/02/2016 04:50 AM, Gregory Szorc wrote:
> > On Mon, Aug 1, 2016 at 6:44 PM, Pierre-Yves David <pierre-yves.david@ens-lyon.org> wrote:
> > > On 08/01/2016 08:20 PM, Gregory Szorc wrote:
> > > > This is the first half of the series. The second half can be found at https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/26eeb1e66e88 if someone wants to review the whole thing.
> > >
> > > I've made a first review pass on this series. As much as I like the idea of cleaning this up, the proposed implementation makes me uncomfortable. The main points that trouble me are:
> > >
> > > * the extra new class is a bit of a mix of (1) an outgoing object and (2) a cg1 bundler object. This makes the boundary of these objects fuzzier
> >
> > Not sure what you mean here. An outgoing object is basically a fancy tuple holding the set of common and missing nodes.
>
> My point here is that outgoing is a class meant to represent this information across various users. We should aim at increasing its usage and visibility instead of hiding it under a custom API exposing part of it. For example, 'emitter.changesetcount' should probably just be `emitter.outgoing`; with outgoing exposed elsewhere (e.g. pushoperation), programmers will eventually just go "oh, an outgoing object, I know that".

I agree that discovery.outgoing is a decent type to represent changesets that will be "sent" to another location or will represent the contents of a changegroup. I think most of the code currently in changegroup.changegroupsubset() should move to the discovery module. I *don't* think discovery.outgoing instances should gain APIs to emit changegroups, however. Instead, outgoing instances are things to be passed to a changegroup generator instructing it how to behave.

> > I'm not sure what a "cg1 bundler" is. A non-bundle2 bundle is just a v1 changegroup with a 6 byte header.
> >
> > If you are referring to returning a cg<N>unpacker instance, I'm not a huge fan of that. I only did that for parity with existing consumers because I didn't want to scope bloat to create a 40+ patch series.
>
> My point is that we already have classes emitting changegroup data. I'm aware this area is not great, and I trust your will to clean all this up. However, it is currently unclear to me why we cannot improve them to fit your needs instead of introducing a new class.

I wanted to establish a common class to represent "pending changegroup data" so that consumers all operate on a common type and can access changegroup data any way they see fit *and* in an explicit manner. I view this as simpler than requiring clients to hold onto a reference to sets of nodes (or an outgoing instance) along with a generator or chunkbuffer. The explicit part is important: the changegroup functions today have very opaque return values. e.g. sometimes a "cg" variable refers to a generator of chunks. Sometimes it is a chunkbuffer. By having a class where you call "getX()" it is much easier to understand what the code is doing.

> > I do intend to do some follow-up work here because our handling of changegroup generation is just bonkers. For example, sending a bundle 2 to the wire protocol with the current code involves:
> >
> > 1. Construct a cg<N>packer and call generate() to obtain a generator of changegroup chunks
> > 2. Store a reference to the generator on a bundle2 part
> > 3. Convert the generator to a util.chunkbuffer (a file-like object) as part of bundlepart._payloadchunks() when serializing the bundle2 part
> > 4. Emit a generator of 4k chunks because that's what bundlepart.getchunks() does
> > 5. Feed this generator of chunks into a util.chunkbuffer holding the entire bundle2 data (in exchange.getbundle())
> > 6. Construct a zlib compressor
> > 7. read(4096) from the util.chunkbuffer into the zlib compressor
> > 8. Emit a generator of zlib compressed chunks to be sent to WSGI
>
> To make sure I'm clear, I'm not challenging your diagnosis that there is improvement to be made. In your example, there are multiple things unrelated to the changegroup generation. I'm usually a proponent of improving existing layers instead of introducing new ones (which does not mean you cannot convince me that your approach here is right).

I argue I improved the changegroup API layer by consolidating 8 highly redundant functions into a single new class that does everything they did and more :)

> > util.chunkbuffer overhead has appeared in profiling results when cloning Firefox. The thing is that none of these util.chunkbuffer are really necessary for the wire protocol case: we could feed a generator of chunks of various sizes into zlib then emit whatever chunks it spits out, possibly normalizing them to equal sizes if we really wanted.
>
> Yep, there are a lot of small improvements we could do. As you point out, steps (5) and (6) could probably be easily simplified. But this seems unrelated to the current rework of changegroup emission. Is it?

It's difficult to do now because it isn't clear what each changegroup function does or when it is appropriate to call.

> > I think the end state of this series involves a "changegroup emitter" being fed into some kind of "bundler," preferably as a generator. Using the cg1unpacker for this purpose is kind of silly, as it necessitates a util.chunkbuffer. The good news is the "changegroupemitter" as it stands can easily be represented as both a generator of chunks or a file handle. So by introducing this object and requiring callers to use it, we can start transitioning callers to not use util.chunkbuffer without having to change APIs on changegroup.py. It's much more flexible.
>
> This paragraph highlights what seems to be the most important thing to address in this discussion. You seem to be trying to avoid touching the existing changegroup API as much as possible, and I do not understand why. My general feeling is that it should be possible to upgrade the existing classes to fit your needs, but you seem to think otherwise. I'm curious to find out why, as you usually have pretty good reasons for doing things.

I avoided changing the cg<N>packer and cg<N>unpacker classes because I didn't want to write a 40 part series :)

The cg<N>unpacker class in particular is a bit gnarly because it performs multiple tasks:

1) Exposes a file API from a possibly compressed input file object (read, seek, tell, close)
2) Defines the composition of the changegroup framing protocol
3) Converts a file object to a generator of changegroup chunks
4) Applies a changegroup to a repo

This class is doing too much and needs to be split. To be honest, I'm not sure how.
I do know refactoring it will be a long series :)

> > > and I am not convinced yet that your cleanup could not happen with the existing objects,
> >
> > The existing functions are horrible and not very useful. As this series demonstrates, we had something like 6 functions producing either:
> >
> > * a generator of changegroup chunks
> > * a cg<N>unpacker instance
> >
> > from
> >
> > * an outgoing instance
> > * a set of bases and heads
> > * a set of common nodes and heads
> >
> > sometimes with knobs to control the changegroup version.
> >
> > The API was unclear and far too low-level (e.g. sometimes you wanted to access the number of changesets or the nodes in the changegroup afterwards and the API hid these from you). To be honest, it wasn't clear to me that this was how things worked until I wrote the series. That's not a good sign.
>
> (Reminder: I'm aware the current code could use improvement.)
>
> Your new code seems to keep the 3 options for specifying the outgoing set, so I assume you focused on getting a simpler return type. Can you elaborate on why we need multiple return types in the first place?

Flexibility for callers. Now they can do things like obtain multiple changegroup version representations from the same type. Perhaps you want to write a server extension that stashes both a v1 and v2 bundle from pushed data. It's now much simpler to do that.

> Also, could we simplify the other side of this? For example, we could just keep the outgoing instance option to keep this layer simpler.
>
> > The new API makes things simpler by allowing you to construct a stateful object representing a generated changegroup and then get a changegroup representation in various ways. It is much more explicit.
>
> "Stateful" seems to be an important key to lifting my confusion. Can you elaborate?

Basically an entity that lets you obtain a representation of a changegroup N ways without having to muck about with low-level changegroup.py functions.

> > > * many @classmethod usages, while our coding style usually avoids them along with other "class as namespace" usage (per Matt's taste),
> >
> > That's fine. These can be pulled out into module-level functions. e.g. emitterfromrootsandheads(repo, roots, heads) -> changegroupemitter
> >
> > > * the superlongnameforeverything.thatcombinewithclassnamspacing.foroverninethousand char lines; we could probably be more compact by using abbreviations for common parts, as we do elsewhere in the code.
> >
> > I agree the names are a bit long. But I'm not a fan of abbreviations because they make readability hard, especially for those who don't speak English natively.
>
> We use common abbreviations in the Mercurial codebase in many places: rev, ctx, cg, obs, etc. I'm a non-native speaker and code well with that. We also have a wiki page with some details on those. We could update it and point people to it more:
>
> https://www.mercurial-scm.org/wiki/CodingStyle#Naming_conventions
>
> > I think renaming "changegroupemitter" to "emitter" (because "changegroup" is redundant with the module name) and moving the @classmethods to module functions will make things much better. e.g.
> > emitter = changegroup.changegroupemitter.fromheadsandcommon(repo, heads, common)
> >
> > emitter = changegroup.emitterfromheadsandcommon(repo, heads, common)
>
> I agree, that would be a good step forward.
>
> > Or if we wanted to write a complex __init__ function, we could have it accept all arguments. But I don't like complexity and prefer separate "constructor" functions.
>
> It seems like we have 4 possible arguments: outgoing, common, base, heads. The complexity seems manageable. We could maybe have a factory function building the outgoing that way and just delegate to it in the constructor.
>
> Cheers,
>
> --
> Pierre-Yves David
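The "stash both a v1 and v2 bundle" scenario mentioned above could be written against today's low-level functions roughly as follows. changegroup.computeoutgoing() and changegroup.getlocalchangegroupraw() appear in the patch in this thread; the stash file naming and the 'serve' source string are illustrative assumptions, not anything from the series.

```python
# Rough sketch: generate raw changegroup data for both supported versions of
# the same outgoing set. The changegroup functions used here appear in the
# patch above; file naming and the source string are assumptions.
from mercurial import changegroup

def stashbothversions(repo, heads, common, source='serve'):
    outgoing = changegroup.computeoutgoing(repo, heads, common)
    for version in ('01', '02'):
        cg = changegroup.getlocalchangegroupraw(repo, source, outgoing,
                                                version=version)
        if cg is None:
            # Nothing outgoing for this request.
            continue
        with open('stash-cg%s' % version, 'wb') as fh:
            for chunk in cg:
                fh.write(chunk)
```

The point being debated is that a single stateful emitter object would hide this kind of juggling behind one type instead of requiring callers to know which low-level function to reach for.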
On Aug 4, 2016, at 06:58, Kevin Bullock <kbullock+mercurial@ringworld.org> wrote:

>> On Aug 3, 2016, at 22:53, Gregory Szorc <gregory.szorc@gmail.com> wrote:
>>
>>>> On Wed, Aug 3, 2016 at 1:56 AM, Pierre-Yves David <pierre-yves.david@ens-lyon.org> wrote:
>>>> On 08/02/2016 04:50 AM, Gregory Szorc wrote:
>>>>
>>>> I think the end state of this series involves a "changegroup emitter" being fed into some kind of "bundler," preferably as a generator. Using the cg1unpacker for this purpose is kind of silly, as it necessitates a util.chunkbuffer. The good news is the "changegroupemitter" as it stands can easily be represented as both a generator of chunks or a file handle. So by introducing this object and requiring callers to use it, we can start transitioning callers to not use util.chunkbuffer without having to change APIs on changegroup.py. It's much more flexible.
>>>
>>> This paragraph highlights what seems to be the most important thing to address in this discussion. You seem to be trying to avoid touching the existing changegroup API as much as possible, and I do not understand why. My general feeling is that it should be possible to upgrade the existing classes to fit your needs, but you seem to think otherwise. I'm curious to find out why, as you usually have pretty good reasons for doing things.
>>
>> I avoided changing the cg<N>packer and cg<N>unpacker classes because I didn't want to write a 40 part series :)
>>
>> The cg<N>unpacker class in particular is a bit gnarly because it performs multiple tasks:
>>
>> 1) Exposes a file API from a possibly compressed input file object (read, seek, tell, close)
>> 2) Defines the composition of the changegroup framing protocol
>> 3) Converts a file object to a generator of changegroup chunks
>> 4) Applies a changegroup to a repo
>>
>> This class is doing too much and needs to be split. To be honest, I'm not sure how. I do know refactoring it will be a long series :)
>
> Sounds like we should come up with a strategy for this at the sprint, if it's not resolved by then. I've added it to the wiki page.

I want to work this out prior, as it's related to some performance improvements I'd like to get into the next release. I think Pierre-Yves and I can find time to agree before October :)
On 08/04/2016 05:53 AM, Gregory Szorc wrote:

> On Wed, Aug 3, 2016 at 1:56 AM, Pierre-Yves David <pierre-yves.david@ens-lyon.org> wrote:
> > On 08/02/2016 04:50 AM, Gregory Szorc wrote:
> > > On Mon, Aug 1, 2016 at 6:44 PM, Pierre-Yves David <pierre-yves.david@ens-lyon.org> wrote:
> > > > On 08/01/2016 08:20 PM, Gregory Szorc wrote:
> > > > > This is the first half of the series. The second half can be found at https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/26eeb1e66e88 if someone wants to review the whole thing.
> > > >
> > > > I've made a first review pass on this series. As much as I like the idea of cleaning this up, the proposed implementation makes me uncomfortable. The main points that trouble me are:
> > > >
> > > > * the extra new class is a bit of a mix of (1) an outgoing object and (2) a cg1 bundler object. This makes the boundary of these objects fuzzier
> > >
> > > Not sure what you mean here. An outgoing object is basically a fancy tuple holding the set of common and missing nodes.
> >
> > My point here is that outgoing is a class meant to represent this information across various users. We should aim at increasing its usage and visibility instead of hiding it under a custom API exposing part of it. For example, 'emitter.changesetcount' should probably just be `emitter.outgoing`; with outgoing exposed elsewhere (e.g. pushoperation), programmers will eventually just go "oh, an outgoing object, I know that".
>
> I agree that discovery.outgoing is a decent type to represent changesets that will be "sent" to another location or will represent the contents of a changegroup. I think most of the code currently in changegroup.changegroupsubset() should move to the discovery module. I *don't* think discovery.outgoing instances should gain APIs to emit changegroups, however. Instead, outgoing instances are things to be passed to a changegroup generator instructing it how to behave.

We are on the same page here. The outgoing object is meant to be descriptive only. It might get more data to become more descriptive, but adding any generation to it should be avoided.

> > > I'm not sure what a "cg1 bundler" is. A non-bundle2 bundle is just a v1 changegroup with a 6 byte header.
> > >
> > > If you are referring to returning a cg<N>unpacker instance, I'm not a huge fan of that. I only did that for parity with existing consumers because I didn't want to scope bloat to create a 40+ patch series.
> >
> > My point is that we already have classes emitting changegroup data. I'm aware this area is not great, and I trust your will to clean all this up. However, it is currently unclear to me why we cannot improve them to fit your needs instead of introducing a new class.
>
> I wanted to establish a common class to represent "pending changegroup data" so that consumers all operate on a common type and can access changegroup data any way they see fit *and* in an explicit manner. I view this as simpler than requiring clients to hold onto a reference to sets of nodes (or an outgoing instance) along with a generator or chunkbuffer. The explicit part is important: the changegroup functions today have very opaque return values. e.g. sometimes a "cg" variable refers to a generator of chunks. Sometimes it is a chunkbuffer.
> By having a class where you call "getX()" it is much easier to understand what the code is doing.

My main objection here is about adding yet another type of object. The changegroup generation stack is already quite complicated, with at least three things I can think of off the top of my head (packer, unpacker, bundler). I do not see the addition of a new one as an improvement (unless we manage to kill another one). In the case you are mentioning, these iterators/chunkbuffers come from somewhere. We could probably improve the objects that provide what their consumers need.

> > > I do intend to do some follow-up work here because our handling of changegroup generation is just bonkers. For example, sending a bundle 2 to the wire protocol with the current code involves:
> > >
> > > 1. Construct a cg<N>packer and call generate() to obtain a generator of changegroup chunks
> > > 2. Store a reference to the generator on a bundle2 part
> > > 3. Convert the generator to a util.chunkbuffer (a file-like object) as part of bundlepart._payloadchunks() when serializing the bundle2 part
> > > 4. Emit a generator of 4k chunks because that's what bundlepart.getchunks() does
> > > 5. Feed this generator of chunks into a util.chunkbuffer holding the entire bundle2 data (in exchange.getbundle())
> > > 6. Construct a zlib compressor
> > > 7. read(4096) from the util.chunkbuffer into the zlib compressor
> > > 8. Emit a generator of zlib compressed chunks to be sent to WSGI
> >
> > To make sure I'm clear, I'm not challenging your diagnosis that there is improvement to be made. In your example, there are multiple things unrelated to the changegroup generation. I'm usually a proponent of improving existing layers instead of introducing new ones (which does not mean you cannot convince me that your approach here is right).
>
> I argue I improved the changegroup API layer by consolidating 8 highly redundant functions into a single new class that does everything they did and more :)

To me it sounded a bit like a new class with 8 redundant methods, taking over a task of an existing object. I would like to be optimistic and believe we can consolidate the API without making more layers.

> > > util.chunkbuffer overhead has appeared in profiling results when cloning Firefox. The thing is that none of these util.chunkbuffer are really necessary for the wire protocol case: we could feed a generator of chunks of various sizes into zlib then emit whatever chunks it spits out, possibly normalizing them to equal sizes if we really wanted.
> >
> > Yep, there are a lot of small improvements we could do. As you point out, steps (5) and (6) could probably be easily simplified. But this seems unrelated to the current rework of changegroup emission. Is it?
>
> It's difficult to do now because it isn't clear what each changegroup function does or when it is appropriate to call.

(Reminder: I'm not challenging that the current API is confusing.)

> > > I think the end state of this series involves a "changegroup emitter" being fed into some kind of "bundler," preferably as a generator. Using the cg1unpacker for this purpose is kind of silly, as it necessitates a util.chunkbuffer. The good news is the "changegroupemitter" as it stands can easily be represented as both a generator of chunks or a file handle. So by introducing this object and requiring callers to use it, we can start transitioning callers to not use util.chunkbuffer without having to change APIs on changegroup.py. It's much more flexible.
> > This paragraph highlights what seems to be the most important thing to address in this discussion. You seem to be trying to avoid touching the existing changegroup API as much as possible, and I do not understand why. My general feeling is that it should be possible to upgrade the existing classes to fit your needs, but you seem to think otherwise. I'm curious to find out why, as you usually have pretty good reasons for doing things.
>
> I avoided changing the cg<N>packer and cg<N>unpacker classes because I didn't want to write a 40 part series :)

You are talking about a 40 patch series as if it was something bad. I know you can do patch series that are 5x longer ;-)

> The cg<N>unpacker class in particular is a bit gnarly because it performs multiple tasks:
>
> 1) Exposes a file API from a possibly compressed input file object (read, seek, tell, close)
> 2) Defines the composition of the changegroup framing protocol
> 3) Converts a file object to a generator of changegroup chunks
> 4) Applies a changegroup to a repo
>
> This class is doing too much and needs to be split. To be honest, I'm not sure how. I do know refactoring it will be a long series :)

I think I'm starting to understand your issue here. I'll have to take a closer look at what all these classes are doing (or we could discuss it if you have it all loaded in your head).

> > > > and I am not convinced yet that your cleanup could not happen with the existing objects,
> > >
> > > The existing functions are horrible and not very useful. As this series demonstrates, we had something like 6 functions producing either:
> > >
> > > * a generator of changegroup chunks
> > > * a cg<N>unpacker instance
> > >
> > > from
> > >
> > > * an outgoing instance
> > > * a set of bases and heads
> > > * a set of common nodes and heads
> > >
> > > sometimes with knobs to control the changegroup version.
> > >
> > > The API was unclear and far too low-level (e.g. sometimes you wanted to access the number of changesets or the nodes in the changegroup afterwards and the API hid these from you). To be honest, it wasn't clear to me that this was how things worked until I wrote the series. That's not a good sign.
> >
> > (Reminder: I'm aware the current code could use improvement.)
> >
> > Your new code seems to keep the 3 options for specifying the outgoing set, so I assume you focused on getting a simpler return type. Can you elaborate on why we need multiple return types in the first place?
>
> Flexibility for callers. Now they can do things like obtain multiple changegroup version representations from the same type. Perhaps you want to write a server extension that stashes both a v1 and v2 bundle from pushed data. It's now much simpler to do that.

I believe it would be better to clean up the current code before adding more advanced features to it. We might end up not needing this in core if the API is nice enough.

Cheers,
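The streaming idea quoted repeatedly in this thread (feed a generator of variably sized chunks straight into zlib and emit whatever it produces, with no util.chunkbuffer in between) needs nothing beyond the standard library. This is a minimal sketch, not code from the series; the `chunks` iterable stands in for whatever yields bundle2/changegroup data, e.g. a bundle part's getchunks() generator.

```python
# Minimal sketch: compress a stream of byte chunks without buffering them
# through util.chunkbuffer first. `chunks` is any iterable of bytes, such as
# the generator returned by a bundle2 part's getchunks().
import zlib

def compressedchunks(chunks):
    """Yield zlib-compressed data for an iterable of byte chunks."""
    z = zlib.compressobj()
    for chunk in chunks:
        data = z.compress(chunk)
        if data:
            yield data
    tail = z.flush()
    if tail:
        yield tail
```

On the server side one would then iterate compressedchunks(...) and hand each piece to the WSGI response, which is the shape of steps 6-8 in the pipeline above minus the intermediate util.chunkbuffer.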
Patch
```diff
diff --git a/mercurial/exchange.py b/mercurial/exchange.py
--- a/mercurial/exchange.py
+++ b/mercurial/exchange.py
@@ -1547,39 +1547,42 @@ def getbundle(repo, source, heads=None,
                                      **kwargs)
 
     return util.chunkbuffer(bundler.getchunks())
 
 @getbundle2partsgenerator('changegroup')
 def _getbundlechangegrouppart(bundler, repo, source, bundlecaps=None,
                               b2caps=None, heads=None, common=None,
                               **kwargs):
     """add a changegroup part to the requested bundle"""
-    cg = None
-    if kwargs.get('cg', True):
-        # build changegroup bundle here.
-        version = '01'
-        cgversions = b2caps.get('changegroup')
-        if cgversions:  # 3.1 and 3.2 ship with an empty value
-            cgversions = [v for v in cgversions
-                          if v in changegroup.supportedoutgoingversions(repo)]
-            if not cgversions:
-                raise ValueError(_('no common changegroup version'))
-            version = max(cgversions)
-        outgoing = changegroup.computeoutgoing(repo, heads, common)
-        cg = changegroup.getlocalchangegroupraw(repo, source, outgoing,
-                                                bundlecaps=bundlecaps,
-                                                version=version)
+    if not kwargs.get('cg', True):
+        return
 
-    if cg:
-        part = bundler.newpart('changegroup', data=cg)
-        if cgversions:
-            part.addparam('version', version)
-        part.addparam('nbchanges', str(len(outgoing.missing)), mandatory=False)
-        if 'treemanifest' in repo.requirements:
-            part.addparam('treemanifest', '1')
+    # build changegroup bundle here.
+    version = '01'
+    cgversions = b2caps.get('changegroup')
+    if cgversions:  # 3.1 and 3.2 ship with an empty value
+        cgversions = [v for v in cgversions
+                      if v in changegroup.supportedoutgoingversions(repo)]
+        if not cgversions:
+            raise ValueError(_('no common changegroup version'))
+        version = max(cgversions)
+    outgoing = changegroup.computeoutgoing(repo, heads, common)
+    cg = changegroup.getlocalchangegroupraw(repo, source, outgoing,
+                                            bundlecaps=bundlecaps,
+                                            version=version)
+
+    if not cg:
+        return
+
+    part = bundler.newpart('changegroup', data=cg)
+    if cgversions:
+        part.addparam('version', version)
+    part.addparam('nbchanges', str(len(outgoing.missing)), mandatory=False)
+    if 'treemanifest' in repo.requirements:
+        part.addparam('treemanifest', '1')
 
 @getbundle2partsgenerator('listkeys')
 def _getbundlelistkeysparts(bundler, repo, source, bundlecaps=None,
                             b2caps=None, **kwargs):
     """add parts containing listkeys namespaces to the requested bundle"""
     listkeys = kwargs.get('listkeys', ())
     for namespace in listkeys:
         part = bundler.newpart('listkeys')
```