Submitter | phabricator |
---|---|
Date | March 4, 2018, 4:58 p.m. |
Message ID | <differential-rev-PHID-DREV-s6t7b2iqrhk5vojsstz6-req@phab.mercurial-scm.org> |
Download | mbox | patch |
Permalink | /patch/28935/ |
State | Superseded |
Headers | show |
Comments
indygreg requested changes to this revision. indygreg added a comment. This revision now requires changes to proceed. The commit message looks incomplete? We really want a change like this to be documented. Could you please write more in the commit message and inline? The docstring of `_takequicksample()` is also now inaccurate. Also, I'm not convinced this change is correct. Should we connect IRL? REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D2647 To: martinvonz, #hg-reviewers, indygreg Cc: indygreg, mercurial-devel
martinvonz added a comment. In https://phab.mercurial-scm.org/D2647#42980, @indygreg wrote: > The commit message looks incomplete? > > We really want a change like this to be documented. Could you please write more in the commit message and inline? The docstring of `_takequicksample()` is also now inaccurate. > > Also, I'm not convinced this change is correct. Should we connect IRL? Heh, I sent this by mistake. I think it should be functionally correct, but we can optimize better. I'll drop this patch. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D2647 To: martinvonz, #hg-reviewers, indygreg Cc: indygreg, mercurial-devel
marmoute added a comment. I feel like I am missing something. Your commit message seems to be talking using at least as many item in the sameple than there is independant connected set. However your code seems to use "heads(undecided)" that is a quite different. Using independant connected set seems like a good trade off (but might be expensive to compute). Using all heads can significantly bloat the discovery without giving it a significant edge in many cases. Can you clarify this ? INLINE COMMENTS > setdiscovery.py:113 > > - def __init__(self, repo, targetheads): > + def __init__(self, repo, targetheads, limitedarguments): > self._repo = repo Calling this "limited argument" seems like wireprotocol details leaking into more abstract discovery logic. Can we give it a different name ? (maybe something like "maxsize=200" or "extensiblesample=False") ? REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D2647 To: martinvonz, #hg-reviewers, indygreg, marmoute Cc: indygreg, mercurial-devel
martinvonz added a comment. In https://phab.mercurial-scm.org/D2647#93256, @marmoute wrote: > I feel like I am missing something. Your commit message seems to be talking using at least as many item in the sameple than there is independant connected set. However your code seems to use "heads(undecided)" that is a quite different. Using independant connected set seems like a good trade off (but might be expensive to compute). Using all heads can significantly bloat the discovery without giving it a significant edge in many cases. Good point. The case I can think of is when you have a tree of commits on the local side. Something like this: o | o | | o | | | o | | |/ | |/ | o | | | o | | | o |/ o ~ If we have a long sequence of commits there and many heads, we would increase the sampling of the (mostly-)linear part unnecessarily. I'll see if there's an easy way to improve that. > Can you clarify this ? INLINE COMMENTS > marmoute wrote in setdiscovery.py:113 > Calling this "limited argument" seems like wireprotocol details leaking into more abstract discovery logic. Can we give it a different name ? (maybe something like "maxsize=200" or "extensiblesample=False") ? Good point. Done. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D2647 To: martinvonz, #hg-reviewers, indygreg, marmoute Cc: indygreg, mercurial-devel
martinvonz added a comment. In https://phab.mercurial-scm.org/D2647#93258, @martinvonz wrote: > In https://phab.mercurial-scm.org/D2647#93256, @marmoute wrote: > > > I feel like I am missing something. Your commit message seems to be talking using at least as many item in the sameple than there is independant connected set. However your code seems to use "heads(undecided)" that is a quite different. Using independant connected set seems like a good trade off (but might be expensive to compute). Using all heads can significantly bloat the discovery without giving it a significant edge in many cases. > > > Good point. The case I can think of is when you have a tree of commits on the local side. Something like this: > > o > | o > | | o > | | | o > | | |/ > | |/ > | o > | | > | o > | | > | o > |/ > o > ~ > > > If we have a long sequence of commits there and many heads, we would increase the sampling of the (mostly-)linear part unnecessarily. I'll see if there's an easy way to improve that. I think it was as easy as changing from using number of heads to using number of roots. Do you think there are still cases we'd handle poorly? I think there are things we can improve in the size-limited case too (it seems like it should be better to include certain points, like the mid-point, than to pick randomly), but that's a bigger task that I'm not willing to start working on now. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D2647 To: martinvonz, #hg-reviewers, indygreg, marmoute Cc: indygreg, mercurial-devel
marmoute added a comment. We could maybe make it a function of both the number of heads and roots. That is not strictly the number of connected set, but that would provide a more conservative approach. That could over-sample for hour-glass shape, but they are probably less common. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D2647 To: martinvonz, #hg-reviewers, indygreg, marmoute Cc: indygreg, mercurial-devel
martinvonz added a comment.
In https://phab.mercurial-scm.org/D2647#93456, @marmoute wrote:
> We could maybe make it a function of both the number of heads and roots. That is not strictly the number of connected set, but that would provide a more conservative approach. That could over-sample for hour-glass shape, but they are probably less common.
The current way (only consider roots) should not over-sample, right? It still seems very effective in practice.
REPOSITORY
rHG Mercurial
REVISION DETAIL
https://phab.mercurial-scm.org/D2647
To: martinvonz, #hg-reviewers, indygreg, marmoute
Cc: indygreg, mercurial-devel
gracinet added a comment. In https://phab.mercurial-scm.org/D2647#93457, @martinvonz wrote: > In https://phab.mercurial-scm.org/D2647#93456, @marmoute wrote: > > > We could maybe make it a function of both the number of heads and roots. That is not strictly the number of connected set, but that would provide a more conservative approach. That could over-sample for hour-glass shape, but they are probably less common. > > > The current way (only consider roots) should not over-sample, right? It still seems very effective in practice. I suppose it will if you have thousands of roots converging to one point, then that point diverging again towards thousands of heads, all of that actually missing from the remote, but anyway, the current random wouldn't catch that either. Anyway, basing on the number of roots is biased towards cases where lots of independents branches are missing from the remote, like in the use case from the commit message, whereas basing on the number of heads would be biased towards lots of independent branches being common (while not being remote heads, since we know them already, hence, cases where they have been merged in the remote). I imagine the latter being quite common as well. I would be in favor of something based on both. Still I start feeling like we should be a bit cautious with the impact on all of this on network time and remote servers CPU time in practical cases (although the 'known' question is quite fast). At which threshold would an additional roundtrip be actually faster for everyone ? REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D2647 To: martinvonz, #hg-reviewers, indygreg, marmoute Cc: gracinet, indygreg, mercurial-devel
marmoute added a comment. Something only based on the number of root can also over sample. For a "simple" example, imagine a undecideded set with many roots that eventually all merge into a few heads. If most of that set is common between local and remote, few question about the part of the history at the heads will quickly "decide" many changesets. Numberous questions at the roots part of the history won't. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D2647 To: martinvonz, #hg-reviewers, indygreg, marmoute Cc: gracinet, indygreg, mercurial-devel
gracinet added a comment. Also I'd like to mention that I have changesets that add more precise timing info, both for the Python variant (this one) and the Rust variant. These are the ones I used to put timing info in https://phab.mercurial-scm.org/D6430 and https://phab.mercurial-scm.org/D6428. I can share them, but I wouldn't submit them. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D2647 To: martinvonz, #hg-reviewers, indygreg, marmoute Cc: gracinet, indygreg, mercurial-devel
martinvonz added a comment. In https://phab.mercurial-scm.org/D2647#93458, @gracinet wrote: > In https://phab.mercurial-scm.org/D2647#93457, @martinvonz wrote: > > > In https://phab.mercurial-scm.org/D2647#93456, @marmoute wrote: > > > > > We could maybe make it a function of both the number of heads and roots. That is not strictly the number of connected set, but that would provide a more conservative approach. That could over-sample for hour-glass shape, but they are probably less common. > > > > > > The current way (only consider roots) should not over-sample, right? It still seems very effective in practice. > > > I suppose it will if you have thousands of roots converging to one point, then that point diverging again towards thousands of heads, all of that actually missing from the remote, but anyway, the current random wouldn't catch that either. Right, I also thought of that case, but it just seemed too obscure to worry about. > Anyway, basing on the number of roots is biased towards cases where lots of independents branches are missing from the remote, like in the use case from the commit message, whereas basing on the number of heads would be biased towards lots of independent branches being common (while not being remote heads, since we know them already, hence, cases where they have been merged in the remote). I imagine the latter being quite common as well. > > I would be in favor of something based on both. I'm happy with what we get for this simple change, but feel free to send a followup. > Still I start feeling like we should be a bit cautious with the impact on all of this on network time and remote servers CPU time in practical cases (although the 'known' question is quite fast). At which threshold would an additional roundtrip be actually faster for everyone ? I agree with you in principle. Roundtrips also cost a bit in terms of CPU and bandwidth, so it's probably not a big concern? REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D2647 To: martinvonz, #hg-reviewers, indygreg, marmoute Cc: gracinet, indygreg, mercurial-devel
Patch
diff --git a/mercurial/setdiscovery.py b/mercurial/setdiscovery.py --- a/mercurial/setdiscovery.py +++ b/mercurial/setdiscovery.py @@ -107,12 +107,16 @@ :size: the maximum size of the sample""" sample = dag.headsetofconnecteds(nodes) if len(sample) >= size: - return _limitsample(sample, size) + # Return full set of heads, without limiting + return sample _updatesample(dag, None, sample, quicksamplesize=size) return sample def _takefullsample(dag, nodes, size): sample = dag.headsetofconnecteds(nodes) + if len(sample) >= size: + # Return full set of heads, without limiting + return sample # update from heads _updatesample(dag, nodes, sample) # update from roots