Patchwork D2647: setdiscovery: include all local heads in second "known" request (issue5809)

login
register
mail settings
Submitter phabricator
Date March 4, 2018, 4:58 p.m.
Message ID <differential-rev-PHID-DREV-s6t7b2iqrhk5vojsstz6-req@phab.mercurial-scm.org>
Download mbox | patch
Permalink /patch/28935/
State Superseded
Headers show

Comments

phabricator - March 4, 2018, 4:58 p.m.
martinvonz created this revision.
Herald added a subscriber: mercurial-devel.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  During discovery, when figuring out which of the local commits the
  remote has, we start by sending a random sample of

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2647

AFFECTED FILES
  mercurial/setdiscovery.py

CHANGE DETAILS




To: martinvonz, #hg-reviewers
Cc: mercurial-devel
phabricator - March 4, 2018, 7:28 p.m.
indygreg requested changes to this revision.
indygreg added a comment.
This revision now requires changes to proceed.


  The commit message looks incomplete?
  
  We really want a change like this to be documented. Could you please write more in the commit message and inline? The docstring of `_takequicksample()` is also now inaccurate.
  
  Also, I'm not convinced this change is correct. Should we connect IRL?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2647

To: martinvonz, #hg-reviewers, indygreg
Cc: indygreg, mercurial-devel
phabricator - March 4, 2018, 7:44 p.m.
martinvonz added a comment.


  In https://phab.mercurial-scm.org/D2647#42980, @indygreg wrote:
  
  > The commit message looks incomplete?
  >
  > We really want a change like this to be documented. Could you please write more in the commit message and inline? The docstring of `_takequicksample()` is also now inaccurate.
  >
  > Also, I'm not convinced this change is correct. Should we connect IRL?
  
  
  Heh, I sent this by mistake. I think it should be functionally correct, but we can optimize better. I'll drop this patch.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2647

To: martinvonz, #hg-reviewers, indygreg
Cc: indygreg, mercurial-devel
phabricator - May 21, 2019, 8:17 p.m.
marmoute added a comment.


  I feel like I am missing something. Your commit message seems to be talking using at least as many item in the sameple than there is independant connected set. However your code seems to use "heads(undecided)" that is a quite different. Using independant connected set seems like a good trade off (but might be expensive to compute). Using all heads can significantly bloat the discovery without giving it a significant edge in many cases.
  
  Can you clarify this ?

INLINE COMMENTS

> setdiscovery.py:113
>  
> -    def __init__(self, repo, targetheads):
> +    def __init__(self, repo, targetheads, limitedarguments):
>          self._repo = repo

Calling this "limited argument" seems like wireprotocol details leaking into more abstract discovery logic. Can we give it a different name ? (maybe something like "maxsize=200" or "extensiblesample=False") ?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2647

To: martinvonz, #hg-reviewers, indygreg, marmoute
Cc: indygreg, mercurial-devel
phabricator - May 21, 2019, 9:05 p.m.
martinvonz added a comment.


  In https://phab.mercurial-scm.org/D2647#93256, @marmoute wrote:
  
  > I feel like I am missing something. Your commit message seems to be talking using at least as many item in the sameple than there is independant connected set. However your code seems to use "heads(undecided)" that is a quite different. Using independant connected set seems like a good trade off (but might be expensive to compute). Using all heads can significantly bloat the discovery without giving it a significant edge in many cases.
  
  
  Good point. The case I can think of is when you have a tree of commits on the local side. Something like this:
  
    o
    | o
    | | o
    | | | o
    | | |/
    | |/
    | o
    | |
    | o
    | |
    | o
    |/
    o
    ~
  
  If we have a long sequence of commits there and many heads, we would increase the sampling of the (mostly-)linear part unnecessarily. I'll see if there's an easy way to improve that.
  
  > Can you clarify this ?

INLINE COMMENTS

> marmoute wrote in setdiscovery.py:113
> Calling this "limited argument" seems like wireprotocol details leaking into more abstract discovery logic. Can we give it a different name ? (maybe something like "maxsize=200" or "extensiblesample=False") ?

Good point. Done.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2647

To: martinvonz, #hg-reviewers, indygreg, marmoute
Cc: indygreg, mercurial-devel
phabricator - May 22, 2019, 4:07 p.m.
martinvonz added a comment.


  In https://phab.mercurial-scm.org/D2647#93258, @martinvonz wrote:
  
  > In https://phab.mercurial-scm.org/D2647#93256, @marmoute wrote:
  >
  > > I feel like I am missing something. Your commit message seems to be talking using at least as many item in the sameple than there is independant connected set. However your code seems to use "heads(undecided)" that is a quite different. Using independant connected set seems like a good trade off (but might be expensive to compute). Using all heads can significantly bloat the discovery without giving it a significant edge in many cases.
  >
  >
  > Good point. The case I can think of is when you have a tree of commits on the local side. Something like this:
  >
  >   o
  >   | o
  >   | | o
  >   | | | o
  >   | | |/
  >   | |/
  >   | o
  >   | |
  >   | o
  >   | |
  >   | o
  >   |/
  >   o
  >   ~
  >   
  >
  > If we have a long sequence of commits there and many heads, we would increase the sampling of the (mostly-)linear part unnecessarily. I'll see if there's an easy way to improve that.
  
  
  I think it was as easy as changing from using number of heads to using number of roots. Do you think there are still cases we'd handle poorly? I think there are things we can improve in the size-limited case too (it seems like it should be better to include certain points, like the mid-point, than to pick randomly), but that's a bigger task that I'm not willing to start working on now.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2647

To: martinvonz, #hg-reviewers, indygreg, marmoute
Cc: indygreg, mercurial-devel
phabricator - May 22, 2019, 5:41 p.m.
marmoute added a comment.


  We could maybe make it a function of both the number of heads and roots. That is not strictly the number of connected set, but that would provide a more conservative approach. That could over-sample for hour-glass shape, but they are probably less common.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2647

To: martinvonz, #hg-reviewers, indygreg, marmoute
Cc: indygreg, mercurial-devel
phabricator - May 22, 2019, 5:42 p.m.
martinvonz added a comment.


  In https://phab.mercurial-scm.org/D2647#93456, @marmoute wrote:
  
  > We could maybe make it a function of both the number of heads and roots. That is not strictly the number of connected set, but that would provide a more conservative approach. That could over-sample for hour-glass shape, but they are probably less common.
  
  
  The current way (only consider roots) should not over-sample, right? It still seems very effective in practice.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2647

To: martinvonz, #hg-reviewers, indygreg, marmoute
Cc: indygreg, mercurial-devel
phabricator - May 22, 2019, 6:13 p.m.
gracinet added a comment.


  In https://phab.mercurial-scm.org/D2647#93457, @martinvonz wrote:
  
  > In https://phab.mercurial-scm.org/D2647#93456, @marmoute wrote:
  >
  > > We could maybe make it a function of both the number of heads and roots. That is not strictly the number of connected set, but that would provide a more conservative approach. That could over-sample for hour-glass shape, but they are probably less common.
  >
  >
  > The current way (only consider roots) should not over-sample, right? It still seems very effective in practice.
  
  
  I suppose it will if you have thousands of roots converging to one point, then that point diverging again towards thousands of heads, all of that actually missing from the remote, but anyway, the current random wouldn't catch that either. 
  Anyway, basing on the number of roots is biased towards cases where lots of independents branches are missing from the remote, like in the use case from the commit message, whereas basing on the number of heads would be biased towards lots of independent branches being common (while not being remote heads,  since we know them already, hence, cases where they have been merged in the remote). I imagine the latter being quite common as well.
  
  I would be in favor of something based on both. Still I start feeling like we should be a bit cautious with the impact on all of this on network time and remote servers CPU time in practical cases (although the 'known' question is quite fast).  At which threshold would an additional roundtrip be actually faster for everyone ?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2647

To: martinvonz, #hg-reviewers, indygreg, marmoute
Cc: gracinet, indygreg, mercurial-devel
phabricator - May 22, 2019, 6:14 p.m.
marmoute added a comment.


  Something only based on the number of root can also over sample. For a "simple" example, imagine a undecideded set with many roots that eventually all merge into a  few heads.
  If most of that set is common between local and remote, few question about the part of the history at the heads will quickly "decide" many changesets. Numberous questions at the roots part of the history won't.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2647

To: martinvonz, #hg-reviewers, indygreg, marmoute
Cc: gracinet, indygreg, mercurial-devel
phabricator - May 22, 2019, 6:16 p.m.
gracinet added a comment.


  Also I'd like to mention that I have changesets that add more precise timing info, both for the Python variant (this one) and the Rust variant. These are the ones I used to put timing info in https://phab.mercurial-scm.org/D6430 and https://phab.mercurial-scm.org/D6428. I can share them, but I wouldn't submit them.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2647

To: martinvonz, #hg-reviewers, indygreg, marmoute
Cc: gracinet, indygreg, mercurial-devel
phabricator - May 22, 2019, 6:29 p.m.
martinvonz added a comment.


  In https://phab.mercurial-scm.org/D2647#93458, @gracinet wrote:
  
  > In https://phab.mercurial-scm.org/D2647#93457, @martinvonz wrote:
  >
  > > In https://phab.mercurial-scm.org/D2647#93456, @marmoute wrote:
  > >
  > > > We could maybe make it a function of both the number of heads and roots. That is not strictly the number of connected set, but that would provide a more conservative approach. That could over-sample for hour-glass shape, but they are probably less common.
  > >
  > >
  > > The current way (only consider roots) should not over-sample, right? It still seems very effective in practice.
  >
  >
  > I suppose it will if you have thousands of roots converging to one point, then that point diverging again towards thousands of heads, all of that actually missing from the remote, but anyway, the current random wouldn't catch that either.
  
  
  Right, I also thought of that case, but it just seemed too obscure to worry about.
  
  > Anyway, basing on the number of roots is biased towards cases where lots of independents branches are missing from the remote, like in the use case from the commit message, whereas basing on the number of heads would be biased towards lots of independent branches being common (while not being remote heads,  since we know them already, hence, cases where they have been merged in the remote). I imagine the latter being quite common as well.
  > 
  > I would be in favor of something based on both.
  
  I'm happy with what we get for this simple change, but feel free to send a followup.
  
  > Still I start feeling like we should be a bit cautious with the impact on all of this on network time and remote servers CPU time in practical cases (although the 'known' question is quite fast).  At which threshold would an additional roundtrip be actually faster for everyone ?
  
  I agree with you in principle. Roundtrips also cost a bit in terms of CPU and bandwidth, so it's probably not a big concern?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2647

To: martinvonz, #hg-reviewers, indygreg, marmoute
Cc: gracinet, indygreg, mercurial-devel

Patch

diff --git a/mercurial/setdiscovery.py b/mercurial/setdiscovery.py
--- a/mercurial/setdiscovery.py
+++ b/mercurial/setdiscovery.py
@@ -107,12 +107,16 @@ 
     :size: the maximum size of the sample"""
     sample = dag.headsetofconnecteds(nodes)
     if len(sample) >= size:
-        return _limitsample(sample, size)
+        # Return full set of heads, without limiting
+        return sample
     _updatesample(dag, None, sample, quicksamplesize=size)
     return sample
 
 def _takefullsample(dag, nodes, size):
     sample = dag.headsetofconnecteds(nodes)
+    if len(sample) >= size:
+        # Return full set of heads, without limiting
+        return sample
     # update from heads
     _updatesample(dag, nodes, sample)
     # update from roots