Patchwork [3,of,3,RFC] setdiscovery.findcommonheads: examine ancestors to avoid discovery

login
register
mail settings
Submitter Gregory Szorc
Date Nov. 20, 2014, 6:05 p.m.
Message ID <6dc58916e7c070f67868.1416506744@3.1.168.192.in-addr.arpa>
Download mbox | patch
Permalink /patch/6811/
State Deferred
Headers show

Comments

Gregory Szorc - Nov. 20, 2014, 6:05 p.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1416505368 28800
#      Thu Nov 20 09:42:48 2014 -0800
# Node ID 6dc58916e7c070f678682bfe404d2e2d68291a18
# Parent  6e7ea082099e457b048f268751d345392653e97e
setdiscovery.findcommonheads: examine ancestors to avoid discovery

THIS PATCH IS UNTESTED AND IS MEANT FOR CONSIDERATION ONLY.

Discovery can be a very expensive operation when the local repository
has many of heads. For example, my Firefox repository with hundreds of
Mozilla's release branches and my unfinished bookmarks/heads has ~1000
heads. When using `hg pull -r <node>` to fetch a specific head from a
remote, this can take 20-40 server roundtrips for "known" lookups.
This adds considerable latency to pulling.

This patch implements an alternative mechanism for discovery in the case
where the client knows which nodes it wants from the server. Instead of
the client asking the server whether it knows about each and every
client head (which gets more expensive as the number of client heads
is increased), the client instead asks the server for the ancestors
of the wanted head. The first ancestor that is locally known is used
as the common head.

The cost of tracing ancestry is proportional to how far the local and
remote heads have "diverged." If the client is relatively up to date,
the common ancestor should be quickly discovered. If the client is
thousands of commits "behind," we'll need to traverse more ancestors
until we find a match. This means more data transfer from server to
client.

This patch implements ancestor based discovery in the most simple manner
possible:

1) We must only be requesting a single remote node
2) The numbr of local heads were require multiple round trips
3) The common ancestor must be discovered in the first N ancestors, or
   we fall back to querying the known flag for each local head.

The ancestry-based discovery approach should scale better for clients
with many heads. This especially holds true where clients have
relatively little upload bandwidth compared to download bandwidth (which
is often the case on the public Internet). With ancestry-based
discovery, the client sends a very small request to the server and the
server is sending down many 40 byte SHA-1s. This is the opposite of
how "known" works, where clients send many 40 byte SHA-1s and the server
sends 1 byte per SHA-1 in the response.

One could imagine the ancestry-based approach being far more robust. For
example, the server command could stream down the full ancestors list
until a match is discovered and the client says "stop." But this would
require more robust bi-direction protocol primitives than what we have
now.
Matt Mackall - Dec. 2, 2014, 10:35 p.m.
On Thu, 2014-11-20 at 10:05 -0800, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1416505368 28800
> #      Thu Nov 20 09:42:48 2014 -0800
> # Node ID 6dc58916e7c070f678682bfe404d2e2d68291a18
> # Parent  6e7ea082099e457b048f268751d345392653e97e
> setdiscovery.findcommonheads: examine ancestors to avoid discovery
> 
> THIS PATCH IS UNTESTED AND IS MEANT FOR CONSIDERATION ONLY.

Sadly, I don't know enough about the current discovery algorithm to say
much without spending a few hours building context. But:

> Instead of
> the client asking the server whether it knows about each and every
> client head (which gets more expensive as the number of client heads
> is increased), the client instead asks the server for the ancestors
> of the wanted head.

This sounds a lot like the "between" method in the original discovery
protocol. So you can probably implement your new protocol without adding
any new methods.

Patch

diff --git a/mercurial/setdiscovery.py b/mercurial/setdiscovery.py
--- a/mercurial/setdiscovery.py
+++ b/mercurial/setdiscovery.py
@@ -143,26 +143,55 @@  def findcommonheads(ui, local, remote,
     if remote.local():
         # stopgap until we have a proper localpeer that supports batch()
         srvheadhashes = remote.heads()
         yesno = remote.known(dag.externalizeall(sample))
+        ancestors = None
     elif remote.capable('batch'):
         batch = remote.batch()
         srvheadhashesref = batch.heads()
+
+        # If we are only looking for certain nodes, there is a possibly
+        # faster method for finding a common node: look at the remote
+        # ancestors and cross reference with locally-known nodes. If the
+        # remote hasn't seen much activity since we last pulled its head
+        # we'll find a match, and we can avoid a potentially costly
+        # discovery of every single local head.
+        #
+        # For now, we only engage this mode if looking for single nodes
+        # *and* our local head count would require additional round trips.
+        singlewanted = wantednodes and len(wantednodes) == 1
+        anscapable = remote.capable('ancestors')
+        if singlewanted and len(sample) < len(ownheads) and anscapable:
+            ancestorsref = batch.ancestors(wantednodes, 400)
+        else:
+            ancestorsref = None
+
         yesnoref = batch.known(dag.externalizeall(sample))
         batch.submit()
         srvheadhashes = srvheadhashesref.value
+        if ancestorsref:
+            ancestors = ancestorsref.value
+        else:
+            ancestors = None
         yesno = yesnoref.value
     else:
         # compatibility with pre-batch, but post-known remotes during 1.9
         # development
         srvheadhashes = remote.heads()
+        ancestors = None
         sample = []
 
     if cl.tip() == nullid:
         if srvheadhashes != [nullid]:
             return [nullid], True, srvheadhashes
         return [nullid], False, []
 
+    # Try to find a quick match from ancestors, if possible.
+    if ancestors:
+        for node in ancestors:
+            if node in repo:
+                return [repo[node].node(), True, wantednodes]
+
     # start actual discovery (we note this before the next "if" for
     # compatibility reasons)
     ui.status(_("searching for changes\n"))