Patchwork setdiscovery: document algorithms used

login
register
mail settings
Submitter Olle Lundberg
Date March 6, 2014, 11:40 a.m.
Message ID <47a9873df1d7a0d68e6a.1394106047@SE-C02KQ0DADR55>
Download mbox | patch
Permalink /patch/3876/
State Accepted
Commit cdecbc5ab5044595ca2d820a1b4933b5ec77c248
Headers show

Comments

Olle Lundberg - March 6, 2014, 11:40 a.m.
# HG changeset patch
# User Olle Lundberg <geek@nerd.sh>
# Date 1394105848 -3600
#      Thu Mar 06 12:37:28 2014 +0100
# Node ID 47a9873df1d7a0d68e6ac45ca284fc835bfa818d
# Parent  c8dd203dc7ce803828eea9c3a9b3590a41e94ca0
setdiscovery: document algorithms used

This is taken from:
http://programmers.stackexchange.com/questions/208998
And modified slightly.
Matt Mackall - March 6, 2014, 10:19 p.m.
On Thu, 2014-03-06 at 12:40 +0100, Olle Lundberg wrote:
> # HG changeset patch
> # User Olle Lundberg <geek@nerd.sh>
> # Date 1394105848 -3600
> #      Thu Mar 06 12:37:28 2014 +0100
> # Node ID 47a9873df1d7a0d68e6ac45ca284fc835bfa818d
> # Parent  c8dd203dc7ce803828eea9c3a9b3590a41e94ca0
> setdiscovery: document algorithms used

Sure. Queued for default, thanks.

Patch

diff --git a/mercurial/setdiscovery.py b/mercurial/setdiscovery.py
--- a/mercurial/setdiscovery.py
+++ b/mercurial/setdiscovery.py
@@ -3,10 +3,44 @@ 
 # Copyright 2010 Benoit Boissinot <bboissin@gmail.com>
 # and Peter Arrenbrecht <peter@arrenbrecht.ch>
 #
 # This software may be used and distributed according to the terms of the
 # GNU General Public License version 2 or any later version.
+"""
+Algorithm works in the following way. You have two repository: local and
+remote. They both contains a DAG of changelists.
+
+The goal of the discovery protocol is to find one set of node *common*,
+the set of nodes shared by local and remote.
+
+One of the issue with the original protocol was latency, it could
+potentially require lots of roundtrips to discover that the local repo was a
+subset of remote (which is a very common case, you usually have few changes
+compared to upstream, while upstream probably had lots of development).
+
+The new protocol only requires one interface for the remote repo: `known()`,
+which given a set of changelists tells you if they are present in the DAG.
+
+The algorithm then works as follow:
+
+ - We will be using three sets, `common`, `missing`, `unknown`. Originally
+ all nodes are in `unknown`.
+ - Take a sample from `unknown`, call `remote.known(sample)`
+   - For each node that remote knows, move it and all its ancestors to `common`
+   - For each node that remote doesn't know, move it and all its descendants
+   to `missing`
+ - Iterate until `unknown` is empty
+
+There are a couple optimizations, first is instead of starting with a random
+sample of missing, start by sending all heads, in the case where the local
+repo is a subset, you computed the answer in one round trip.
+
+Then you can do something similar to the bisecting strategy used when
+finding faulty changesets. Instead of random samples, you can try picking
+nodes that will maximize the number of nodes that will be
+classified with it (since all ancestors or descendants will be marked as well).
+"""
 
 from node import nullid
 from i18n import _
 import random
 import util, dagutil