Patchwork [2,of,4] parsers: write dirstate starting with non-normal entries

login
register
mail settings
Submitter Laurent Charignon
Date Nov. 25, 2015, 1:19 a.m.
Message ID <ea9d03d4e85ea3949bb8.1448414349@dev5073.prn1.facebook.com>
Download mbox | patch
Permalink /patch/11644/
State Superseded
Delegated to: Yuya Nishihara
Headers show

Comments

Laurent Charignon - Nov. 25, 2015, 1:19 a.m.
# HG changeset patch
# User Laurent Charignon <lcharignon@fb.com>
# Date 1448413597 28800
#      Tue Nov 24 17:06:37 2015 -0800
# Node ID ea9d03d4e85ea3949bb8d16bd9e1a80246a8247b
# Parent  3bd86861a1618aabe6ec7f2cde1223282f9569be
parsers: write dirstate starting with non-normal entries

Before this patch we were writing the dirstate entries in a "random" way,
following the *unstable* order of a Python dictionary. This patch changes the
order in which we write the dirstate entries.

We now start with the non-normal files (that have changed and likely to have
changed) and end with the normal files. This makes the job of hg status easier
as, in most cases, it will need to access the non-normal entries of the
dirstate. This new ordering allows hg status to stop iterating over the dirstate
after processing those entries.

On our large repos, for hg status, we achieve a 40% improvement.
On the same repo, the cost of this change is a slowdown for writing the
dirstate to disk (as we do two passes). I measured the execution time of
hg debugrebuilddirstate with and without the change and observed a 5% slowdown
for the overall command.
Yuya Nishihara - Nov. 25, 2015, 2:24 p.m.
On Tue, 24 Nov 2015 17:19:09 -0800, Laurent Charignon wrote:
> # HG changeset patch
> # User Laurent Charignon <lcharignon@fb.com>
> # Date 1448413597 28800
> #      Tue Nov 24 17:06:37 2015 -0800
> # Node ID ea9d03d4e85ea3949bb8d16bd9e1a80246a8247b
> # Parent  3bd86861a1618aabe6ec7f2cde1223282f9569be
> parsers: write dirstate starting with non-normal entries
> 
> Before this patch we were writing the dirstate entries in a "random" way,
> following the *unstable* order of a Python dictionary. This patch changes the
> order in which we write the dirstate entries.
> 
> We now start with the non-normal files (that have changed and likely to have
> changed) and end with the normal files. This makes the job of hg status easier
> as, in most cases, it will need to access the non-normal entries of the
> dirstate. This new ordering allows hg status to stop iterating over the dirstate
> after processing those entries.
> 
> On our large repos, for hg status, we achieve a 40% improvement.
> On the same repo, the cost of this change is a slowdown for writing the
> dirstate to disk (as we do two passes). I measured the execution time of
> hg debugrebuilddirstate with and without the change and observed a 5% slowdown
> for the overall command.
> 
> diff --git a/mercurial/parsers.c b/mercurial/parsers.c
> --- a/mercurial/parsers.c
> +++ b/mercurial/parsers.c
> @@ -602,7 +602,10 @@
>  	}
>  	memcpy(p, s, l);
>  	p += 20;
> -	if (0 == 0) {
> +	int pass;

New variable can't be declared here according to C89, and MSVC does complain it.

> +	/* First pass, non normal files, second pass normal files. This is to improve
> +	 * status performance as status generally only need the non normal files */
> +	for (pass = 0; pass <= 1; pass++) {
>  		for (pos = 0; PyDict_Next(map, &pos, &k, &v); ) {
>  			dirstateTupleObject *tuple;
>  			char state;
> @@ -610,6 +613,7 @@
>  			Py_ssize_t len, l;
>  			PyObject *o;
>  			char *t;
> +			int normal;
>  
>  			if (!dirstate_tuple_check(v)) {
>  				PyErr_SetString(PyExc_TypeError,
> @@ -622,6 +626,9 @@
>  			mode = tuple->mode;
>  			size = tuple->size;
>  			mtime = tuple->mtime;
> +			normal = (state == 'n' && mtime != -1);
> +			if (normal != pass)
> +				continue;

As we have the pass to figure out memory size before the "first pass", maybe
we can know the offset to the first "normal" file beforehand. Then, we won't
need the "second pass".
Durham Goode - Nov. 25, 2015, 8:27 p.m.
On 11/24/15 8:19 PM, Laurent Charignon wrote:
> # HG changeset patch
> # User Laurent Charignon <lcharignon@fb.com>
> # Date 1448413597 28800
> #      Tue Nov 24 17:06:37 2015 -0800
> # Node ID ea9d03d4e85ea3949bb8d16bd9e1a80246a8247b
> # Parent  3bd86861a1618aabe6ec7f2cde1223282f9569be
> parsers: write dirstate starting with non-normal entries
>
> Before this patch we were writing the dirstate entries in a "random" way,
> following the *unstable* order of a Python dictionary. This patch changes the
> order in which we write the dirstate entries.
>
> We now start with the non-normal files (that have changed and likely to have
> changed) and end with the normal files. This makes the job of hg status easier
> as, in most cases, it will need to access the non-normal entries of the
> dirstate. This new ordering allows hg status to stop iterating over the dirstate
> after processing those entries.
>
> On our large repos, for hg status, we achieve a 40% improvement.
> On the same repo, the cost of this change is a slowdown for writing the
> dirstate to disk (as we do two passes). I measured the execution time of
> hg debugrebuilddirstate with and without the change and observed a 5% slowdown
> for the overall command.
>
Is 'hg debugrebuilddirstate' a good benchmark?  What is the performance 
difference for 'hg commit' or 'hg up .^'.  Also, what is the actual 
number behind 5%?  a 500ms slow down would be worriesome, while a 20ms 
slow down is ok.
Laurent Charignon - Nov. 30, 2015, 10:16 p.m.
> On Nov 25, 2015, at 12:27 PM, Durham Goode <durham@fb.com> wrote:
> 
> 
> 
> On 11/24/15 8:19 PM, Laurent Charignon wrote:
>> # HG changeset patch
>> # User Laurent Charignon <lcharignon@fb.com>
>> # Date 1448413597 28800
>> #      Tue Nov 24 17:06:37 2015 -0800
>> # Node ID ea9d03d4e85ea3949bb8d16bd9e1a80246a8247b
>> # Parent  3bd86861a1618aabe6ec7f2cde1223282f9569be
>> parsers: write dirstate starting with non-normal entries
>> 
>> Before this patch we were writing the dirstate entries in a "random" way,
>> following the *unstable* order of a Python dictionary. This patch changes the
>> order in which we write the dirstate entries.
>> 
>> We now start with the non-normal files (that have changed and likely to have
>> changed) and end with the normal files. This makes the job of hg status easier
>> as, in most cases, it will need to access the non-normal entries of the
>> dirstate. This new ordering allows hg status to stop iterating over the dirstate
>> after processing those entries.
>> 
>> On our large repos, for hg status, we achieve a 40% improvement.
>> On the same repo, the cost of this change is a slowdown for writing the
>> dirstate to disk (as we do two passes). I measured the execution time of
>> hg debugrebuilddirstate with and without the change and observed a 5% slowdown
>> for the overall command.
>> 
> Is 'hg debugrebuilddirstate' a good benchmark?  What is the performance difference for 'hg commit' or 'hg up .^'.  Also, what is the actual number behind 5%?  a 500ms slow down would be worriesome, while a 20ms slow down is ok.


What else do you suggest to use?
With hg debugrebuilddirstate, on one of our big repos, on average I measure a 16ms difference: 3.2s execution without the patch vs 3.32s with the patch.
Durham Goode - Nov. 30, 2015, 10:48 p.m.
On 11/30/15 5:16 PM, Laurent Charignon wrote:
>> On Nov 25, 2015, at 12:27 PM, Durham Goode <durham@fb.com> wrote:
>>
>>
>>
>> On 11/24/15 8:19 PM, Laurent Charignon wrote:
>>> # HG changeset patch
>>> # User Laurent Charignon <lcharignon@fb.com>
>>> # Date 1448413597 28800
>>> #      Tue Nov 24 17:06:37 2015 -0800
>>> # Node ID ea9d03d4e85ea3949bb8d16bd9e1a80246a8247b
>>> # Parent  3bd86861a1618aabe6ec7f2cde1223282f9569be
>>> parsers: write dirstate starting with non-normal entries
>>>
>>> Before this patch we were writing the dirstate entries in a "random" way,
>>> following the *unstable* order of a Python dictionary. This patch changes the
>>> order in which we write the dirstate entries.
>>>
>>> We now start with the non-normal files (that have changed and likely to have
>>> changed) and end with the normal files. This makes the job of hg status easier
>>> as, in most cases, it will need to access the non-normal entries of the
>>> dirstate. This new ordering allows hg status to stop iterating over the dirstate
>>> after processing those entries.
>>>
>>> On our large repos, for hg status, we achieve a 40% improvement.
>>> On the same repo, the cost of this change is a slowdown for writing the
>>> dirstate to disk (as we do two passes). I measured the execution time of
>>> hg debugrebuilddirstate with and without the change and observed a 5% slowdown
>>> for the overall command.
>>>
>> Is 'hg debugrebuilddirstate' a good benchmark?  What is the performance difference for 'hg commit' or 'hg up .^'.  Also, what is the actual number behind 5%?  a 500ms slow down would be worriesome, while a 20ms slow down is ok.
>
> What else do you suggest to use?
> With hg debugrebuilddirstate, on one of our big repos, on average I measure a 16ms difference: 3.2s execution without the patch vs 3.32s with the patch.
>
Cool, that sounds minimal enough.  I just wasn't sure if hg 
debugrebuilddirstate executed the code in the same way commit or update 
would, since it sets all the bits to lookup.

Patch

diff --git a/mercurial/parsers.c b/mercurial/parsers.c
--- a/mercurial/parsers.c
+++ b/mercurial/parsers.c
@@ -602,7 +602,10 @@ 
 	}
 	memcpy(p, s, l);
 	p += 20;
-	if (0 == 0) {
+	int pass;
+	/* First pass, non normal files, second pass normal files. This is to improve
+	 * status performance as status generally only need the non normal files */
+	for (pass = 0; pass <= 1; pass++) {
 		for (pos = 0; PyDict_Next(map, &pos, &k, &v); ) {
 			dirstateTupleObject *tuple;
 			char state;
@@ -610,6 +613,7 @@ 
 			Py_ssize_t len, l;
 			PyObject *o;
 			char *t;
+			int normal;
 
 			if (!dirstate_tuple_check(v)) {
 				PyErr_SetString(PyExc_TypeError,
@@ -622,6 +626,9 @@ 
 			mode = tuple->mode;
 			size = tuple->size;
 			mtime = tuple->mtime;
+			normal = (state == 'n' && mtime != -1);
+			if (normal != pass)
+				continue;
 			if (state == 'n' && mtime == now) {
 				/* See pure/parsers.py:pack_dirstate for why we do
 				 * this. */