Patchwork worker: change partition strategy to every Nth element

Submitter Gregory Szorc
Date Feb. 21, 2016, 12:02 a.m.
Message ID <81b1e7ce39d837e352e5.1456012931@ubuntu-vm-main>
Permalink /patch/13279/
State Accepted

Comments

Gregory Szorc - Feb. 21, 2016, 12:02 a.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1456012604 28800
#      Sat Feb 20 15:56:44 2016 -0800
# Node ID 81b1e7ce39d837e352e5ff9e799e512600ab4ba6
# Parent  8449ef66f732a71008dc887ffd6efbfb6dc64ee0
worker: change partition strategy to every Nth element

The only consumer of the worker pool code today is `hg update`.

Previously, the algorithm to partition work to each worker process
preserved input list ordering. We'd take the first N elements, then
the next N elements, etc. Measurements on mozilla-central demonstrate
this isn't an optimal partitioning strategy.
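For reference, the previous contiguous strategy behaves roughly like the following standalone sketch (using divmod in place of the Python 2 integer division in the real worker.py):

```python
def partition_contiguous(lst, nslices):
    """Previous strategy: N contiguous slices; front slices absorb the slop."""
    chunk, slop = divmod(len(lst), nslices)
    end = 0
    for _ in range(nslices):
        start = end
        end = start + chunk
        if slop:
            end += 1
            slop -= 1
        yield lst[start:end]

# With sorted input, adjacent (similar) files land in the same slice:
# list(partition_contiguous(list(range(10)), 3))
# → [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```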

I added debug code to print when workers were exiting. When performing
a working copy update on a previously empty working copy of
mozilla-central, I noticed that process lifetimes were all over the
map. One worker would complete after 7s. Many would complete after
12s. And another worker would often take >16s. This behavior occurred
for many worker process counts and was more pronounced on some than
others.

What I suspect is happening is that some workers end up with lots of
small files and others with large files. This is because the update
code passes in actions sorted by filename, and files within a
directory tend to be similar. For example, test directories
often consist of many small test files, while media directories contain
binary (often larger) media files.

This patch changes the partitioning algorithm to select every Nth
element from the input list. Each worker thus has a similar composition
of files to operate on.
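The new strategy reduces to a single stride slice; a minimal sketch:

```python
def partition_strided(lst, nslices):
    """New strategy: every Nth element, interleaving the sorted input."""
    for i in range(nslices):
        yield lst[i::nslices]

# Adjacent (similar) files are now spread across slices:
# list(partition_strided(list(range(10)), 3))
# → [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```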

The result of this change is that worker processes now all tend to exit
around the same time. The possibility of a long pole due to being
unlucky and receiving all the large files has been mitigated. Overall
execution time seems to drop, but not by a statistically significant
amount on mozilla-central. However, repositories with directories
containing many large files will likely show a drop.

There shouldn't be any regressions due to partial manifest decoding
because the update code already iterates the manifest to determine
what files to operate on, so the manifest should already be decoded.
Augie Fackler - Feb. 22, 2016, 10:46 p.m.
On Sat, Feb 20, 2016 at 04:02:11PM -0800, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1456012604 28800
> #      Sat Feb 20 15:56:44 2016 -0800
> # Node ID 81b1e7ce39d837e352e5ff9e799e512600ab4ba6
> # Parent  8449ef66f732a71008dc887ffd6efbfb6dc64ee0
> worker: change partition strategy to every Nth element

queued this, thanks

Matt Mackall - Feb. 23, 2016, 8:28 p.m.
On Mon, 2016-02-22 at 17:46 -0500, Augie Fackler wrote:
> queued this, thanks

Both this and the original partitioning are rather unfortunate from a disk
locality standpoint.

Single-threaded Mercurial makes a point of creating and visiting files in a
fixed order (alphabetical). When creating files in order, a typical filesystem
is likely to allocate them in nearby regions on disk. Thus, when revisiting in
the same order, locality is maximized and various forms of OS and disk-level
caching and read-ahead get a chance to work.

This effect can be quite significant on spinning disks. I discovered it circa
Mercurial v0.4 when revlogs were named by hashes of filenames. Tarring a repo
and copying it to another disk effectively randomized the revlog ordering on
disk by sorting the revlogs by hash and suddenly performance of my kernel
checkout benchmark dropped by ~10x because the "working set" of sectors visited
no longer fit in the drive's cache and the workload switched from streaming to
random I/O. 

What we should really be doing is having workers read filenames from an
ordered queue. This preserves locality and also keeps any worker from getting
more than one file out of balance.
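That suggestion could look roughly like the thread-based sketch below (hypothetical: the real POSIX worker forks processes rather than spawning threads, and `process()` here is a stand-in for applying an update action):

```python
import queue
import threading

def process(name):
    # Stand-in for the per-file work (e.g. writing one file during update).
    pass

def runqueued(filenames, nworkers):
    # Fill an ordered queue; workers pull one filename at a time, so
    # on-disk locality of the sorted order is preserved and no worker
    # falls more than one file behind the others.
    q = queue.Queue()
    for name in filenames:
        q.put(name)

    done = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                name = q.get_nowait()
            except queue.Empty:
                return
            process(name)
            with lock:
                done.append(name)

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```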

-- 
Mathematics is the supreme nostalgia of our time.

Patch

diff --git a/mercurial/worker.py b/mercurial/worker.py
--- a/mercurial/worker.py
+++ b/mercurial/worker.py
@@ -147,19 +147,16 @@  def _posixexitstatus(code):
     elif os.WIFSIGNALED(code):
         return -os.WTERMSIG(code)
 
 if os.name != 'nt':
     _platformworker = _posixworker
     _exitstatus = _posixexitstatus
 
 def partition(lst, nslices):
-    '''partition a list into N slices of equal size'''
-    n = len(lst)
-    chunk, slop = n / nslices, n % nslices
-    end = 0
-    for i in xrange(nslices):
-        start = end
-        end = start + chunk
-        if slop:
-            end += 1
-            slop -= 1
-        yield lst[start:end]
+    '''partition a list into N slices of roughly equal size
+
+    The current strategy takes every Nth element from the input. If
+    we ever write workers that need to preserve grouping in input
+    we should consider allowing callers to specify a partition strategy.
+    '''
+    for i in range(nslices):
+        yield lst[i::nslices]