Patchwork util: increase the copy buffer size in shutil's copyfile

login
register
mail settings
Submitter Tony Tung
Date May 6, 2016, 8:36 p.m.
Message ID <edee9b1c3a2849fcb119.1462567001@dev8692.prn1.facebook.com>
Download mbox | patch
Permalink /patch/14952/
State Changes Requested
Headers show

Comments

Tony Tung - May 6, 2016, 8:36 p.m.
# HG changeset patch
# User Tony Tung <ttung@fb.com>
# Date 1462566734 25200
#      Fri May 06 13:32:14 2016 -0700
# Branch stable
# Node ID edee9b1c3a2849fcb11900959ad785b202608c4f
# Parent  97811ff7964710d32cae951df1da8019b46151a2
util: increase the copy buffer size in shutil's copyfile

The default copy buffer size in shutil is 16KB.  This is far too
conservative for modern systems.  Use a 4MB buffer instead to copy files
more efficiently.

This manifests prominently with journaling large dirstates, which can be
multi-second operations.
Augie Fackler - May 6, 2016, 8:50 p.m.
On Fri, May 06, 2016 at 01:36:41PM -0700, Tony Tung wrote:
> # HG changeset patch
> # User Tony Tung <ttung@fb.com>
> # Date 1462566734 25200
> #      Fri May 06 13:32:14 2016 -0700
> # Branch stable
> # Node ID edee9b1c3a2849fcb11900959ad785b202608c4f
> # Parent  97811ff7964710d32cae951df1da8019b46151a2
> util: increase the copy buffer size in shutil's copyfile
>
> The default copy buffer size in shutil is 16KB.  This is far too
> conservative for modern systems.  Use a 4MB buffer instead to copy files
> more efficiently.
>
> This manifests prominently with journaling large dirstates, which can be
> multi-second operations.
>
> diff --git a/mercurial/util.py b/mercurial/util.py
> --- a/mercurial/util.py
> +++ b/mercurial/util.py
> @@ -20,6 +20,7 @@
>  import collections
>  import datetime
>  import errno
> +import functools
>  import gc
>  import hashlib
>  import imp
> @@ -1010,6 +1011,23 @@
>
>      return check
>
> +# The default copy buffer in python's shutil is 16KB.  This is far too
> +# conservative in modern systems.  Copy more at a time so we don't incur so much
> +# system call overhead.
> +def wrap(container, funcname, newfunc):
> +    origfunc = getattr(container, funcname)
> +    wrap = functools.partial(newfunc, origfunc)
> +    functools.update_wrapper(wrap, origfunc)
> +    setattr(container, funcname, wrap)
> +
> +def copyfileobj_lgbuffer(orig, *args, **kwargs):
> +    if len(args) == 2 and 'length' not in kwargs:
> +        args = (args[0], args[1], (4 * 1024 * 1024))
> +
> +    orig(*args, **kwargs)
> +
> +wrap(shutil, 'copyfileobj', copyfileobj_lgbuffer)

Why are you monkeypatching shutil's implementation here? I'm happy to
have a helper method in util, but I'm deeply skeptical of
monkeypatching the stdlib, doubly so when it's a method we don't
currently call outside of a test.

> +
>  def copyfile(src, dest, hardlink=False, copystat=False):
>      '''copy a file, preserving mode and optionally other stat info like
>      atime/mtime'''
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
via Mercurial-devel - May 6, 2016, 9:11 p.m.
> On May 6, 2016, at 1:50 PM, Augie Fackler <raf@durin42.com> wrote:

> 

> On Fri, May 06, 2016 at 01:36:41PM -0700, Tony Tung wrote:

>> # HG changeset patch

>> # User Tony Tung <ttung@fb.com>

>> # Date 1462566734 25200

>> #      Fri May 06 13:32:14 2016 -0700

>> # Branch stable

>> # Node ID edee9b1c3a2849fcb11900959ad785b202608c4f

>> # Parent  97811ff7964710d32cae951df1da8019b46151a2

>> util: increase the copy buffer size in shutil's copyfile

>> 

>> The default copy buffer size in shutil is 16KB.  This is far too

>> conservative for modern systems.  Use a 4MB buffer instead to copy files

>> more efficiently.

>> 

>> This manifests prominently with journaling large dirstates, which can be

>> multi-second operations.

>> 

>> diff --git a/mercurial/util.py b/mercurial/util.py

>> --- a/mercurial/util.py

>> +++ b/mercurial/util.py

>> @@ -20,6 +20,7 @@

>> import collections

>> import datetime

>> import errno

>> +import functools

>> import gc

>> import hashlib

>> import imp

>> @@ -1010,6 +1011,23 @@

>> 

>>     return check

>> 

>> +# The default copy buffer in python's shutil is 16KB.  This is far too

>> +# conservative in modern systems.  Copy more at a time so we don't incur so much

>> +# system call overhead.

>> +def wrap(container, funcname, newfunc):

>> +    origfunc = getattr(container, funcname)

>> +    wrap = functools.partial(newfunc, origfunc)

>> +    functools.update_wrapper(wrap, origfunc)

>> +    setattr(container, funcname, wrap)

>> +

>> +def copyfileobj_lgbuffer(orig, *args, **kwargs):

>> +    if len(args) == 2 and 'length' not in kwargs:

>> +        args = (args[0], args[1], (4 * 1024 * 1024))

>> +

>> +    orig(*args, **kwargs)

>> +

>> +wrap(shutil, 'copyfileobj', copyfileobj_lgbuffer)

> 

> Why are you monkeypatching shutil's implementation here? I'm happy to

> have a helper method in util, but I'm deeply skeptical of

> monkeypatching the stdlib, doubly so when it's a method we don't

> currently call outside of a test.


copyfileobj is called by copyfile.  copyfile does not expose the length parameter.  Either we copypasta copyfile from shutil or we monkey patch copyfileobj’s default.  I’m not particularly wedded to either approach as each has its own downside.

Is the helper method you’re suggesting the copypaste approach?

Thanks,
Tony
Augie Fackler - May 6, 2016, 9:12 p.m.
On Fri, May 6, 2016 at 5:11 PM, Tony Tung via Mercurial-devel
<mercurial-devel@mercurial-scm.org> wrote:
>> On May 6, 2016, at 1:50 PM, Augie Fackler <raf@durin42.com> wrote:
>>
>> On Fri, May 06, 2016 at 01:36:41PM -0700, Tony Tung wrote:
>>> # HG changeset patch
>>> # User Tony Tung <ttung@fb.com>
>>> # Date 1462566734 25200
>>> #      Fri May 06 13:32:14 2016 -0700
>>> # Branch stable
>>> # Node ID edee9b1c3a2849fcb11900959ad785b202608c4f
>>> # Parent  97811ff7964710d32cae951df1da8019b46151a2
>>> util: increase the copy buffer size in shutil's copyfile

>>> +
>>> +def copyfileobj_lgbuffer(orig, *args, **kwargs):
>>> +    if len(args) == 2 and 'length' not in kwargs:
>>> +        args = (args[0], args[1], (4 * 1024 * 1024))
>>> +
>>> +    orig(*args, **kwargs)
>>> +
>>> +wrap(shutil, 'copyfileobj', copyfileobj_lgbuffer)
>>
>> Why are you monkeypatching shutil's implementation here? I'm happy to
>> have a helper method in util, but I'm deeply skeptical of
>> monkeypatching the stdlib, doubly so when it's a method we don't
>> currently call outside of a test.
>
> copyfileobj is called by copyfile.  copyfile does not expose the length parameter.  Either we copypasta copyfile from shutil or we monkey patch copyfileobj’s default.  I’m not particularly wedded to either approach as each has its own downside.
>
> Is the helper method you’re suggesting the copypaste approach?

Yep. Don't monkeypatch the stdlib.

>
> Thanks,
> Tony
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Gregory Szorc - May 6, 2016, 10:26 p.m.
On Fri, May 6, 2016 at 2:11 PM, Tony Tung via Mercurial-devel <
mercurial-devel@mercurial-scm.org> wrote:

> > On May 6, 2016, at 1:50 PM, Augie Fackler <raf@durin42.com> wrote:
> >
> > On Fri, May 06, 2016 at 01:36:41PM -0700, Tony Tung wrote:
> >> # HG changeset patch
> >> # User Tony Tung <ttung@fb.com>
> >> # Date 1462566734 25200
> >> #      Fri May 06 13:32:14 2016 -0700
> >> # Branch stable
> >> # Node ID edee9b1c3a2849fcb11900959ad785b202608c4f
> >> # Parent  97811ff7964710d32cae951df1da8019b46151a2
> >> util: increase the copy buffer size in shutil's copyfile
> >>
> >> The default copy buffer size in shutil is 16KB.  This is far too
> >> conservative for modern systems.  Use a 4MB buffer instead to copy files
> >> more efficiently.
> >>
> >> This manifests prominently with journaling large dirstates, which can be
> >> multi-second operations.
> >>
> >> diff --git a/mercurial/util.py b/mercurial/util.py
> >> --- a/mercurial/util.py
> >> +++ b/mercurial/util.py
> >> @@ -20,6 +20,7 @@
> >> import collections
> >> import datetime
> >> import errno
> >> +import functools
> >> import gc
> >> import hashlib
> >> import imp
> >> @@ -1010,6 +1011,23 @@
> >>
> >>     return check
> >>
> >> +# The default copy buffer in python's shutil is 16KB.  This is far too
> >> +# conservative in modern systems.  Copy more at a time so we don't
> incur so much
> >> +# system call overhead.
> >> +def wrap(container, funcname, newfunc):
> >> +    origfunc = getattr(container, funcname)
> >> +    wrap = functools.partial(newfunc, origfunc)
> >> +    functools.update_wrapper(wrap, origfunc)
> >> +    setattr(container, funcname, wrap)
> >> +
> >> +def copyfileobj_lgbuffer(orig, *args, **kwargs):
> >> +    if len(args) == 2 and 'length' not in kwargs:
> >> +        args = (args[0], args[1], (4 * 1024 * 1024))
> >> +
> >> +    orig(*args, **kwargs)
> >> +
> >> +wrap(shutil, 'copyfileobj', copyfileobj_lgbuffer)
> >
> > Why are you monkeypatching shutil's implementation here? I'm happy to
> > have a helper method in util, but I'm deeply skeptical of
> > monkeypatching the stdlib, doubly so when it's a method we don't
> > currently call outside of a test.
>
> copyfileobj is called by copyfile.  copyfile does not expose the length
> parameter.  Either we copypasta copyfile from shutil or we monkey patch
> copyfileobj’s default.  I’m not particularly wedded to either approach as
> each has its own downside.
>
> Is the helper method you’re suggesting the copypaste approach?
>

The implementation is trivial and is simply basic POSIX I/O primitives:
https://hg.python.org/cpython/file/4462e193f089/Lib/shutil.py#l46

If we have our wits about us, we could implement shutil.copyfile() using
fewer system calls, especially on Windows, which has CopyFileEx() to
perform file copying in a single system call. Although I'm trying to think
if we do any hardcore file copying anywhere for this to matter performance
wise.

Patch

diff --git a/mercurial/util.py b/mercurial/util.py
--- a/mercurial/util.py
+++ b/mercurial/util.py
@@ -20,6 +20,7 @@ 
 import collections
 import datetime
 import errno
+import functools
 import gc
 import hashlib
 import imp
@@ -1010,6 +1011,23 @@ 
 
     return check
 
+# The default copy buffer in python's shutil is 16KB.  This is far too
+# conservative in modern systems.  Copy more at a time so we don't incur so much
+# system call overhead.
+def wrap(container, funcname, newfunc):
+    origfunc = getattr(container, funcname)
+    wrap = functools.partial(newfunc, origfunc)
+    functools.update_wrapper(wrap, origfunc)
+    setattr(container, funcname, wrap)
+
+def copyfileobj_lgbuffer(orig, *args, **kwargs):
+    if len(args) == 2 and 'length' not in kwargs:
+        args = (args[0], args[1], (4 * 1024 * 1024))
+
+    orig(*args, **kwargs)
+
+wrap(shutil, 'copyfileobj', copyfileobj_lgbuffer)
+
 def copyfile(src, dest, hardlink=False, copystat=False):
     '''copy a file, preserving mode and optionally other stat info like
     atime/mtime'''