Patchwork [V2] util: improve iterfile so it chooses code path wisely

login
register
mail settings
Submitter Jun Wu
Date Nov. 15, 2016, 8:38 p.m.
Message ID <f3d2f4ebc4006043684d.1479242304@x1c>
Download mbox | patch
Permalink /patch/17587/
State Changes Requested
Headers show

Comments

Jun Wu - Nov. 15, 2016, 8:38 p.m.
# HG changeset patch
# User Jun Wu <quark@fb.com>
# Date 1479241551 0
#      Tue Nov 15 20:25:51 2016 +0000
# Node ID f3d2f4ebc4006043684db52e4487756dd4e2d238
# Parent  d1a0a64f6e16432333bea0476098c46a61222b9b
# Available At https://bitbucket.org/quark-zju/hg-draft
#              hg pull https://bitbucket.org/quark-zju/hg-draft -r f3d2f4ebc400
util: improve iterfile so it chooses code path wisely

We have performance concerns on "iterfile" as it is 4X slower on normal
files. While modern systems have the nice property that reading a "fast"
(on-disk) file cannot be interrupted and should be made use of.

This patch dumps the related knowledge in comments. And "iterfile" chooses
code paths wisely:
1. If fp is a normal file, use the fast path.
2. If it's CPython 3, or PyPY, use the fast path.
3. If fp is not a normal file and CPython version >= 2.7.4, use the same
   workaround (4x slower) as before.
4. If fp is not a normal file and CPython version < 2.7.4, use another
   workaround (2x slower but may block longer then necessary) which
   basically re-invents the buffer + readline logic in Python.

This will give us good confidence on both correctness and performance
dealing with EINTR in iterfile(fp) for all known Python versions.
Yuya Nishihara - Nov. 16, 2016, 12:51 p.m.
On Tue, 15 Nov 2016 20:38:24 +0000, Jun Wu wrote:
> # HG changeset patch
> # User Jun Wu <quark@fb.com>
> # Date 1479241551 0
> #      Tue Nov 15 20:25:51 2016 +0000
> # Node ID f3d2f4ebc4006043684db52e4487756dd4e2d238
> # Parent  d1a0a64f6e16432333bea0476098c46a61222b9b
> # Available At https://bitbucket.org/quark-zju/hg-draft
> #              hg pull https://bitbucket.org/quark-zju/hg-draft -r f3d2f4ebc400
> util: improve iterfile so it chooses code path wisely

> +if (pyplatform.python_implementation() == 'CPython' and
> +    sys.version_info < (3, 0)):
> +    # There is an issue in CPython that some IO methods do not handle EINTR
> +    # correctly. The following table shows what CPython version (and functions)
> +    # are affected (Y: has the EINTR bug, N: otherwise):
> +    #
> +    #                | < 2.7.4 | 2.7.4 to 2.7.12 | >= 3.0
> +    #   --------------------------------------------------
> +    #    fp.__iter__ | Y       | Y               | N
> +    #    fp.read*    | Y       | N [1]           | N
> +    #
> +    # [1]: fixed by hg changeset 67dc99a989cd.
> +    #
> +    # Here we workaround the EINTR issue for fileobj.__iter__. Other methods
> +    # like "read*" are ignored for now, as Python < 2.7.4 is a minority.
> +    #
> +    # Although we can workaround the EINTR issue for fp.__iter__, it is slower:
> +    # "for x in fp" is 4x faster than "for x in iter(fp.readline, '')" in
> +    # CPython 2, because the latter maintains an internal readahead buffer.

Do you mean "the former" ?

> +    # On modern systems like Linux, the "read" syscall cannot be interrupted
> +    # when reading "fast" files like on-disk files. So the EINTR issue only
> +    # affects things like pipes, sockets, ttys etc. We treat "normal" (S_ISREG)
> +    # files approximately as "fast" files and use the fast (unsafe) code path,
> +    # to minimize the performance impact.
> +    if sys.version_info >= (2, 7, 4):
> +        # fp.readline deals with EINTR correctly, use it as a workaround
> +        def _safeiterfile(fp):
> +            return iter(fp.readline, '')
> +    else:
> +        # fp.read* are broken too, manually deal with EINTR in a stupid way
> +        # note: this may block longer than necessary because of bufsize.
> +        def _safeiterfile(fp, bufsize=4096):
> +            fd = fp.fileno()
> +            line = ''
> +            while True:
> +                try:
> +                    buf = os.read(fd, bufsize)
> +                except OSError as ex:
> +                    if ex.errno == errno.EINTR:
> +                        continue
> +                    else:
> +                        raise
> +                line += buf
> +                if '\n' in buf:
> +                    splitted = line.splitlines(True)
> +                    line = ''
> +                    for l in splitted:
> +                        if l[-1] == '\n':
> +                            yield l
> +                        else:
> +                            line = l
> +                if not buf:
> +                    break

Missed the last line if not ends with '\n'.

> +    def iterfile(fp):
> +        fastpath = True
> +        if type(fp) is file:
> +            fastpath = stat.S_ISREG(os.fstat(fp.fileno()).st_mode)
> +        if fastpath:
> +            return fp
> +        else:
> +            return _safeiterfile(fp)
> +else:
> +    # PyPy and CPython 3 do not have the EINTR issue thus no workaround needed.
> +    def iterfile(fp):
> +        return fp
Jun Wu - Nov. 16, 2016, 2:17 p.m.
Excerpts from Yuya Nishihara's message of 2016-11-16 21:51:50 +0900:
> On Tue, 15 Nov 2016 20:38:24 +0000, Jun Wu wrote:
> > # HG changeset patch
> > # User Jun Wu <quark@fb.com>
> > # Date 1479241551 0
> > #      Tue Nov 15 20:25:51 2016 +0000
> > # Node ID f3d2f4ebc4006043684db52e4487756dd4e2d238
> > # Parent  d1a0a64f6e16432333bea0476098c46a61222b9b
> > # Available At https://bitbucket.org/quark-zju/hg-draft 
> > #              hg pull https://bitbucket.org/quark-zju/hg-draft  -r f3d2f4ebc400
> > util: improve iterfile so it chooses code path wisely
> 
> > +if (pyplatform.python_implementation() == 'CPython' and
> > +    sys.version_info < (3, 0)):
> > +    # There is an issue in CPython that some IO methods do not handle EINTR
> > +    # correctly. The following table shows what CPython version (and functions)
> > +    # are affected (Y: has the EINTR bug, N: otherwise):
> > +    #
> > +    #                | < 2.7.4 | 2.7.4 to 2.7.12 | >= 3.0
> > +    #   --------------------------------------------------
> > +    #    fp.__iter__ | Y       | Y               | N
> > +    #    fp.read*    | Y       | N [1]           | N
> > +    #
> > +    # [1]: fixed by hg changeset 67dc99a989cd.
> > +    #
> > +    # Here we workaround the EINTR issue for fileobj.__iter__. Other methods
> > +    # like "read*" are ignored for now, as Python < 2.7.4 is a minority.
> > +    #
> > +    # Although we can workaround the EINTR issue for fp.__iter__, it is slower:
> > +    # "for x in fp" is 4x faster than "for x in iter(fp.readline, '')" in
> > +    # CPython 2, because the latter maintains an internal readahead buffer.
> 
> Do you mean "the former" ?

Good catch. Was trying to refer to "CPython" and saw it's at the right.

> > +    # On modern systems like Linux, the "read" syscall cannot be interrupted
> > +    # when reading "fast" files like on-disk files. So the EINTR issue only
> > +    # affects things like pipes, sockets, ttys etc. We treat "normal" (S_ISREG)
> > +    # files approximately as "fast" files and use the fast (unsafe) code path,
> > +    # to minimize the performance impact.
> > +    if sys.version_info >= (2, 7, 4):
> > +        # fp.readline deals with EINTR correctly, use it as a workaround
> > +        def _safeiterfile(fp):
> > +            return iter(fp.readline, '')
> > +    else:
> > +        # fp.read* are broken too, manually deal with EINTR in a stupid way
> > +        # note: this may block longer than necessary because of bufsize.
> > +        def _safeiterfile(fp, bufsize=4096):
> > +            fd = fp.fileno()
> > +            line = ''
> > +            while True:
> > +                try:
> > +                    buf = os.read(fd, bufsize)
> > +                except OSError as ex:
> > +                    if ex.errno == errno.EINTR:
> > +                        continue
> > +                    else:
> > +                        raise
> > +                line += buf
> > +                if '\n' in buf:
> > +                    splitted = line.splitlines(True)
> > +                    line = ''
> > +                    for l in splitted:
> > +                        if l[-1] == '\n':
> > +                            yield l
> > +                        else:
> > +                            line = l
> > +                if not buf:
> > +                    break
> 
> Missed the last line if not ends with '\n'.

Aside from that, it seems the test cases introduced by 67dc99a989cd could be
weak - I'll double check if "buf = os.read(fd, bufsize)" can "read some
bytes, then interruptted" so "buf" gets modified, even on EINTR.
If not, the only safe way seems to be the very stupid os.read(fd, 1) (an
previous unsent version). It's 30x slower. I think we probably don't want a
30x slower workaround. As we still support Python 2.6, this should be solved
before the SIGCHLD patch.

> > +    def iterfile(fp):
> > +        fastpath = True
> > +        if type(fp) is file:
> > +            fastpath = stat.S_ISREG(os.fstat(fp.fileno()).st_mode)
> > +        if fastpath:
> > +            return fp
> > +        else:
> > +            return _safeiterfile(fp)
> > +else:
> > +    # PyPy and CPython 3 do not have the EINTR issue thus no workaround needed.
> > +    def iterfile(fp):
> > +        return fp
Yuya Nishihara - Nov. 16, 2016, 3:25 p.m.
On Wed, 16 Nov 2016 14:17:10 +0000, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2016-11-16 21:51:50 +0900:
> > On Tue, 15 Nov 2016 20:38:24 +0000, Jun Wu wrote:
> > > +                try:
> > > +                    buf = os.read(fd, bufsize)
> > > +                except OSError as ex:
> > > +                    if ex.errno == errno.EINTR:
> > > +                        continue
> > > +                    else:
> > > +                        raise
> > > +                line += buf
> > > +                if '\n' in buf:
> > > +                    splitted = line.splitlines(True)
> > > +                    line = ''
> > > +                    for l in splitted:
> > > +                        if l[-1] == '\n':
> > > +                            yield l
> > > +                        else:
> > > +                            line = l
> > > +                if not buf:
> > > +                    break
> > 
> > Missed the last line if not ends with '\n'.
> 
> Aside from that, it seems the test cases introduced by 67dc99a989cd could be
> weak - I'll double check if "buf = os.read(fd, bufsize)" can "read some
> bytes, then interruptted" so "buf" gets modified, even on EINTR.

Perhaps os.read() is a thin wrapper over read(2), which says "EINTR The call
was interrupted by a signal before any data was read." In other words, read()
never raises EINTR if it has some data to return.
Jun Wu - Nov. 16, 2016, 4:16 p.m.
Excerpts from Yuya Nishihara's message of 2016-11-17 00:25:39 +0900:
> Perhaps os.read() is a thin wrapper over read(2), which says "EINTR The call
> was interrupted by a signal before any data was read." In other words, read()
> never raises EINTR if it has some data to return.

Nice property! It's a life saver.
I'm now a stdio hater like the CPython dev haypo.

Related read:
https://haypo-notes.readthedocs.io/python.html#bugs-in-the-c-stdio-used-by-the-python-i-o

Patch

diff --git a/mercurial/util.py b/mercurial/util.py
--- a/mercurial/util.py
+++ b/mercurial/util.py
@@ -25,8 +25,10 @@  import hashlib
 import imp
 import os
+import platform as pyplatform
 import re as remod
 import shutil
 import signal
 import socket
+import stat
 import string
 import subprocess
@@ -2191,8 +2193,71 @@  def wrap(line, width, initindent='', han
     return wrapper.fill(line).encode(encoding.encoding)
 
-def iterfile(fp):
-    """like fp.__iter__ but does not have issues with EINTR. Python 2.7.12 is
-    known to have such issues."""
-    return iter(fp.readline, '')
+if (pyplatform.python_implementation() == 'CPython' and
+    sys.version_info < (3, 0)):
+    # There is an issue in CPython that some IO methods do not handle EINTR
+    # correctly. The following table shows what CPython version (and functions)
+    # are affected (Y: has the EINTR bug, N: otherwise):
+    #
+    #                | < 2.7.4 | 2.7.4 to 2.7.12 | >= 3.0
+    #   --------------------------------------------------
+    #    fp.__iter__ | Y       | Y               | N
+    #    fp.read*    | Y       | N [1]           | N
+    #
+    # [1]: fixed by hg changeset 67dc99a989cd.
+    #
+    # Here we workaround the EINTR issue for fileobj.__iter__. Other methods
+    # like "read*" are ignored for now, as Python < 2.7.4 is a minority.
+    #
+    # Although we can workaround the EINTR issue for fp.__iter__, it is slower:
+    # "for x in fp" is 4x faster than "for x in iter(fp.readline, '')" in
+    # CPython 2, because the latter maintains an internal readahead buffer.
+    #
+    # On modern systems like Linux, the "read" syscall cannot be interrupted
+    # when reading "fast" files like on-disk files. So the EINTR issue only
+    # affects things like pipes, sockets, ttys etc. We treat "normal" (S_ISREG)
+    # files approximately as "fast" files and use the fast (unsafe) code path,
+    # to minimize the performance impact.
+    if sys.version_info >= (2, 7, 4):
+        # fp.readline deals with EINTR correctly, use it as a workaround
+        def _safeiterfile(fp):
+            return iter(fp.readline, '')
+    else:
+        # fp.read* are broken too, manually deal with EINTR in a stupid way
+        # note: this may block longer than necessary because of bufsize.
+        def _safeiterfile(fp, bufsize=4096):
+            fd = fp.fileno()
+            line = ''
+            while True:
+                try:
+                    buf = os.read(fd, bufsize)
+                except OSError as ex:
+                    if ex.errno == errno.EINTR:
+                        continue
+                    else:
+                        raise
+                line += buf
+                if '\n' in buf:
+                    splitted = line.splitlines(True)
+                    line = ''
+                    for l in splitted:
+                        if l[-1] == '\n':
+                            yield l
+                        else:
+                            line = l
+                if not buf:
+                    break
+
+    def iterfile(fp):
+        fastpath = True
+        if type(fp) is file:
+            fastpath = stat.S_ISREG(os.fstat(fp.fileno()).st_mode)
+        if fastpath:
+            return fp
+        else:
+            return _safeiterfile(fp)
+else:
+    # PyPy and CPython 3 do not have the EINTR issue thus no workaround needed.
+    def iterfile(fp):
+        return fp
 
 def iterlines(iterator):