Patchwork [3,of,6] patch: migrate to util.iterfile

login
register
mail settings
Submitter Jun Wu
Date Nov. 14, 2016, 11:35 p.m.
Message ID <d1637df5ffd91e95da25.1479166555@x1c>
Download mbox | patch
Permalink /patch/17571/
State Accepted
Headers show

Comments

Jun Wu - Nov. 14, 2016, 11:35 p.m.
# HG changeset patch
# User Jun Wu <quark@fb.com>
# Date 1479165246 0
#      Mon Nov 14 23:14:06 2016 +0000
# Node ID d1637df5ffd91e95da25213f06f346adb3269ace
# Parent  8097b8c66952003addd5b683a14265c1d3cc201f
# Available At https://bitbucket.org/quark-zju/hg-draft
#              hg pull https://bitbucket.org/quark-zju/hg-draft -r d1637df5ffd9
patch: migrate to util.iterfile
Yuya Nishihara - Nov. 15, 2016, 12:14 p.m.
On Mon, 14 Nov 2016 23:35:55 +0000, Jun Wu wrote:
> # HG changeset patch
> # User Jun Wu <quark@fb.com>
> # Date 1479165246 0
> #      Mon Nov 14 23:14:06 2016 +0000
> # Node ID d1637df5ffd91e95da25213f06f346adb3269ace
> # Parent  8097b8c66952003addd5b683a14265c1d3cc201f
> # Available At https://bitbucket.org/quark-zju/hg-draft
> #              hg pull https://bitbucket.org/quark-zju/hg-draft -r d1637df5ffd9
> patch: migrate to util.iterfile

Not all read()s are interruptible by signals. I don't remember the detail, but
signal(7) of Linux says "read(2) ... calls on "slow" devices. ... Note that a
(local) disk is not a slow device according to this definition; I/O operations
on disk devices are not interrupted by signals."

Since readline() doesn't have cache in CPython layer, it would be slightly
slower than fp.__iter__().
Jun Wu - Nov. 15, 2016, 1:38 p.m.
Excerpts from Yuya Nishihara's message of 2016-11-15 21:14:10 +0900:
> On Mon, 14 Nov 2016 23:35:55 +0000, Jun Wu wrote:
> > # HG changeset patch
> > # User Jun Wu <quark@fb.com>
> > # Date 1479165246 0
> > #      Mon Nov 14 23:14:06 2016 +0000
> > # Node ID d1637df5ffd91e95da25213f06f346adb3269ace
> > # Parent  8097b8c66952003addd5b683a14265c1d3cc201f
> > # Available At https://bitbucket.org/quark-zju/hg-draft 
> > #              hg pull https://bitbucket.org/quark-zju/hg-draft  -r d1637df5ffd9
> > patch: migrate to util.iterfile
> 
> Not all read()s are interruptible by signals. I don't remember the detail, but
> signal(7) of Linux says "read(2) ... calls on "slow" devices. ... Note that a
> (local) disk is not a slow device according to this definition; I/O operations
> on disk devices are not interrupted by signals."

Good to know! I guess modern OS have similar decisions, although POSIX seems
to be not helpful here.

> Since readline() doesn't have cache in CPython layer, it would be slightly
> slower than fp.__iter__().

A quick benchmark shows it's ~4x slower.

Maybe make the function smarter to only use the workaround for a pipe?

  if hassafeattr(fp, 'fileno') and stat.S_ISFIFO(os.fstat(fp.fileno()).st_mode):
      return iter(fp.readline, '')
  else:
      return fp
Pierre-Yves David - Nov. 15, 2016, 2:07 p.m.
On 11/15/2016 01:38 PM, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2016-11-15 21:14:10 +0900:
>> On Mon, 14 Nov 2016 23:35:55 +0000, Jun Wu wrote:
>>> # HG changeset patch
>>> # User Jun Wu <quark@fb.com>
>>> # Date 1479165246 0
>>> #      Mon Nov 14 23:14:06 2016 +0000
>>> # Node ID d1637df5ffd91e95da25213f06f346adb3269ace
>>> # Parent  8097b8c66952003addd5b683a14265c1d3cc201f
>>> # Available At https://bitbucket.org/quark-zju/hg-draft
>>> #              hg pull https://bitbucket.org/quark-zju/hg-draft  -r d1637df5ffd9
>>> patch: migrate to util.iterfile
>>
>> Not all read()s are interruptible by signals. I don't remember the detail, but
>> signal(7) of Linux says "read(2) ... calls on "slow" devices. ... Note that a
>> (local) disk is not a slow device according to this definition; I/O operations
>> on disk devices are not interrupted by signals."
>
> Good to know! I guess modern OS have similar decisions, although POSIX seems
> to be not helpful here.
>
>> Since readline() doesn't have cache in CPython layer, it would be slightly
>> slower than fp.__iter__().
>
> A quick benchmark shows it's ~4x slower.
>
> Maybe make the function smarter to only use the workaround for a pipe?
>
>   if hassafeattr(fp, 'fileno') and stat.S_ISFIFO(os.fstat(fp.fileno()).st_mode):
>       return iter(fp.readline, '')
>   else:
>       return fp

Should we drop this series from hg-committed or a followup is enough?

Cheers,
Martijn Pieters - Nov. 15, 2016, 2:45 p.m.
On 15 November 2016 at 12:14, Yuya Nishihara <yuya@tcha.org> wrote:

> Not all read()s are interruptible by signals. I don't remember the detail,
> but
> signal(7) of Linux says "read(2) ... calls on "slow" devices. ... Note
> that a
> (local) disk is not a slow device according to this definition; I/O
> operations
> on disk devices are not interrupted by signals."
>
> Since readline() doesn't have cache in CPython layer, it would be slightly
> slower than fp.__iter__().
>

It looks like this isn't the case for Windows; I found
http://bugs.python.org/issue1633941, where you can provoke the bug by using
^Z while reading from stdin.

Also, before Python 2.7.4 (including 2.6) there is no EINTR handling *at
all*, all reading operations that can be interrupted are affected. See
http://bugs.python.org/issue12268, which aimed at fixing that issue but
managed to miss readahead.
Jun Wu - Nov. 15, 2016, 3:11 p.m.
Excerpts from Pierre-Yves David's message of 2016-11-15 14:07:15 +0000:
> 
> On 11/15/2016 01:38 PM, Jun Wu wrote:
> > Excerpts from Yuya Nishihara's message of 2016-11-15 21:14:10 +0900:
> >> On Mon, 14 Nov 2016 23:35:55 +0000, Jun Wu wrote:
> >>> # HG changeset patch
> >>> # User Jun Wu <quark@fb.com>
> >>> # Date 1479165246 0
> >>> #      Mon Nov 14 23:14:06 2016 +0000
> >>> # Node ID d1637df5ffd91e95da25213f06f346adb3269ace
> >>> # Parent  8097b8c66952003addd5b683a14265c1d3cc201f
> >>> # Available At https://bitbucket.org/quark-zju/hg-draft 
> >>> #              hg pull https://bitbucket.org/quark-zju/hg-draft   -r d1637df5ffd9
> >>> patch: migrate to util.iterfile
> >>
> >> Not all read()s are interruptible by signals. I don't remember the detail, but
> >> signal(7) of Linux says "read(2) ... calls on "slow" devices. ... Note that a
> >> (local) disk is not a slow device according to this definition; I/O operations
> >> on disk devices are not interrupted by signals."
> >
> > Good to know! I guess modern OS have similar decisions, although POSIX seems
> > to be not helpful here.
> >
> >> Since readline() doesn't have cache in CPython layer, it would be slightly
> >> slower than fp.__iter__().
> >
> > A quick benchmark shows it's ~4x slower.
> >
> > Maybe make the function smarter to only use the workaround for a pipe?
> >
> >   if hassafeattr(fp, 'fileno') and stat.S_ISFIFO(os.fstat(fp.fileno()).st_mode):
> >       return iter(fp.readline, '')
> >   else:
> >       return fp
> 
> Should we drop this series from hg-committed or a followup is enough?

+1. I guess I have to abandon the SIGCHILD patch, too (sadly). As long as we
support Python 2.6, we are at risk that any fileobj.read() is broken.

> 
> Cheers,
>
Jun Wu - Nov. 15, 2016, 3:14 p.m.
Excerpts from Jun Wu's message of 2016-11-15 15:11:52 +0000:
> Excerpts from Pierre-Yves David's message of 2016-11-15 14:07:15 +0000:
> > 
> > On 11/15/2016 01:38 PM, Jun Wu wrote:
> > > Excerpts from Yuya Nishihara's message of 2016-11-15 21:14:10 +0900:
> > >> On Mon, 14 Nov 2016 23:35:55 +0000, Jun Wu wrote:
> > >>> # HG changeset patch
> > >>> # User Jun Wu <quark@fb.com>
> > >>> # Date 1479165246 0
> > >>> #      Mon Nov 14 23:14:06 2016 +0000
> > >>> # Node ID d1637df5ffd91e95da25213f06f346adb3269ace
> > >>> # Parent  8097b8c66952003addd5b683a14265c1d3cc201f
> > >>> # Available At https://bitbucket.org/quark-zju/hg-draft 
> > >>> #              hg pull https://bitbucket.org/quark-zju/hg-draft   -r d1637df5ffd9
> > >>> patch: migrate to util.iterfile
> > >>
> > >> Not all read()s are interruptible by signals. I don't remember the detail, but
> > >> signal(7) of Linux says "read(2) ... calls on "slow" devices. ... Note that a
> > >> (local) disk is not a slow device according to this definition; I/O operations
> > >> on disk devices are not interrupted by signals."
> > >
> > > Good to know! I guess modern OS have similar decisions, although POSIX seems
> > > to be not helpful here.
> > >
> > >> Since readline() doesn't have cache in CPython layer, it would be slightly
> > >> slower than fp.__iter__().
> > >
> > > A quick benchmark shows it's ~4x slower.
> > >
> > > Maybe make the function smarter to only use the workaround for a pipe?
> > >
> > >   if hassafeattr(fp, 'fileno') and stat.S_ISFIFO(os.fstat(fp.fileno()).st_mode):
> > >       return iter(fp.readline, '')
> > >   else:
> > >       return fp
> > 
> > Should we drop this series from hg-committed or a followup is enough?
> 
> +1. I guess I have to abandon the SIGCHILD patch, too (sadly). As long as we
> support Python 2.6, we are at risk that any fileobj.read() is broken.

Actually, I think the worker issue is still worthwhile to fix (the current
situation is better than none anyway). So I will send follow-ups:

  - Fix util.iterfile perf
  - Only use SIGCHILD handler in worker, if Python >= 2.7.4
> 
> > 
> > Cheers,
> >
Yuya Nishihara - Nov. 15, 2016, 3:39 p.m.
On Tue, 15 Nov 2016 15:14:09 +0000, Jun Wu wrote:
> Excerpts from Jun Wu's message of 2016-11-15 15:11:52 +0000:
> > Excerpts from Pierre-Yves David's message of 2016-11-15 14:07:15 +0000:
> > > 
> > > On 11/15/2016 01:38 PM, Jun Wu wrote:
> > > > Excerpts from Yuya Nishihara's message of 2016-11-15 21:14:10 +0900:
> > > >> On Mon, 14 Nov 2016 23:35:55 +0000, Jun Wu wrote:
> > > >>> # HG changeset patch
> > > >>> # User Jun Wu <quark@fb.com>
> > > >>> # Date 1479165246 0
> > > >>> #      Mon Nov 14 23:14:06 2016 +0000
> > > >>> # Node ID d1637df5ffd91e95da25213f06f346adb3269ace
> > > >>> # Parent  8097b8c66952003addd5b683a14265c1d3cc201f
> > > >>> # Available At https://bitbucket.org/quark-zju/hg-draft 
> > > >>> #              hg pull https://bitbucket.org/quark-zju/hg-draft   -r d1637df5ffd9
> > > >>> patch: migrate to util.iterfile
> > > >>
> > > >> Not all read()s are interruptible by signals. I don't remember the detail, but
> > > >> signal(7) of Linux says "read(2) ... calls on "slow" devices. ... Note that a
> > > >> (local) disk is not a slow device according to this definition; I/O operations
> > > >> on disk devices are not interrupted by signals."
> > > >
> > > > Good to know! I guess modern OS have similar decisions, although POSIX seems
> > > > to be not helpful here.
> > > >
> > > >> Since readline() doesn't have cache in CPython layer, it would be slightly
> > > >> slower than fp.__iter__().
> > > >
> > > > A quick benchmark shows it's ~4x slower.
> > > >
> > > > Maybe make the function smarter to only use the workaround for a pipe?
> > > >
> > > >   if hassafeattr(fp, 'fileno') and stat.S_ISFIFO(os.fstat(fp.fileno()).st_mode):
> > > >       return iter(fp.readline, '')

No idea if it can cover all cases. (e.g. i don't know if tty is the "slow"
device.)

> > > Should we drop this series from hg-committed or a followup is enough?
> > 
> > +1. I guess I have to abandon the SIGCHILD patch, too (sadly). As long as we
> > support Python 2.6, we are at risk that any fileobj.read() is broken.
> 
> Actually, I think the worker issue is still worthwhile to fix (the current
> situation is better than none anyway). So I will send follow-ups:
> 
>   - Fix util.iterfile perf
>   - Only use SIGCHILD handler in worker, if Python >= 2.7.4

It sounds good to patch worker.py which is known to have the problem, and we
wouldn't care if the read of progress messages gets 4x slower.
Jun Wu - Nov. 15, 2016, 3:46 p.m.
Excerpts from Yuya Nishihara's message of 2016-11-16 00:39:46 +0900:
> On Tue, 15 Nov 2016 15:14:09 +0000, Jun Wu wrote:
> > Excerpts from Jun Wu's message of 2016-11-15 15:11:52 +0000:
> > > Excerpts from Pierre-Yves David's message of 2016-11-15 14:07:15 +0000:
> > > > 
> > > > On 11/15/2016 01:38 PM, Jun Wu wrote:
> > > > > Excerpts from Yuya Nishihara's message of 2016-11-15 21:14:10 +0900:
> > > > >> On Mon, 14 Nov 2016 23:35:55 +0000, Jun Wu wrote:
> > > > >>> # HG changeset patch
> > > > >>> # User Jun Wu <quark@fb.com>
> > > > >>> # Date 1479165246 0
> > > > >>> #      Mon Nov 14 23:14:06 2016 +0000
> > > > >>> # Node ID d1637df5ffd91e95da25213f06f346adb3269ace
> > > > >>> # Parent  8097b8c66952003addd5b683a14265c1d3cc201f
> > > > >>> # Available At https://bitbucket.org/quark-zju/hg-draft  
> > > > >>> #              hg pull https://bitbucket.org/quark-zju/hg-draft    -r d1637df5ffd9
> > > > >>> patch: migrate to util.iterfile
> > > > >>
> > > > >> Not all read()s are interruptible by signals. I don't remember the detail, but
> > > > >> signal(7) of Linux says "read(2) ... calls on "slow" devices. ... Note that a
> > > > >> (local) disk is not a slow device according to this definition; I/O operations
> > > > >> on disk devices are not interrupted by signals."
> > > > >
> > > > > Good to know! I guess modern OS have similar decisions, although POSIX seems
> > > > > to be not helpful here.
> > > > >
> > > > >> Since readline() doesn't have cache in CPython layer, it would be slightly
> > > > >> slower than fp.__iter__().
> > > > >
> > > > > A quick benchmark shows it's ~4x slower.
> > > > >
> > > > > Maybe make the function smarter to only use the workaround for a pipe?
> > > > >
> > > > >   if hassafeattr(fp, 'fileno') and stat.S_ISFIFO(os.fstat(fp.fileno()).st_mode):
> > > > >       return iter(fp.readline, '')
> 
> No idea if it can cover all cases. (e.g. i don't know if tty is the "slow"
> device.)
> 
> > > > Should we drop this series from hg-committed or a followup is enough?
> > > 
> > > +1. I guess I have to abandon the SIGCHILD patch, too (sadly). As long as we
> > > support Python 2.6, we are at risk that any fileobj.read() is broken.
> > 
> > Actually, I think the worker issue is still worthwhile to fix (the current
> > situation is better than none anyway). So I will send follow-ups:
> > 
> >   - Fix util.iterfile perf
> >   - Only use SIGCHILD handler in worker, if Python >= 2.7.4
> 
> It sounds good to patch worker.py which is known to have the problem, and we
> wouldn't care if the read of progress messages gets 4x slower.

Given the fact that you can do "hg import - < file". I think it's worthwhile
to patch all places in the code base. The patch I'm currently writing will
special case S_ISREG files so it should have little perf impact.

Patch

diff --git a/mercurial/patch.py b/mercurial/patch.py
--- a/mercurial/patch.py
+++ b/mercurial/patch.py
@@ -1070,5 +1070,5 @@  the hunk is left unchanged.
                     patchfp = open(patchfn)
                     ncpatchfp = stringio()
-                    for line in patchfp:
+                    for line in util.iterfile(patchfp):
                         if not line.startswith('#'):
                             ncpatchfp.write(line)
@@ -2013,5 +2013,5 @@  def _externalpatch(ui, repo, patcher, pa
                                        util.shellquote(patchname)))
     try:
-        for line in fp:
+        for line in util.iterfile(fp):
             line = line.rstrip()
             ui.note(line + '\n')