Patchwork py3: make ui.write() call fout.flush() per line if fout is a tty or pager

login
register
mail settings
Submitter Yuya Nishihara
Date June 7, 2020, 11:55 a.m.
Message ID <ee3cd52b4b338e05b0a2.1591530903@mimosa>
Download mbox | patch
Permalink /patch/46483/
State New
Headers show

Comments

Yuya Nishihara - June 7, 2020, 11:55 a.m.
# HG changeset patch
# User Yuya Nishihara <yuya@tcha.org>
# Date 1591518045 -32400
#      Sun Jun 07 17:20:45 2020 +0900
# Node ID ee3cd52b4b338e05b0a22cb87580d78f1ad7763d
# Parent  03ba7de6a8b9ab339a6dafab969e890228b4c2f1
py3: make ui.write() call fout.flush() per line if fout is a tty or pager

Since sys.stdout.buffer of Python 3 can't be configured as a line-buffered
stream, we need a workaround:

 a. make callers do flush() where appropriate
 b. make ui insert flush() automatically
 c. use unbuffered stream

With my testing, inserting flush() automatically showed a better result than
using unbuffered io, so this patch implements (b). Since ui._isatty() call
isn't cheap, the result is cached. I also made ui.write() not scan the
entire string because it's unlikely that a string including '\n' doesn't
end with '\n'.

Maybe we can backport a subset of this patch to stable to mitigate the
ui.status() issue.

Benchmark result:

  % HGRCPATH=/dev/null python3 ./hg --cwd ~/mirrors/mozilla-central \
  --time --pager no status --all

  a. base
  time: real 10.950 secs (user 8.760+0.000 sys 1.270+0.000)
  time: real 9.450 secs (user 8.530+0.000 sys 0.910+0.000)
  time: real 10.180 secs (user 9.290+0.000 sys 0.880+0.000)

  b. this patch
  time: real 11.890 secs (user 10.740+0.000 sys 1.140+0.000)
  time: real 10.430 secs (user 9.410+0.000 sys 1.020+0.000)
  time: real 10.820 secs (user 9.500+0.000 sys 1.300+0.000)

  c. unbuffered io
  time: real 12.000 secs (user 10.350+0.000 sys 1.640+0.000)
  time: real 11.710 secs (user 10.170+0.000 sys 1.520+0.000)
  time: real 11.750 secs (user 9.900+0.000 sys 1.550+0.000)

  d. base python2 (line-buffered libc stdio)
  time: real 8.170 secs (user 6.990+0.000 sys 1.110+0.000)
  time: real 7.880 secs (user 6.830+0.000 sys 1.040+0.000)
  time: real 7.990 secs (user 6.920+0.000 sys 1.060+0.000)

Number of write() syscalls of the commands above:

  % sudo bpftrace -e 'tracepoint:syscalls:sys_exit_write /comm == "python" \
  || comm == "python3"/ { @dist[pid] = hist(args->ret); @count[pid] = count() }'

  a. base
  @count[252984]: 20810
  @dist[252984]:
  [32, 64)               1 |                                                    |
  [64, 128)              0 |                                                    |
  [128, 256)             0 |                                                    |
  [256, 512)             0 |                                                    |
  [512, 1K)          20346 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [1K, 2K)             462 |@                                                   |
  [2K, 4K)               0 |                                                    |
  [4K, 8K)               0 |                                                    |
  [8K, 16K)              1 |                                                    |

  b. this patch
  @count[252895]: 230220
  @dist[252895]:
  [16, 32)               9 |                                                    |
  [32, 64)           42297 |@@@@@@@@@@@@                                        |
  [64, 128)         170530 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [128, 256)         17384 |@@@@@                                               |

  c. unbuffered io
  @count[252934]: 460441
  @dist[252934]:
  [8, 16)           230220 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [16, 32)            1108 |                                                    |
  [32, 64)           76540 |@@@@@@@@@@@@@@@@@                                   |
  [64, 128)         143892 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    |
  [128, 256)          8679 |@                                                   |
  [256, 512)             0 |                                                    |
  [512, 1K)              0 |                                                    |
  [1K, 2K)               0 |                                                    |
  [2K, 4K)               0 |                                                    |
  [4K, 8K)               0 |                                                    |
  [8K, 16K)              1 |                                                    |
  [16K, 32K)             0 |                                                    |
  [32K, 64K)             1 |                                                    |

  d. base python2 (line-buffered libc stdio)
  @count[253265]: 230220
  @dist[253265]:
  [16, 32)               9 |                                                    |
  [32, 64)           42297 |@@@@@@@@@@@@                                        |
  [64, 128)         170530 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [128, 256)         17384 |@@@@@                                               |

You can see the number of write() syscalls is doubled for unbuffered io. There
may be a command giving worse result than "hg status", but I don't know.

Benchmark result with non-tty device:

  % HGRCPATH=/dev/null python3 ./hg --cwd ~/mirrors/mozilla-central \
  --time --pager no status --all > /dev/null

  a. base
  time: real 3.300 secs (user 2.830+0.000 sys 0.460+0.000)
  time: real 3.300 secs (user 2.790+0.000 sys 0.500+0.000)
  time: real 3.340 secs (user 2.860+0.000 sys 0.480+0.000)

  b. this patch
  time: real 3.460 secs (user 2.900+0.000 sys 0.550+0.000)
  time: real 3.330 secs (user 2.850+0.000 sys 0.490+0.000)
  time: real 3.320 secs (user 2.880+0.000 sys 0.440+0.000)

So there's no significant performance degradation when stdout isn't a tty.
Joerg Sonnenberger - June 7, 2020, 8:09 p.m.
On Sun, Jun 07, 2020 at 08:55:03PM +0900, Yuya Nishihara wrote:
> With my testing, inserting flush() automatically showed a better result than
> using unbuffered io, so this patch implements (b). Since ui._isatty() call
> isn't cheap, the result is cached. I also made ui.write() not scan the
> entire string because it's unlikely that a string including '\n' doesn't
> end with '\n'.

I'm not sure about the last part. Checking for \n first at the end makes
a lot of sense to me, but I don't think we should micro-optimize away
the scan here. Does it really make that much a difference?

Joerg
Manuel Jacob - June 8, 2020, 6:16 a.m.
On 2020-06-07 13:55, Yuya Nishihara wrote:
> # HG changeset patch
> # User Yuya Nishihara <yuya@tcha.org>
> # Date 1591518045 -32400
> #      Sun Jun 07 17:20:45 2020 +0900
> # Node ID ee3cd52b4b338e05b0a22cb87580d78f1ad7763d
> # Parent  03ba7de6a8b9ab339a6dafab969e890228b4c2f1
> py3: make ui.write() call fout.flush() per line if fout is a tty or 
> pager
> 
> Since sys.stdout.buffer of Python 3 can't be configured as a 
> line-buffered
> stream, we need a workaround:
> 
>  a. make callers do flush() where appropriate
>  b. make ui insert flush() automatically
>  c. use unbuffered stream
> 
> With my testing, inserting flush() automatically showed a better result 
> than
> using unbuffered io, so this patch implements (b). Since ui._isatty() 
> call
> isn't cheap, the result is cached. I also made ui.write() not scan the
> entire string because it's unlikely that a string including '\n' 
> doesn't
> end with '\n'.
> 
> Maybe we can backport a subset of this patch to stable to mitigate the
> ui.status() issue.

All in all, I agree that inserting a flush() per line automatically is 
the best approach for now, as it harmonizes the behavior towards what we 
so far considered to be "the correct behavior" (line-buffered if the 
stdout is a TTY, which works on Python 2 on UNIX).

So far, the implicit contract of ui.write() included that the stream is 
flushed if the string includes a newline. If we backport the patch to 
stable, it should definitively not include the optimization because it 
breaks that contract. In default, we can have the optimization after we 
ensured that all callers continue to work with the new contract. Maybe 
it’s better to completely move the optimization (and the discussion) to 
a follow-up patch.

> Benchmark result:
> 
>   % HGRCPATH=/dev/null python3 ./hg --cwd ~/mirrors/mozilla-central \
>   --time --pager no status --all
> 
>   a. base
>   time: real 10.950 secs (user 8.760+0.000 sys 1.270+0.000)
>   time: real 9.450 secs (user 8.530+0.000 sys 0.910+0.000)
>   time: real 10.180 secs (user 9.290+0.000 sys 0.880+0.000)
> 
>   b. this patch
>   time: real 11.890 secs (user 10.740+0.000 sys 1.140+0.000)
>   time: real 10.430 secs (user 9.410+0.000 sys 1.020+0.000)
>   time: real 10.820 secs (user 9.500+0.000 sys 1.300+0.000)
> 
>   c. unbuffered io
>   time: real 12.000 secs (user 10.350+0.000 sys 1.640+0.000)
>   time: real 11.710 secs (user 10.170+0.000 sys 1.520+0.000)
>   time: real 11.750 secs (user 9.900+0.000 sys 1.550+0.000)
> 
>   d. base python2 (line-buffered libc stdio)
>   time: real 8.170 secs (user 6.990+0.000 sys 1.110+0.000)
>   time: real 7.880 secs (user 6.830+0.000 sys 1.040+0.000)
>   time: real 7.990 secs (user 6.920+0.000 sys 1.060+0.000)

On my machine:

base:
time: real 10.310 secs (user 9.070+0.000 sys 1.200+0.000)
time: real 10.290 secs (user 9.130+0.000 sys 1.040+0.000)
time: real 10.260 secs (user 9.320+0.000 sys 0.920+0.000)

this patch:
time: real 12.850 secs (user 11.300+0.000 sys 1.500+0.000)
time: real 12.960 secs (user 11.300+0.000 sys 1.580+0.000)
time: real 12.510 secs (user 10.900+0.000 sys 1.570+0.000)

unbuffered (my patch):
time: real 13.980 secs (user 11.710+0.000 sys 2.180+0.000)
time: real 14.570 secs (user 12.300+0.000 sys 2.230+0.000)
time: real 13.870 secs (user 11.610+0.000 sys 2.110+0.000)

base on Python 2 (line-buffered libc stdio):
time: real 10.950 secs (user 9.240+0.000 sys 1.670+0.000)
time: real 11.140 secs (user 9.490+0.000 sys 1.630+0.000)
time: real 11.290 secs (user 9.640+0.000 sys 1.620+0.000)

Python 2 is slower on my machine, as it was compiled with less 
aggressive build flags. On Python 3, the results have the same order as 
on your machine, but the differences between them are larger. Overall, I 
think that the performance overhead is justified by the behavior fix.

> Number of write() syscalls of the commands above:
> 
>   % sudo bpftrace -e 'tracepoint:syscalls:sys_exit_write /comm == 
> "python" \
>   || comm == "python3"/ { @dist[pid] = hist(args->ret); @count[pid] = 
> count() }'
> 
>   a. base
>   @count[252984]: 20810
>   @dist[252984]:
>   [32, 64)               1 |
>          |
>   [64, 128)              0 |
>          |
>   [128, 256)             0 |
>          |
>   [256, 512)             0 |
>          |
>   [512, 1K)          20346
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>   [1K, 2K)             462 |@
>          |
>   [2K, 4K)               0 |
>          |
>   [4K, 8K)               0 |
>          |
>   [8K, 16K)              1 |
>          |
> 
>   b. this patch
>   @count[252895]: 230220
>   @dist[252895]:
>   [16, 32)               9 |
>          |
>   [32, 64)           42297 |@@@@@@@@@@@@
>          |
>   [64, 128)         170530
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>   [128, 256)         17384 |@@@@@
>          |
> 
>   c. unbuffered io
>   @count[252934]: 460441
>   @dist[252934]:
>   [8, 16)           230220
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>   [16, 32)            1108 |
>          |
>   [32, 64)           76540 |@@@@@@@@@@@@@@@@@
>          |
>   [64, 128)         143892 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
>          |
>   [128, 256)          8679 |@
>          |
>   [256, 512)             0 |
>          |
>   [512, 1K)              0 |
>          |
>   [1K, 2K)               0 |
>          |
>   [2K, 4K)               0 |
>          |
>   [4K, 8K)               0 |
>          |
>   [8K, 16K)              1 |
>          |
>   [16K, 32K)             0 |
>          |
>   [32K, 64K)             1 |
>          |
> 
>   d. base python2 (line-buffered libc stdio)
>   @count[253265]: 230220
>   @dist[253265]:
>   [16, 32)               9 |
>          |
>   [32, 64)           42297 |@@@@@@@@@@@@
>          |
>   [64, 128)         170530
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>   [128, 256)         17384 |@@@@@
>          |
> 
> You can see the number of write() syscalls is doubled for unbuffered 
> io. There
> may be a command giving worse result than "hg status", but I don't 
> know.
> 
> Benchmark result with non-tty device:
> 
>   % HGRCPATH=/dev/null python3 ./hg --cwd ~/mirrors/mozilla-central \
>   --time --pager no status --all > /dev/null
> 
>   a. base
>   time: real 3.300 secs (user 2.830+0.000 sys 0.460+0.000)
>   time: real 3.300 secs (user 2.790+0.000 sys 0.500+0.000)
>   time: real 3.340 secs (user 2.860+0.000 sys 0.480+0.000)
> 
>   b. this patch
>   time: real 3.460 secs (user 2.900+0.000 sys 0.550+0.000)
>   time: real 3.330 secs (user 2.850+0.000 sys 0.490+0.000)
>   time: real 3.320 secs (user 2.880+0.000 sys 0.440+0.000)
> 
> So there's no significant performance degradation when stdout isn't a 
> tty.
> 
> diff --git a/mercurial/ui.py b/mercurial/ui.py
> --- a/mercurial/ui.py
> +++ b/mercurial/ui.py
> @@ -563,6 +563,8 @@ class ui(object):
>              )
>              self.tracebackflag = self.configbool(b'ui', b'traceback')
>              self.logblockedtimes = self.configbool(b'ui', 
> b'logblockedtimes')
> +            # clear cache which depends on ui.nontty
> +            self.__dict__.pop('_fout_needs_flush_per_line', None)
> 
>          if section in (None, b'trusted'):
>              # update trust information
> @@ -1026,6 +1028,7 @@ class ui(object):
>      def fout(self, f):
>          self._fout = f
>          self._fmsgout, self._fmsgerr = _selectmsgdests(self)
> +        self.__dict__.pop('_fout_needs_flush_per_line', None)
> 
>      @property
>      def ferr(self):
> @@ -1153,6 +1156,9 @@ class ui(object):
>                      label = opts.get('label', b'')
>                      msg = self.label(msg, label)
>                  dest.write(msg)
> +            # bytes stream can't be line-buffered on Python 3.
> +            if self._fout_needs_flush_per_line and 
> msg.endswith(b'\n'):
> +                dest.flush()
>          except IOError as err:
>              raise error.StdioError(err)
>          finally:
> @@ -1202,6 +1208,13 @@ class ui(object):
>              # including stdout.
>              if dest is self._ferr and not getattr(self._ferr, 
> 'closed', False):
>                  dest.flush()
> +            # bytes stream can't be line-buffered on Python 3.
> +            elif (
> +                dest is self._fout
> +                and self._fout_needs_flush_per_line
> +                and msg.endswith(b'\n')
> +            ):
> +                dest.flush()
>          except IOError as err:
>              if dest is self._ferr and err.errno in (
>                  errno.EPIPE,
> @@ -1242,6 +1255,12 @@ class ui(object):
>                  util.timer() - starttime
>              ) * 1000
> 
> +    @util.propertycache
> +    def _fout_needs_flush_per_line(self):
> +        """True if fout should be line-buffered and needs explicit 
> flush() to
> +        behave as such"""
> +        return pycompat.ispy3 and (self.pageractive or
> self._isatty(self._fout))

This implements a behavior that might differ from that on Python 2. On 
Python 2 on non-Windows systems, procutil opens stdout in line-buffered 
mode if stdout is a TTY, no matter whether ui.nontty is set or not. I’m 
not saying that this is the desirable behavior, but I’d like to have 
consistent behavior across Python versions.

The fact that we have to check `self.pageractive` shows that it is risky 
to compute `self._fout_needs_flush_per_line` lazily, as TTY-ness of a FD 
can change over time but for the purposes of checking whether the output 
should line-buffered, we have to check the TTY-ness at specific times. 
The case when `self.pageractive` is True might be the only case where 
the TTY-ness of the FD is off, but I think we should avoid that we have 
to exhaustively enumerate all these cases.

In procutil, we already have a place where we know whether stdout should 
be line-buffered and whether the line-buffering is not provided by the 
stream object (although the current code is buggy). We should avoid 
duplicating that logic. In general, we can assume that places that 
create stream objects know if line-buffering should be used but is not 
supported (e.g. open() nowadays prints a warning).

The question is how to communicate the information that the stream 
should be flushed per line into the ui object. One way would be to set a 
`_needs_flush_per_line` attribute on the stream.

An alternative implementation approach would be to push the flushing out 
of ui by creating a wrapper around binary streams in procutil that 
implements the line buffering. It has the advantage that other users of 
procutil.stdout would also automatically flush correctly. This will have 
a small overhead, but it will likely end up in the noise and it doesn’t 
fundamentally change performance characteristics like unbuffered does. 
The code for the non-TTY case could stay untouched with this approach.

I can implement and benchmark these solutions, but I’m not sure I’ll 
have the time later today.

> +
>      def _isatty(self, fh):
>          if self.configbool(b'ui', b'nontty'):
>              return False
> @@ -1260,6 +1279,8 @@ class ui(object):
>              return self._fin, self._fout
>          fin, fout = procutil.protectstdio(self._fin, self._fout)
>          self._finoutredirected = (fin, fout) != (self._fin, 
> self._fout)
> +        # clear cache which depends on isatty(fout)
> +        self.__dict__.pop('_fout_needs_flush_per_line', None)
>          return fin, fout
> 
>      def restorefinout(self, fin, fout):
> @@ -1270,6 +1291,8 @@ class ui(object):
>          # protectfinout() won't create more than one duplicated 
> streams,
>          # so we can just turn the redirection flag off.
>          self._finoutredirected = False
> +        # clear cache which depends on isatty(fout)
> +        self.__dict__.pop('_fout_needs_flush_per_line', None)
> 
>      @contextlib.contextmanager
>      def protectedfinout(self):
> @@ -1344,6 +1367,8 @@ class ui(object):
>              # auto-detection of things being formatted.
>              self.setconfig(b'ui', b'formatted', wasformatted, 
> b'pager')
>              self.setconfig(b'ui', b'interactive', False, b'pager')
> +            # clear cache which depends on pageractive
> +            self.__dict__.pop('_fout_needs_flush_per_line', None)
> 
>              # If pagermode differs from color.mode, reconfigure color 
> now that
>              # pageractive is set.
> diff --git a/mercurial/utils/procutil.py b/mercurial/utils/procutil.py
> --- a/mercurial/utils/procutil.py
> +++ b/mercurial/utils/procutil.py
> @@ -57,8 +57,7 @@ if isatty(stdout):
>          # Windows doesn't support line buffering
>          stdout = os.fdopen(stdout.fileno(), 'wb', 0)
>      elif not pycompat.ispy3:
> -        # on Python 3, stdout (sys.stdout.buffer) is already line 
> buffered and
> -        # buffering=1 is not handled in binary mode
> +        # on Python 3, there's no way to make bytes stream 
> line-buffered
>          stdout = os.fdopen(stdout.fileno(), 'wb', 1)
> 
>  if pycompat.iswindows:
Yuya Nishihara - June 8, 2020, 12:11 p.m.
On Sun, 7 Jun 2020 22:09:58 +0200, Joerg Sonnenberger wrote:
> On Sun, Jun 07, 2020 at 08:55:03PM +0900, Yuya Nishihara wrote:
> > With my testing, inserting flush() automatically showed a better result than
> > using unbuffered io, so this patch implements (b). Since ui._isatty() call
> > isn't cheap, the result is cached. I also made ui.write() not scan the
> > entire string because it's unlikely that a string including '\n' doesn't
> > end with '\n'.
> 
> I'm not sure about the last part. Checking for \n first at the end makes
> a lot of sense to me, but I don't think we should micro-optimize away
> the scan here. Does it really make that much a difference?

I submitted this patch without testing that, but actually msg.rfind(b'\n')
is slightly slower than msg.endswith(b'\n').

% HGRCPATH=/dev/null python3 ./hg --cwd ~/mirrors/mozilla-central \
--time --pager no status --all

msg.endswith(b'\n')
time: real 12.930 secs (user 11.580+0.000 sys 1.350+0.000)
time: real 12.800 secs (user 11.450+0.000 sys 1.350+0.000)
time: real 12.790 secs (user 11.360+0.000 sys 1.420+0.000)

msg.rfind(b'\n')
time: real 13.470 secs (user 11.790+0.000 sys 1.650+0.000)
time: real 14.520 secs (user 12.550+0.000 sys 1.760+0.000)
time: real 13.440 secs (user 11.640+0.000 sys 1.770+0.000)

And dest.flush() will have to be inserted inbetween write()s.

p = msg.rfind(b'\n')
dest.write(msg[:p + 1])
dest.flush()
dest.write(msg[p + 1:])

Since ui isn't an IO object, I think checking the common case should be
good.
Yuya Nishihara - June 8, 2020, 12:13 p.m.
On Mon, 08 Jun 2020 08:16:13 +0200, Manuel Jacob wrote:
> On 2020-06-07 13:55, Yuya Nishihara wrote:
> > # HG changeset patch
> > # User Yuya Nishihara <yuya@tcha.org>
> > # Date 1591518045 -32400
> > #      Sun Jun 07 17:20:45 2020 +0900
> > # Node ID ee3cd52b4b338e05b0a22cb87580d78f1ad7763d
> > # Parent  03ba7de6a8b9ab339a6dafab969e890228b4c2f1
> > py3: make ui.write() call fout.flush() per line if fout is a tty or 
> > pager
> > 
> > Since sys.stdout.buffer of Python 3 can't be configured as a 
> > line-buffered
> > stream, we need a workaround:
> > 
> >  a. make callers do flush() where appropriate
> >  b. make ui insert flush() automatically
> >  c. use unbuffered stream
> > 
> > With my testing, inserting flush() automatically showed a better result 
> > than
> > using unbuffered io, so this patch implements (b). Since ui._isatty() 
> > call
> > isn't cheap, the result is cached. I also made ui.write() not scan the
> > entire string because it's unlikely that a string including '\n' 
> > doesn't
> > end with '\n'.
> > 
> > Maybe we can backport a subset of this patch to stable to mitigate the
> > ui.status() issue.
> 
> All in all, I agree that inserting a flush() per line automatically is 
> the best approach for now, as it harmonizes the behavior towards what we 
> so far considered to be "the correct behavior" (line-buffered if the 
> stdout is a TTY, which works on Python 2 on UNIX).
> 
> So far, the implicit contract of ui.write() included that the stream is 
> flushed if the string includes a newline. If we backport the patch to 
> stable, it should definitively not include the optimization because it 
> breaks that contract. In default, we can have the optimization after we 
> ensured that all callers continue to work with the new contract. Maybe 
> it’s better to completely move the optimization (and the discussion) to 
> a follow-up patch.

From my experience of making hg slow due to ui.write() change, I would
say it's unacceptable to implement unoptimized ui.write(). For stable,
I think fixing ui.status()/ui.note()/ui.debug() should be enough.

> > +    @util.propertycache
> > +    def _fout_needs_flush_per_line(self):
> > +        """True if fout should be line-buffered and needs explicit 
> > flush() to
> > +        behave as such"""
> > +        return pycompat.ispy3 and (self.pageractive or
> > self._isatty(self._fout))
> 
> This implements a behavior that might differ from that on Python 2. On 
> Python 2 on non-Windows systems, procutil opens stdout in line-buffered 
> mode if stdout is a TTY, no matter whether ui.nontty is set or not. I’m 
> not saying that this is the desirable behavior, but I’d like to have 
> consistent behavior across Python versions.

I don't think that matters. ui.nontty is an internal flag to disable
pager, readline, etc. I prefer making ui code use ui._isatty() consistently.

> The fact that we have to check `self.pageractive` shows that it is risky 
> to compute `self._fout_needs_flush_per_line` lazily,

Good point.

> The question is how to communicate the information that the stream 
> should be flushed per line into the ui object. One way would be to set a 
> `_needs_flush_per_line` attribute on the stream.
>
> An alternative implementation approach would be to push the flushing out 
> of ui by creating a wrapper around binary streams in procutil that 
> implements the line buffering. It has the advantage that other users of 
> procutil.stdout would also automatically flush correctly. This will have 
> a small overhead, but it will likely end up in the noise and it doesn’t 
> fundamentally change performance characteristics like unbuffered does.
> The code for the non-TTY case could stay untouched with this approach.

I generally avoid setting arbitrary property to random object, and I
don't wanna reimplement full-blown line-buffered bytes io.

> I can implement and benchmark these solutions, but I’m not sure I’ll
> have the time later today.

Thanks.
Yuya Nishihara - June 8, 2020, 12:52 p.m.
On Sun, 07 Jun 2020 20:55:03 +0900, Yuya Nishihara wrote:
>   % HGRCPATH=/dev/null python3 ./hg --cwd ~/mirrors/mozilla-central \
>   --time --pager no status --all

Maybe I should use pager to redirect output to /dev/null while keeping
stdout line-buffered.

  % HGRCPATH=/dev/null python3 ./hg --cwd ~/mirrors/mozilla-central \
  --time --pager yes --config pager.pager='cat > /dev/null' status --all
Joerg Sonnenberger - June 8, 2020, 12:59 p.m.
On Mon, Jun 08, 2020 at 09:11:48PM +0900, Yuya Nishihara wrote:
> On Sun, 7 Jun 2020 22:09:58 +0200, Joerg Sonnenberger wrote:
> > On Sun, Jun 07, 2020 at 08:55:03PM +0900, Yuya Nishihara wrote:
> > > With my testing, inserting flush() automatically showed a better result than
> > > using unbuffered io, so this patch implements (b). Since ui._isatty() call
> > > isn't cheap, the result is cached. I also made ui.write() not scan the
> > > entire string because it's unlikely that a string including '\n' doesn't
> > > end with '\n'.
> > 
> > I'm not sure about the last part. Checking for \n first at the end makes
> > a lot of sense to me, but I don't think we should micro-optimize away
> > the scan here. Does it really make that much a difference?
> 
> I submitted this patch without testing that, but actually msg.rfind(b'\n')
> is slightly slower than msg.endswith(b'\n').
> 
> % HGRCPATH=/dev/null python3 ./hg --cwd ~/mirrors/mozilla-central \
> --time --pager no status --all
> 
> msg.endswith(b'\n')
> time: real 12.930 secs (user 11.580+0.000 sys 1.350+0.000)
> time: real 12.800 secs (user 11.450+0.000 sys 1.350+0.000)
> time: real 12.790 secs (user 11.360+0.000 sys 1.420+0.000)
> 
> msg.rfind(b'\n')
> time: real 13.470 secs (user 11.790+0.000 sys 1.650+0.000)
> time: real 14.520 secs (user 12.550+0.000 sys 1.760+0.000)
> time: real 13.440 secs (user 11.640+0.000 sys 1.770+0.000)
> 
> And dest.flush() will have to be inserted inbetween write()s.
> 
> p = msg.rfind(b'\n')
> dest.write(msg[:p + 1])
> dest.flush()
> dest.write(msg[p + 1:])
> 
> Since ui isn't an IO object, I think checking the common case should be
> good.

That's not what I meant. I was thinking along the lines of:

    dest.write(msg)
    if msg.endswith(b'\n') or b'\n' in msg:
        dest.flush()

I don't see a reason for flushing in the middle as long as we flush at
all.

Joerg
Yuya Nishihara - June 8, 2020, 1:35 p.m.
On Mon, 8 Jun 2020 14:59:40 +0200, Joerg Sonnenberger wrote:
> On Mon, Jun 08, 2020 at 09:11:48PM +0900, Yuya Nishihara wrote:
> > On Sun, 7 Jun 2020 22:09:58 +0200, Joerg Sonnenberger wrote:
> > > On Sun, Jun 07, 2020 at 08:55:03PM +0900, Yuya Nishihara wrote:
> > > > With my testing, inserting flush() automatically showed a better result than
> > > > using unbuffered io, so this patch implements (b). Since ui._isatty() call
> > > > isn't cheap, the result is cached. I also made ui.write() not scan the
> > > > entire string because it's unlikely that a string including '\n' doesn't
> > > > end with '\n'.
> > > 
> > > I'm not sure about the last part. Checking for \n first at the end makes
> > > a lot of sense to me, but I don't think we should micro-optimize away
> > > the scan here. Does it really make that much a difference?
> > 
> > I submitted this patch without testing that, but actually msg.rfind(b'\n')
> > is slightly slower than msg.endswith(b'\n').
> > 
> > % HGRCPATH=/dev/null python3 ./hg --cwd ~/mirrors/mozilla-central \
> > --time --pager no status --all
> > 
> > msg.endswith(b'\n')
> > time: real 12.930 secs (user 11.580+0.000 sys 1.350+0.000)
> > time: real 12.800 secs (user 11.450+0.000 sys 1.350+0.000)
> > time: real 12.790 secs (user 11.360+0.000 sys 1.420+0.000)
> > 
> > msg.rfind(b'\n')
> > time: real 13.470 secs (user 11.790+0.000 sys 1.650+0.000)
> > time: real 14.520 secs (user 12.550+0.000 sys 1.760+0.000)
> > time: real 13.440 secs (user 11.640+0.000 sys 1.770+0.000)
> > 
> > And dest.flush() will have to be inserted inbetween write()s.
> > 
> > p = msg.rfind(b'\n')
> > dest.write(msg[:p + 1])
> > dest.flush()
> > dest.write(msg[p + 1:])
> > 
> > Since ui isn't an IO object, I think checking the common case should be
> > good.
> 
> That's not what I meant. I was thinking along the lines of:
> 
>     dest.write(msg)
>     if msg.endswith(b'\n') or b'\n' in msg:
>         dest.flush()
> 
> I don't see a reason for flushing in the middle as long as we flush at
> all.

Well, that's the line-buffered stdio behavior.

  % python -c 'import sys, time; sys.stdout.write("foo\nbar"); time.sleep(1)'
Augie Fackler - June 9, 2020, 9:30 p.m.
On Sun, Jun 07, 2020 at 08:55:03PM +0900, Yuya Nishihara wrote:
> # HG changeset patch
> # User Yuya Nishihara <yuya@tcha.org>
> # Date 1591518045 -32400
> #      Sun Jun 07 17:20:45 2020 +0900
> # Node ID ee3cd52b4b338e05b0a22cb87580d78f1ad7763d
> # Parent  03ba7de6a8b9ab339a6dafab969e890228b4c2f1
> py3: make ui.write() call fout.flush() per line if fout is a tty or pager

I'm also fine with this, but I think I prefer mjacob's solution.

Patch

diff --git a/mercurial/ui.py b/mercurial/ui.py
--- a/mercurial/ui.py
+++ b/mercurial/ui.py
@@ -563,6 +563,8 @@  class ui(object):
             )
             self.tracebackflag = self.configbool(b'ui', b'traceback')
             self.logblockedtimes = self.configbool(b'ui', b'logblockedtimes')
+            # clear cache which depends on ui.nontty
+            self.__dict__.pop('_fout_needs_flush_per_line', None)
 
         if section in (None, b'trusted'):
             # update trust information
@@ -1026,6 +1028,7 @@  class ui(object):
     def fout(self, f):
         self._fout = f
         self._fmsgout, self._fmsgerr = _selectmsgdests(self)
+        self.__dict__.pop('_fout_needs_flush_per_line', None)
 
     @property
     def ferr(self):
@@ -1153,6 +1156,9 @@  class ui(object):
                     label = opts.get('label', b'')
                     msg = self.label(msg, label)
                 dest.write(msg)
+            # bytes stream can't be line-buffered on Python 3.
+            if self._fout_needs_flush_per_line and msg.endswith(b'\n'):
+                dest.flush()
         except IOError as err:
             raise error.StdioError(err)
         finally:
@@ -1202,6 +1208,13 @@  class ui(object):
             # including stdout.
             if dest is self._ferr and not getattr(self._ferr, 'closed', False):
                 dest.flush()
+            # bytes stream can't be line-buffered on Python 3.
+            elif (
+                dest is self._fout
+                and self._fout_needs_flush_per_line
+                and msg.endswith(b'\n')
+            ):
+                dest.flush()
         except IOError as err:
             if dest is self._ferr and err.errno in (
                 errno.EPIPE,
@@ -1242,6 +1255,12 @@  class ui(object):
                 util.timer() - starttime
             ) * 1000
 
+    @util.propertycache
+    def _fout_needs_flush_per_line(self):
+        """True if fout should be line-buffered and needs explicit flush() to
+        behave as such"""
+        return pycompat.ispy3 and (self.pageractive or self._isatty(self._fout))
+
     def _isatty(self, fh):
         if self.configbool(b'ui', b'nontty'):
             return False
@@ -1260,6 +1279,8 @@  class ui(object):
             return self._fin, self._fout
         fin, fout = procutil.protectstdio(self._fin, self._fout)
         self._finoutredirected = (fin, fout) != (self._fin, self._fout)
+        # clear cache which depends on isatty(fout)
+        self.__dict__.pop('_fout_needs_flush_per_line', None)
         return fin, fout
 
     def restorefinout(self, fin, fout):
@@ -1270,6 +1291,8 @@  class ui(object):
         # protectfinout() won't create more than one duplicated streams,
         # so we can just turn the redirection flag off.
         self._finoutredirected = False
+        # clear cache which depends on isatty(fout)
+        self.__dict__.pop('_fout_needs_flush_per_line', None)
 
     @contextlib.contextmanager
     def protectedfinout(self):
@@ -1344,6 +1367,8 @@  class ui(object):
             # auto-detection of things being formatted.
             self.setconfig(b'ui', b'formatted', wasformatted, b'pager')
             self.setconfig(b'ui', b'interactive', False, b'pager')
+            # clear cache which depends on pageractive
+            self.__dict__.pop('_fout_needs_flush_per_line', None)
 
             # If pagermode differs from color.mode, reconfigure color now that
             # pageractive is set.
diff --git a/mercurial/utils/procutil.py b/mercurial/utils/procutil.py
--- a/mercurial/utils/procutil.py
+++ b/mercurial/utils/procutil.py
@@ -57,8 +57,7 @@  if isatty(stdout):
         # Windows doesn't support line buffering
         stdout = os.fdopen(stdout.fileno(), 'wb', 0)
     elif not pycompat.ispy3:
-        # on Python 3, stdout (sys.stdout.buffer) is already line buffered and
-        # buffering=1 is not handled in binary mode
+        # on Python 3, there's no way to make bytes stream line-buffered
         stdout = os.fdopen(stdout.fileno(), 'wb', 1)
 
 if pycompat.iswindows: