Patchwork [2,of,3,side-word,(+4)] sshpeer: use a 'bufferedinputpipe' for standard output of the ssh process

login
register
mail settings
Submitter Pierre-Yves David
Date May 31, 2015, 7:22 a.m.
Message ID <72344e96acd26c68121d.1433056953@marginatus.alto.octopoid.net>
Download mbox | patch
Permalink /patch/9408/
State Accepted
Commit e461230cc95b29489f6dcdb5739d4434cf9a8a00
Headers show

Comments

Pierre-Yves David - May 31, 2015, 7:22 a.m.
# HG changeset patch
# User Pierre-Yves David <pierre-yves.david@fb.com>
# Date 1433055636 25200
#      Sun May 31 00:00:36 2015 -0700
# Node ID 72344e96acd26c68121d747aeb77fb8e1f9a75e8
# Parent  1a883412355af1e057324b4ab2de885473f54432
sshpeer: use a 'bufferedinputpipe' for standard output of the ssh process

We need this pipe to still be buffered when will switch to unbuffered pipe.
(switch motivated by the need of using polling to restore real time output from
ssh server). This is the only pipe that needs to be wrapped because this is the
one who do extensive usage of 'readline'. The stderr pipe of the process is
alway read in non blocking raw chunk, so it won't benefit from the
buffering.
Yuya Nishihara - May 31, 2015, 2:10 p.m.
On Sun, 31 May 2015 00:22:33 -0700, Pierre-Yves David wrote:
> # HG changeset patch
> # User Pierre-Yves David <pierre-yves.david@fb.com>
> # Date 1433055636 25200
> #      Sun May 31 00:00:36 2015 -0700
> # Node ID 72344e96acd26c68121d747aeb77fb8e1f9a75e8
> # Parent  1a883412355af1e057324b4ab2de885473f54432
> sshpeer: use a 'bufferedinputpipe' for standard output of the ssh process
> 
> We need this pipe to still be buffered when will switch to unbuffered pipe.
> (switch motivated by the need of using polling to restore real time output from
> ssh server). This is the only pipe that needs to be wrapped because this is the
> one who do extensive usage of 'readline'. The stderr pipe of the process is
> alway read in non blocking raw chunk, so it won't benefit from the
> buffering.
> 
> diff --git a/mercurial/sshpeer.py b/mercurial/sshpeer.py
> --- a/mercurial/sshpeer.py
> +++ b/mercurial/sshpeer.py
> @@ -87,10 +87,12 @@ class sshpeer(wireproto.wirepeer):
>  
>          # while self.subprocess isn't used, having it allows the subprocess to
>          # to clean up correctly later
>          self.pipeo, self.pipei, self.pipee, self.subprocess = util.popen4(cmd)
>  
> +        self.pipei = util.bufferedinputpipe(self.pipei)

Can't we do dup() + fdopen() or PyFile_SetBufSize() to make only pipei buffered?

I don't understand what's going to happen with select(), so maybe I miss the
point of new "hasbuffer" property.

Regards,
Pierre-Yves David - May 31, 2015, 7:51 p.m.
On 05/31/2015 07:10 AM, Yuya Nishihara wrote:
> On Sun, 31 May 2015 00:22:33 -0700, Pierre-Yves David wrote:
>> # HG changeset patch
>> # User Pierre-Yves David <pierre-yves.david@fb.com>
>> # Date 1433055636 25200
>> #      Sun May 31 00:00:36 2015 -0700
>> # Node ID 72344e96acd26c68121d747aeb77fb8e1f9a75e8
>> # Parent  1a883412355af1e057324b4ab2de885473f54432
>> sshpeer: use a 'bufferedinputpipe' for standard output of the ssh process
>>
>> We need this pipe to still be buffered when will switch to unbuffered pipe.
>> (switch motivated by the need of using polling to restore real time output from
>> ssh server). This is the only pipe that needs to be wrapped because this is the
>> one who do extensive usage of 'readline'. The stderr pipe of the process is
>> alway read in non blocking raw chunk, so it won't benefit from the
>> buffering.
>>
>> diff --git a/mercurial/sshpeer.py b/mercurial/sshpeer.py
>> --- a/mercurial/sshpeer.py
>> +++ b/mercurial/sshpeer.py
>> @@ -87,10 +87,12 @@ class sshpeer(wireproto.wirepeer):
>>
>>           # while self.subprocess isn't used, having it allows the subprocess to
>>           # to clean up correctly later
>>           self.pipeo, self.pipei, self.pipee, self.subprocess = util.popen4(cmd)
>>
>> +        self.pipei = util.bufferedinputpipe(self.pipei)
>
> Can't we do dup() + fdopen() or PyFile_SetBufSize() to make only pipei buffered?
>
> I don't understand what's going to happen with select(), so maybe I miss the
> point of new "hasbuffer" property.

Not we cannot, we need to have access to the buffer state to do our 
polling IO.

The polling IO are used to wait on both stdout and stderr. This allow to 
detect and process server output (stderr) while waiting on protocol data 
(stdout)

The issue with the buffer+select is that we want to be able to perform 
such operation:

  0) ssh: write "datadata"
  1) hg:  select() -> hey! data to read
  2) hg:  read(4) -> "data"
  3) hg:  select() -> hey! data to read
  4) hg:  read(4) -> "data"

If you add the buffering, the 2) will actually read all available data 
in the buffer the second half. (3) will not see any data to read and 
block for ever (4) will never run (5) the internet kitten will never 
have his bowl filled.

You can see the select usage here: 
http://42.netv6.net/marmoute-wip/mercurial/rev/b9213af62475
Yuya Nishihara - May 31, 2015, 11:01 p.m.
On Sun, 31 May 2015 12:51:32 -0700, Pierre-Yves David wrote:
> On 05/31/2015 07:10 AM, Yuya Nishihara wrote:
> > On Sun, 31 May 2015 00:22:33 -0700, Pierre-Yves David wrote:
> >> # HG changeset patch
> >> # User Pierre-Yves David <pierre-yves.david@fb.com>
> >> # Date 1433055636 25200
> >> #      Sun May 31 00:00:36 2015 -0700
> >> # Node ID 72344e96acd26c68121d747aeb77fb8e1f9a75e8
> >> # Parent  1a883412355af1e057324b4ab2de885473f54432
> >> sshpeer: use a 'bufferedinputpipe' for standard output of the ssh process
> >>
> >> We need this pipe to still be buffered when will switch to unbuffered pipe.
> >> (switch motivated by the need of using polling to restore real time output from
> >> ssh server). This is the only pipe that needs to be wrapped because this is the
> >> one who do extensive usage of 'readline'. The stderr pipe of the process is
> >> alway read in non blocking raw chunk, so it won't benefit from the
> >> buffering.
> >>
> >> diff --git a/mercurial/sshpeer.py b/mercurial/sshpeer.py
> >> --- a/mercurial/sshpeer.py
> >> +++ b/mercurial/sshpeer.py
> >> @@ -87,10 +87,12 @@ class sshpeer(wireproto.wirepeer):
> >>
> >>           # while self.subprocess isn't used, having it allows the subprocess to
> >>           # to clean up correctly later
> >>           self.pipeo, self.pipei, self.pipee, self.subprocess = util.popen4(cmd)
> >>
> >> +        self.pipei = util.bufferedinputpipe(self.pipei)
> >
> > Can't we do dup() + fdopen() or PyFile_SetBufSize() to make only pipei buffered?
> >
> > I don't understand what's going to happen with select(), so maybe I miss the
> > point of new "hasbuffer" property.
> 
> Not we cannot, we need to have access to the buffer state to do our 
> polling IO.
> 
> The polling IO are used to wait on both stdout and stderr. This allow to 
> detect and process server output (stderr) while waiting on protocol data 
> (stdout)
> 
> The issue with the buffer+select is that we want to be able to perform 
> such operation:
> 
>   0) ssh: write "datadata"
>   1) hg:  select() -> hey! data to read
>   2) hg:  read(4) -> "data"
>   3) hg:  select() -> hey! data to read
>   4) hg:  read(4) -> "data"
> 
> If you add the buffering, the 2) will actually read all available data 
> in the buffer the second half. (3) will not see any data to read and 
> block for ever (4) will never run (5) the internet kitten will never 
> have his bowl filled.
> 
> You can see the select usage here: 
> http://42.netv6.net/marmoute-wip/mercurial/rev/b9213af62475

Thanks, I got it. We want to provide synchronous file-like interface for the
main stream, but still want to do asynchronous read for err stream as possible.
Yuya Nishihara - June 1, 2015, 3:16 p.m.
On Mon, 1 Jun 2015 08:01:55 +0900, Yuya Nishihara wrote:
> On Sun, 31 May 2015 12:51:32 -0700, Pierre-Yves David wrote:
> > On 05/31/2015 07:10 AM, Yuya Nishihara wrote:
> > > On Sun, 31 May 2015 00:22:33 -0700, Pierre-Yves David wrote:
> > The issue with the buffer+select is that we want to be able to perform 
> > such operation:
> > 
> >   0) ssh: write "datadata"
> >   1) hg:  select() -> hey! data to read
> >   2) hg:  read(4) -> "data"
> >   3) hg:  select() -> hey! data to read
> >   4) hg:  read(4) -> "data"
> > 
> > If you add the buffering, the 2) will actually read all available data 
> > in the buffer the second half. (3) will not see any data to read and 
> > block for ever (4) will never run (5) the internet kitten will never 
> > have his bowl filled.
> > 
> > You can see the select usage here:
> > http://42.netv6.net/marmoute-wip/mercurial/rev/b9213af62475

Just an idea: because doublepipe requires main channel is a bufferedinputpipe,
we won't need two file-like classes. They could be simpler if

 - bufferedinputpipe is a fifo of string, like deque, not file-like
 - and doublepipe is responsible for all read operations.

For example,

    class doublepipe(object):
        def _wait(self):
            if self._mainfifo:
                return (True, True)
            select()
            ...

        def _call(self, methname, ...):
            ...
            while True:
                mainready, sideready = self._wait()
                ...
                if len(self._mainfifo) >= size:  # methname == 'read'
                    return self._mainfifo.pop(size)
                if mainready:
                    s = os.read(self._main.fileno(), chunksize)
                    if not s:
                        return self._maininfo.pop()  # EOF
                    self._mainfifo.push(s)

This will slightly complicate doublepipe, so I'm not sure which is better.

Regards,
Pierre-Yves David - June 1, 2015, 4:12 p.m.
On 06/01/2015 08:16 AM, Yuya Nishihara wrote:
> On Mon, 1 Jun 2015 08:01:55 +0900, Yuya Nishihara wrote:
>> On Sun, 31 May 2015 12:51:32 -0700, Pierre-Yves David wrote:
>>> On 05/31/2015 07:10 AM, Yuya Nishihara wrote:
>>>> On Sun, 31 May 2015 00:22:33 -0700, Pierre-Yves David wrote:
>>> The issue with the buffer+select is that we want to be able to perform
>>> such operation:
>>>
>>>    0) ssh: write "datadata"
>>>    1) hg:  select() -> hey! data to read
>>>    2) hg:  read(4) -> "data"
>>>    3) hg:  select() -> hey! data to read
>>>    4) hg:  read(4) -> "data"
>>>
>>> If you add the buffering, the 2) will actually read all available data
>>> in the buffer the second half. (3) will not see any data to read and
>>> block for ever (4) will never run (5) the internet kitten will never
>>> have his bowl filled.
>>>
>>> You can see the select usage here:
>>> http://42.netv6.net/marmoute-wip/mercurial/rev/b9213af62475
>
> Just an idea: because doublepipe requires main channel is a bufferedinputpipe,
> we won't need two file-like classes. They could be simpler if
>
>   - bufferedinputpipe is a fifo of string, like deque, not file-like
>   - and doublepipe is responsible for all read operations.
>
> For example,
>
>      class doublepipe(object):
>          def _wait(self):
>              if self._mainfifo:
>                  return (True, True)
>              select()
>              ...
>
>          def _call(self, methname, ...):
>              ...
>              while True:
>                  mainready, sideready = self._wait()
>                  ...
>                  if len(self._mainfifo) >= size:  # methname == 'read'
>                      return self._mainfifo.pop(size)
>                  if mainready:
>                      s = os.read(self._main.fileno(), chunksize)
>                      if not s:
>                          return self._maininfo.pop()  # EOF
>                      self._mainfifo.push(s)
>
> This will slightly complicate doublepipe, so I'm not sure which is better.

The original version was doing all in one. However, the buffer version 
using the 'os' module is isolated into the 'util' module (we avoid any 
'os' usage in any top module) and the douple pipe which enforce the ssh 
specific logic of "forward stderr as 'remote:' output" live in the ssh 
specific 'sshpeer' module.

Do you want me to update the commit message?

(note: at some point we will replace all this with thread).
Augie Fackler - June 1, 2015, 11:57 p.m.
On Mon, Jun 01, 2015 at 09:12:31AM -0700, Pierre-Yves David wrote:
>
>
> On 06/01/2015 08:16 AM, Yuya Nishihara wrote:
> >On Mon, 1 Jun 2015 08:01:55 +0900, Yuya Nishihara wrote:
> >This will slightly complicate doublepipe, so I'm not sure which is better.
>
> The original version was doing all in one. However, the buffer version using
> the 'os' module is isolated into the 'util' module (we avoid any 'os' usage
> in any top module) and the douple pipe which enforce the ssh specific logic
> of "forward stderr as 'remote:' output" live in the ssh specific 'sshpeer'
> module.
>
> Do you want me to update the commit message?

I think that might help future code archaeologists, so let's go for yes?

At this point we've got enough going on I'm having trouble keeping up
with all the churn at the protocol-ish layers, so I'm sure in two
years we'll be grateful for overly-verbose commit messages.

>
> (note: at some point we will replace all this with thread).

Would we be better off to just code that up now? It might be
significantly simpler...

>
> --
> Pierre-Yves David
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> https://selenic.com/mailman/listinfo/mercurial-devel
Pierre-Yves David - June 2, 2015, 12:03 a.m.
On 06/01/2015 04:57 PM, Augie Fackler wrote:
> On Mon, Jun 01, 2015 at 09:12:31AM -0700, Pierre-Yves David wrote:
>>
>>
>> On 06/01/2015 08:16 AM, Yuya Nishihara wrote:
>>> On Mon, 1 Jun 2015 08:01:55 +0900, Yuya Nishihara wrote:
>>> This will slightly complicate doublepipe, so I'm not sure which is better.
>>
>> The original version was doing all in one. However, the buffer version using
>> the 'os' module is isolated into the 'util' module (we avoid any 'os' usage
>> in any top module) and the douple pipe which enforce the ssh specific logic
>> of "forward stderr as 'remote:' output" live in the ssh specific 'sshpeer'
>> module.
>>
>> Do you want me to update the commit message?
>
> I think that might help future code archaeologists, so let's go for yes?
>
> At this point we've got enough going on I'm having trouble keeping up
> with all the churn at the protocol-ish layers, so I'm sure in two
> years we'll be grateful for overly-verbose commit messages.

Do you want that at documentation for each class as a details in the 
description for the each commit introducting them ?

>> (note: at some point we will replace all this with thread).
>
> Would we be better off to just code that up now? It might be
> significantly simpler...

The current approach just replace on of the core object with something 
that behave exactly the same but also unsure the server output is 
consume in real time. This have ø impact on the rest of the code. The 
object itself is non trivial but the impact on the code is ultra simple. 
translating all that to threading have a much bigger impact and will be 
much more complexe. We'll eventually get to there for bundle2 but I 
would be happy to not be rabbit-holed with a massive refactor for a bug 
fix that is definitely not on my critical path.
Pierre-Yves David - June 2, 2015, 7:59 a.m.
On 06/01/2015 05:03 PM, Pierre-Yves David wrote:
>>> Do you want me to update the commit message?
>>
>> I think that might help future code archaeologists, so let's go for yes?
>>
>> At this point we've got enough going on I'm having trouble keeping up
>> with all the churn at the protocol-ish layers, so I'm sure in two
>> years we'll be grateful for overly-verbose commit messages.
>
> Do you want that at documentation for each class as a details in the
> description for the each commit introducting them ?

I've done a mix of both an pushed the result as d3b8690c4ecc on 
http://42.netv6.net/marmoute-wip/mercurial/
Yuya Nishihara - June 2, 2015, 2:11 p.m.
On Mon, 01 Jun 2015 17:03:01 -0700, Pierre-Yves David wrote:
> On 06/01/2015 04:57 PM, Augie Fackler wrote:
> > On Mon, Jun 01, 2015 at 09:12:31AM -0700, Pierre-Yves David wrote:
> >> On 06/01/2015 08:16 AM, Yuya Nishihara wrote:
> >>> On Mon, 1 Jun 2015 08:01:55 +0900, Yuya Nishihara wrote:
> >>> This will slightly complicate doublepipe, so I'm not sure which is better.
> >>
> >> The original version was doing all in one. However, the buffer version using
> >> the 'os' module is isolated into the 'util' module (we avoid any 'os' usage
> >> in any top module) and the douple pipe which enforce the ssh specific logic
> >> of "forward stderr as 'remote:' output" live in the ssh specific 'sshpeer'
> >> module.

I see.

> >> Do you want me to update the commit message?
> >
> > I think that might help future code archaeologists, so let's go for yes?
> >
> > At this point we've got enough going on I'm having trouble keeping up
> > with all the churn at the protocol-ish layers, so I'm sure in two
> > years we'll be grateful for overly-verbose commit messages.
> 
> Do you want that at documentation for each class as a details in the 
> description for the each commit introducting them ?
> 
> >> (note: at some point we will replace all this with thread).
> >
> > Would we be better off to just code that up now? It might be
> > significantly simpler...
> 
> The current approach just replace on of the core object with something 
> that behave exactly the same but also unsure the server output is 
> consume in real time. This have ø impact on the rest of the code. The 
> object itself is non trivial but the impact on the code is ultra simple. 
> translating all that to threading have a much bigger impact and will be 
> much more complexe. We'll eventually get to there for bundle2 but I 
> would be happy to not be rabbit-holed with a massive refactor for a bug 
> fix that is definitely not on my critical path.

Maybe I'm noob, I don't see the point why we want to move to threading at
some time. (except that we can't do select() on Windows)

doublepipe can wait for both stdout and stderr asynchronously if they are
controlled by single blocking loop, entered by doublepipe.read() or readline()
call. Doesn't bundle2 call these methods in a timely fashion?
Pierre-Yves David - June 2, 2015, 6:38 p.m.
On 06/02/2015 07:11 AM, Yuya Nishihara wrote:
> On Mon, 01 Jun 2015 17:03:01 -0700, Pierre-Yves David wrote:
>> On 06/01/2015 04:57 PM, Augie Fackler wrote:
>>> On Mon, Jun 01, 2015 at 09:12:31AM -0700, Pierre-Yves David wrote:
>>>> On 06/01/2015 08:16 AM, Yuya Nishihara wrote:
>>>>> On Mon, 1 Jun 2015 08:01:55 +0900, Yuya Nishihara wrote:
>>>>> This will slightly complicate doublepipe, so I'm not sure which is better.
>>>>
>>>> The original version was doing all in one. However, the buffer version using
>>>> the 'os' module is isolated into the 'util' module (we avoid any 'os' usage
>>>> in any top module) and the douple pipe which enforce the ssh specific logic
>>>> of "forward stderr as 'remote:' output" live in the ssh specific 'sshpeer'
>>>> module.
>
> I see.
>
>>>> Do you want me to update the commit message?
>>>
>>> I think that might help future code archaeologists, so let's go for yes?
>>>
>>> At this point we've got enough going on I'm having trouble keeping up
>>> with all the churn at the protocol-ish layers, so I'm sure in two
>>> years we'll be grateful for overly-verbose commit messages.
>>
>> Do you want that at documentation for each class as a details in the
>> description for the each commit introducting them ?
>>
>>>> (note: at some point we will replace all this with thread).
>>>
>>> Would we be better off to just code that up now? It might be
>>> significantly simpler...
>>
>> The current approach just replace on of the core object with something
>> that behave exactly the same but also unsure the server output is
>> consume in real time. This have ø impact on the rest of the code. The
>> object itself is non trivial but the impact on the code is ultra simple.
>> translating all that to threading have a much bigger impact and will be
>> much more complexe. We'll eventually get to there for bundle2 but I
>> would be happy to not be rabbit-holed with a massive refactor for a bug
>> fix that is definitely not on my critical path.
>
> Maybe I'm noob, I don't see the point why we want to move to threading at
> some time. (except that we can't do select() on Windows)

Moving to threading would allow use to stream a bundle out, while 
receiving a bundle in. (or even non bundle thing) allowing more powerful 
and faster interaction.

It would also requires less hack with pipe and buffer.
Yuya Nishihara - June 3, 2015, 12:37 p.m.
On Tue, 02 Jun 2015 11:38:08 -0700, Pierre-Yves David wrote:
> On 06/02/2015 07:11 AM, Yuya Nishihara wrote:
> > On Mon, 01 Jun 2015 17:03:01 -0700, Pierre-Yves David wrote:
> >> On 06/01/2015 04:57 PM, Augie Fackler wrote:
> >>> On Mon, Jun 01, 2015 at 09:12:31AM -0700, Pierre-Yves David wrote:
> >>>> (note: at some point we will replace all this with thread).
> >>>
> >>> Would we be better off to just code that up now? It might be
> >>> significantly simpler...
> >>
> >> The current approach just replace on of the core object with something
> >> that behave exactly the same but also unsure the server output is
> >> consume in real time. This have ø impact on the rest of the code. The
> >> object itself is non trivial but the impact on the code is ultra simple.
> >> translating all that to threading have a much bigger impact and will be
> >> much more complexe. We'll eventually get to there for bundle2 but I
> >> would be happy to not be rabbit-holed with a massive refactor for a bug
> >> fix that is definitely not on my critical path.
> >
> > Maybe I'm noob, I don't see the point why we want to move to threading at
> > some time. (except that we can't do select() on Windows)
> 
> Moving to threading would allow use to stream a bundle out, while
> receiving a bundle in. (or even non bundle thing) allowing more powerful
> and faster interaction.
> 
> It would also requires less hack with pipe and buffer.

Well, but it would require another sort of hack to process SIGINT while
sub thread is blocked by I/O wait.
Matt Mackall - June 3, 2015, 8:02 p.m.
On Tue, 2015-06-02 at 00:59 -0700, Pierre-Yves David wrote:
> 
> On 06/01/2015 05:03 PM, Pierre-Yves David wrote:
> >>> Do you want me to update the commit message?
> >>
> >> I think that might help future code archaeologists, so let's go for yes?
> >>
> >> At this point we've got enough going on I'm having trouble keeping up
> >> with all the churn at the protocol-ish layers, so I'm sure in two
> >> years we'll be grateful for overly-verbose commit messages.
> >
> > Do you want that at documentation for each class as a details in the
> > description for the each commit introducting them ?
> 
> I've done a mix of both an pushed the result as d3b8690c4ecc on 
> http://42.netv6.net/marmoute-wip/mercurial/

I've pulled this version, thanks.

Patch

diff --git a/mercurial/sshpeer.py b/mercurial/sshpeer.py
--- a/mercurial/sshpeer.py
+++ b/mercurial/sshpeer.py
@@ -87,10 +87,12 @@  class sshpeer(wireproto.wirepeer):
 
         # while self.subprocess isn't used, having it allows the subprocess to
         # to clean up correctly later
         self.pipeo, self.pipei, self.pipee, self.subprocess = util.popen4(cmd)
 
+        self.pipei = util.bufferedinputpipe(self.pipei)
+
         # skip any noise generated by remote shell
         self._callstream("hello")
         r = self._callstream("between", pairs=("%s-%s" % ("0"*40, "0"*40)))
         lines = ["", "dummy"]
         max_noise = 500