Patchwork [STABLE] changegroup: retry read() when partial data returned

Submitter Gregory Szorc
Date March 26, 2017, 6:56 p.m.
Message ID <414128ddd876eb33f70b.1490554619@ubuntu-vm-main>
Permalink /patch/19697/
State Changes Requested

Comments

Gregory Szorc - March 26, 2017, 6:56 p.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1490554582 25200
#      Sun Mar 26 11:56:22 2017 -0700
# Branch stable
# Node ID 414128ddd876eb33f70b0dd95d110e29a9308c93
# Parent  ed5b25874d998ababb181a939dd37a16ea644435
changegroup: retry read() when partial data returned

We've seen a rash of "stream ended unexpectedly" errors in the
wild. This occurs when changegroup.readexactly() fails to retrieve
the exact number of requested bytes in a single stream.read()
operation.

There are several plausible theories that explain this behavior.
Many involve underlying APIs not retrying automatically when
interrupted by a signal. While Python is supposed to do this,
it could be buggy. There could also be a bug in one of the
various stream reading layers between changegroup.readexactly()
and the underlying file descriptor.

Unfortunately, getting this error to reproduce has been extremely
difficult and no single cause has been identified.

Since networks (and even filesystems) are unreliable, it seems
wise to have some form of retry in place to mitigate this failure.
This patch adds that retry logic.

There is already basic test coverage that the exception in this
function is raised. The tests didn't change with this patch.

The only obvious negative impact of this change I can see is that
in cases where the read operation fails, there is some overhead
involved in retrying the operation. In the worst case, this is
pure overhead: we'd retry an operation and get the same outcome.
But in the best case, it avoids an intermittent/random failure.
My assumption is that the overhead of retrying is small and that
avoiding a random failure justifies it.

There are other changes I'd like to make to stream reading code
to help flush out this general class of partial stream read
failures. See issue4923 for some suggestions with regard to
swallowing exceptions and masking the underlying error. However,
those changes aren't suitable for the stable branch. This patch
feels minimally invasive and, if successful, would let us narrow
our focus for finding the underlying bug.
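For readers skimming the thread, the core of the change can be sketched as a standalone function. This is an illustration, not the patch itself: the real code raises error.Abort from Mercurial's error module, while the plain IOError and the DribblingStream test double below are substitutions invented for this sketch.

```python
import io

def readexactly(stream, n, retries=3):
    """Read exactly n bytes from stream, retrying a bounded number of
    times when read() returns fewer bytes than requested."""
    left = n
    chunk = stream.read(n)
    left -= len(chunk)
    assert left >= 0

    # Common case: one read satisfied the whole request.
    if not left:
        return chunk

    chunks = [chunk]
    # Retry for the remaining data. If the stream has errored or hit
    # EOF, this does a little redundant work, but it papers over
    # intermittent short reads.
    for _ in range(retries):
        chunk = stream.read(left)
        chunks.append(chunk)
        left -= len(chunk)
        assert left >= 0
        if not left:
            break

    if left:
        raise IOError("stream ended unexpectedly (got %d bytes, expected %d)"
                      % (n - left, n))
    return b''.join(chunks)

class DribblingStream:
    """Test double whose read() returns at most `step` bytes per call,
    mimicking a source that delivers partial data."""
    def __init__(self, data, step):
        self._buf = io.BytesIO(data)
        self._step = step

    def read(self, n):
        return self._buf.read(min(n, self._step))
```

With step=2 and n=8 the three retries are enough to recover the full payload; with step=1 they are not and the function aborts, matching the bounded-retry behavior of the patch.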
Jun Wu - March 26, 2017, 7:17 p.m.
The direction looks good. I'd like to see more solid code. Commented below.

Excerpts from Gregory Szorc's message of 2017-03-26 11:56:59 -0700:
> +    # Retry to get remaining data. In cases where the stream has errored or
> +    # is at EOF, this will do a bit of redundant work. But it helps prevent
> +    # intermittent failures and isn't common. So the overhead is acceptable.
> +    for i in range(3):

If "stream.read" does not behave as expected, assume it's totally broken.
So we need to reset the retry counter once we got data. Something like:

    def readexactly(..., retry=3):
        ....
        retryremaining = retry
        while True:
            chunk = stream.read(left)
            if not chunk:
                retryremaining -= 1
                if not retryremaining:
                    break
                continue
            retryremaining = retry
            left -= len(chunk)
            ....

> +        chunk = stream.read(left)
> +        chunks.append(chunk)
> +        left -= len(chunk)
> +        assert left >= 0
> +
> +        if not left:
> +            break
> +
> +    if left:
>          raise error.Abort(_("stream ended unexpectedly"
>                             " (got %d bytes, expected %d)")
> -                          % (len(s), n))
> -    return s
> +                          % (n - left, n))
> +
> +    return b''.join(chunks)
>  
>  def getchunk(stream):
>      """return the next chunk from stream as a string"""
Gregory Szorc - March 26, 2017, 9:26 p.m.
On Sun, Mar 26, 2017 at 12:17 PM, Jun Wu <quark@fb.com> wrote:

> The direction looks good. I'd like to see more solid code. Commented below.
>
> Excerpts from Gregory Szorc's message of 2017-03-26 11:56:59 -0700:
> > +    # Retry to get remaining data. In cases where the stream has errored or
> > +    # is at EOF, this will do a bit of redundant work. But it helps prevent
> > +    # intermittent failures and isn't common. So the overhead is acceptable.
> > +    for i in range(3):
>
> If "stream.read" does not behave as expected, assume it's totally broken.
> So we need to reset the retry counter once we got data. Something like:
>
>     def readexactly(..., retry=3):
>         ....
>         retryremaining = retry
>         while True:
>             chunk = stream.read(left)
>             if not chunk:
>                 retryremaining -= 1
>                 if not retryremaining:
>                     break
>                 continue
>             retryremaining = retry
>             left -= len(chunk)
>             ....
>

I agree with your reasoning here. However, I want to be conservative with
the patch since it is for stable. If a reviewer finds your approach
suitable for stable, I'll make the change. Otherwise, we can take the
conservative approach on stable and change to your version on default.

FWIW there are other things I may want to do on default, such as making the
retry count configurable and possibly adding some blackbox logging to help
us track down how this fails in the wild.

> > +        chunk = stream.read(left)
> > +        chunks.append(chunk)
> > +        left -= len(chunk)
> > +        assert left >= 0
> > +
> > +        if not left:
> > +            break
> > +
> > +    if left:
> >          raise error.Abort(_("stream ended unexpectedly"
> >                             " (got %d bytes, expected %d)")
> > -                          % (len(s), n))
> > -    return s
> > +                          % (n - left, n))
> > +
> > +    return b''.join(chunks)
> >
> >  def getchunk(stream):
> >      """return the next chunk from stream as a string"""
>
Yuya Nishihara - March 27, 2017, 3:13 p.m.
On Sun, 26 Mar 2017 11:56:59 -0700, Gregory Szorc wrote:
> [...]
> 
> diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py
> --- a/mercurial/changegroup.py
> +++ b/mercurial/changegroup.py
> @@ -35,12 +35,42 @@ from . import (
>  
>  def readexactly(stream, n):
>      '''read n bytes from stream.read and abort if less was available'''
> -    s = stream.read(n)
> -    if len(s) < n:
> +    # Most of the time, stream.read() returns exactly the number of bytes
> +    # requested. Various Python APIs will perform multiple system calls
> +    # to retrieve more bytes if the first call did not return enough. This
> +    # includes cases where the system call is interrupted.
> +    #
> +    # Even with this retry logic in the Python APIs, we still see failure
> +    # to retrieve the requested number of bytes. So, we build in our own
> +    # retry layer here.
> +    left = n
> +    chunk = stream.read(n)
> +    left -= len(chunk)
> +    assert left >= 0
> +
> +    # Common case is it just works.
> +    if not left:
> +        return chunk
> +
> +    chunks = [chunk]
> +    # Retry to get remaining data. In cases where the stream has errored or
> +    # is at EOF, this will do a bit of redundant work. But it helps prevent
> +    # intermittent failures and isn't common. So the overhead is acceptable.
> +    for i in range(3):
> +        chunk = stream.read(left)
> +        chunks.append(chunk)
> +        left -= len(chunk)
> +        assert left >= 0
> +
> +        if not left:
> +            break

Do you have some number that supports this change actually mitigates the
problem?

Suppose underlying functions have bugs, I don't think we can always get
contiguous chunks with no data loss. If a few bytes lost for example, we
could read arbitrary part as a length field, and call readexactly() with
that invalid length value.
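To make the hazard concrete: changegroup chunks are length-prefixed (getchunk() below in this thread reads a 4-byte big-endian length that counts itself), so a few lost bytes shift the frame boundary and the reader trusts arbitrary payload bytes as a length field. A toy demonstration with made-up frame contents:

```python
import struct

# A well-formed frame: 4-byte big-endian length (counting itself),
# then the payload.
frame = struct.pack(">l", 4 + 5) + b"hello"

# Normal parse: the length field says 9, so the payload is 5 bytes.
length = struct.unpack(">l", frame[:4])[0]
assert frame[4:length] == b"hello"

# Suppose a buggy intermediate layer silently drops the first two bytes.
corrupted = frame[2:]

# The reader still trusts the next 4 bytes as a length field...
bogus = struct.unpack(">l", corrupted[:4])[0]

# ...and would now ask readexactly() for ~600KB that will never arrive,
# producing exactly the "stream ended unexpectedly" abort being debugged.
print(bogus)  # 0x00096865 == 616549
```

So a bounded retry can mask the symptom while the desynchronized stream yields a confusing follow-on failure, which is Yuya's point.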
Gregory Szorc - March 27, 2017, 3:25 p.m.
> On Mar 27, 2017, at 08:13, Yuya Nishihara <yuya@tcha.org> wrote:
> 
>> On Sun, 26 Mar 2017 11:56:59 -0700, Gregory Szorc wrote:
>> [...]
> 
> Do you have some number that supports this change actually mitigates the
> problem?

No. I could try to get that this week.

> 
> Suppose underlying functions have bugs, I don't think we can always get
> contiguous chunks with no data loss. If a few bytes lost for example, we
> could read arbitrary part as a length field, and call readexactly() with
> that invalid length value.

Hmm. My assumption is that any error that would result in data loss effectively poisons the stream and prevents future reads.
Yuya Nishihara - March 27, 2017, 3:59 p.m.
On Mon, 27 Mar 2017 08:25:41 -0700, Gregory Szorc wrote:
> > On Mar 27, 2017, at 08:13, Yuya Nishihara <yuya@tcha.org> wrote:
> >> On Sun, 26 Mar 2017 11:56:59 -0700, Gregory Szorc wrote:
> >> [...]
> > 
> > Do you have some number that supports this change actually mitigates the
> > problem?
> 
> No. I could try to get that this week.

Many thanks.

> > Suppose underlying functions have bugs, I don't think we can always get
> > contiguous chunks with no data loss. If a few bytes lost for example, we
> > could read arbitrary part as a length field, and call readexactly() with
> > that invalid length value.
> 
> Hmm. My assumption is that any error that would result in data loss effectively poisons the stream and prevents future reads.

I was thinking e.g. some intermediate layer might not handle EINTR
appropriately and EINTR might be caught by some upper layer. If it had
a temporary buffer, its content would be lost.
Gregory Szorc - March 30, 2017, 1:20 a.m.
> On Mar 27, 2017, at 08:59, Yuya Nishihara <yuya@tcha.org> wrote:
> 
> On Mon, 27 Mar 2017 08:25:41 -0700, Gregory Szorc wrote:
>>>> On Mar 27, 2017, at 08:13, Yuya Nishihara <yuya@tcha.org> wrote:
>>>> On Sun, 26 Mar 2017 11:56:59 -0700, Gregory Szorc wrote:
>>>> [...]
>>> 
>>> Do you have some number that supports this change actually mitigates the
>>> problem?
>> 
>> No. I could try to get that this week.
> 
> Many thanks.

Upon further investigation, Mozilla's Windows machines install Mercurial through the .exe installer. Running a hacked version for testing will require me to create a custom installer. That rabbit hole is too deep for this week.

I'll try to change our automation to use pip/wheels, then look into this for the 4.2 cycle.

> 
>>> Suppose underlying functions have bugs, I don't think we can always get
>>> contiguous chunks with no data loss. If a few bytes lost for example, we
>>> could read arbitrary part as a length field, and call readexactly() with
>>> that invalid length value.
>> 
>> Hmm. My assumption is that any error that would result in data loss effectively poisons the stream and prevents future reads.
> 
> I was thinking e.g. some intermediate layer might not handle EINTR
> appropriately and EINTR might be caught by some upper layer. If it had
> a temporary buffer, its content would be lost.

Hmmm. I'm trying to think which system calls those could be. I may have to spend some time with strace and man pages...
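For context on the EINTR theory being discussed: at the file-descriptor level, the expected fix is a restart loop like the one below. This is a generic sketch, not Mercurial code; since PEP 475, Python 3.5+ performs this retry automatically for os.read and friends, while older Pythons often surfaced EINTR to callers.

```python
import errno
import os

def read_restarting(fd, n):
    """os.read() that restarts when interrupted by a signal before any
    data was transferred, as well-behaved I/O layers are expected to."""
    while True:
        try:
            return os.read(fd, n)
        except OSError as e:
            if e.errno != errno.EINTR:
                raise
            # EINTR: the syscall was interrupted before transferring
            # any data; simply retry it.
```

If an intermediate layer instead lets EINTR propagate after buffering some bytes, those bytes can be lost, which is the buffer-loss scenario described above.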

Patch

diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py
--- a/mercurial/changegroup.py
+++ b/mercurial/changegroup.py
@@ -35,12 +35,42 @@ from . import (
 
 def readexactly(stream, n):
     '''read n bytes from stream.read and abort if less was available'''
-    s = stream.read(n)
-    if len(s) < n:
+    # Most of the time, stream.read() returns exactly the number of bytes
+    # requested. Various Python APIs will perform multiple system calls
+    # to retrieve more bytes if the first call did not return enough. This
+    # includes cases where the system call is interrupted.
+    #
+    # Even with this retry logic in the Python APIs, we still see failure
+    # to retrieve the requested number of bytes. So, we build in our own
+    # retry layer here.
+    left = n
+    chunk = stream.read(n)
+    left -= len(chunk)
+    assert left >= 0
+
+    # Common case is it just works.
+    if not left:
+        return chunk
+
+    chunks = [chunk]
+    # Retry to get remaining data. In cases where the stream has errored or
+    # is at EOF, this will do a bit of redundant work. But it helps prevent
+    # intermittent failures and isn't common. So the overhead is acceptable.
+    for i in range(3):
+        chunk = stream.read(left)
+        chunks.append(chunk)
+        left -= len(chunk)
+        assert left >= 0
+
+        if not left:
+            break
+
+    if left:
         raise error.Abort(_("stream ended unexpectedly"
                            " (got %d bytes, expected %d)")
-                          % (len(s), n))
-    return s
+                          % (n - left, n))
+
+    return b''.join(chunks)
 
 def getchunk(stream):
     """return the next chunk from stream as a string"""