Submitter | phabricator |
---|---|
Date | Feb. 28, 2020, 6:55 p.m. |
Message ID | <differential-rev-PHID-DREV-3knzstkspvfwgbwpqtqc-req@mercurial-scm.org> |
Download | mbox | patch |
Permalink | /patch/45396/ |
State | Superseded |
Headers | show |
Comments
durin42 added a comment. This script feels extremely like a flaky test waiting to happen. Is there no alternative for this? REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8189/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8189 To: marmoute, #hg-reviewers Cc: durin42, mercurial-devel
marmoute added a comment. By using explicit wait on signal (through the fs) that each process reached the appropriate file. We avoid flackyness. There are already multiple use of this approach in the test suite, that does not suffer from flackyness (unlike the wheelbarrow of flaky test relying on sleep for sync). REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8189/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8189 To: marmoute, #hg-reviewers Cc: durin42, mercurial-devel
durin42 added a comment.
In D8189#123280 <https://phab.mercurial-scm.org/D8189#123280>, @marmoute wrote:
> By using explicit wait on signal (through the fs) that each process reached the appropriate file. We avoid flackyness. There are already multiple use of this approach in the test suite, that does not suffer from flackyness (unlike the wheelbarrow of flaky test relying on sleep for sync).
You avoid flakyness iff the test manages to finish this step in under 20 seconds (in the next change, as an example). Which is to say, this is still a flake waiting to happen, you've just made it less likely. I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts?
I'm also not happy about the 1-second floor this puts on the step. Doesn't sleep(1) support sub-second sleeps on all platforms at this point?
REPOSITORY
rHG Mercurial
CHANGES SINCE LAST ACTION
https://phab.mercurial-scm.org/D8189/new/
REVISION DETAIL
https://phab.mercurial-scm.org/D8189
To: marmoute, #hg-reviewers
Cc: durin42, mercurial-devel
mharbison72 added a comment.
In D8189#123323 <https://phab.mercurial-scm.org/D8189#123323>, @durin42 wrote:
> I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts?
I filed a bug about this that self-archived, but:
$ echo ' $ sleep 10' > test-timeout.t
$ time ./run-tests.py --local test-timeout.t -t 5
running 1 tests using 1 parallel processes
t
Failed test-timeout.t: timed out
# Ran 1 tests, 0 skipped, 1 failed.
python hash seed: 204038743
real 0m10.363s
user 0m0.000s
sys 0m0.030s
So it looks like tests never timeout, but then the result is discarded afterward if the timeout period elapsed. I can reproduce it on Windows and macOS.
REPOSITORY
rHG Mercurial
CHANGES SINCE LAST ACTION
https://phab.mercurial-scm.org/D8189/new/
REVISION DETAIL
https://phab.mercurial-scm.org/D8189
To: marmoute, #hg-reviewers
Cc: mharbison72, durin42, mercurial-devel
marmoute added a comment. In D8189#123323 <https://phab.mercurial-scm.org/D8189#123323>, @durin42 wrote: > In D8189#123280 <https://phab.mercurial-scm.org/D8189#123280>, @marmoute wrote: > >> By using explicit wait on signal (through the fs) that each process reached the appropriate file. We avoid flackyness. There are already multiple use of this approach in the test suite, that does not suffer from flackyness (unlike the wheelbarrow of flaky test relying on sleep for sync). > > You avoid flakyness iff the test manages to finish this step in under 20 seconds (in the next change, as an example). Which is to say, this is still a flake waiting to happen, you've just made it less likely. I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts? The 20 seconds seems like a lots of margin already, but I am fine with bumping it more it that make your more confortable. Waiting for the test timeout is not a reasonable option because the test is killed without any details (and it is LOONG). The most common case for reaching the timeout is for one of the process to crash before reaching the checkpoing. When that happens, we want to be able to read the traceback. The second most common is code misbehaving and not going through the expected codepath. We also was to get output in this case. So in short, we need a clean way out in case of error and I have no better option than a (possibly long) timeout right now. > I'm also not happy about the 1-second floor this puts on the step. Doesn't sleep(1) support sub-second sleeps on all platforms at this point? I am extremly sad too. But last time I checked, it was still not the case. We detect plateform and use small increment on better plateform (but I would rather follow up for that). REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8189/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8189 To: marmoute, #hg-reviewers Cc: mharbison72, durin42, mercurial-devel
marmoute added a comment. In D8189#123521 <https://phab.mercurial-scm.org/D8189#123521>, @mharbison72 wrote: > In D8189#123323 <https://phab.mercurial-scm.org/D8189#123323>, @durin42 wrote: > >> I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts? > > I filed a bug about this that self-archived, but: > > $ echo ' $ sleep 10' > test-timeout.t > $ time ./run-tests.py --local test-timeout.t -t 5 > running 1 tests using 1 parallel processes > t > Failed test-timeout.t: timed out > # Ran 1 tests, 0 skipped, 1 failed. > python hash seed: 204038743 > real 0m10.363s > user 0m0.000s > sys 0m0.030s > > So it looks like tests never timeout, but then the result is discarded afterward if the timeout period elapsed. I can reproduce it on Windows and macOS. Do you have a link to the bug ? REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8189/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8189 To: marmoute, #hg-reviewers Cc: mharbison72, durin42, mercurial-devel
mharbison72 added a comment. In D8189#123526 <https://phab.mercurial-scm.org/D8189#123526>, @marmoute wrote: > In D8189#123521 <https://phab.mercurial-scm.org/D8189#123521>, @mharbison72 wrote: > >> In D8189#123323 <https://phab.mercurial-scm.org/D8189#123323>, @durin42 wrote: >> >>> I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts? >> >> I filed a bug about this that self-archived, but: >> >> $ echo ' $ sleep 10' > test-timeout.t >> $ time ./run-tests.py --local test-timeout.t -t 5 >> running 1 tests using 1 parallel processes >> t >> Failed test-timeout.t: timed out >> # Ran 1 tests, 0 skipped, 1 failed. >> python hash seed: 204038743 >> real 0m10.363s >> user 0m0.000s >> sys 0m0.030s >> >> So it looks like tests never timeout, but then the result is discarded afterward if the timeout period elapsed. I can reproduce it on Windows and macOS. > > Do you have a link to the bug ? https://bz.mercurial-scm.org/show_bug.cgi?id=6125 > Waiting for the test timeout is not a reasonable option because the test is killed without any details (and it is LOONG). The most common case for reaching the timeout is for one of the process to crash before reaching the checkpoing. When that happens, we want to be able to read the traceback. The second most common is code misbehaving and not going through the expected codepath. We also was to get output in this case. So in short, we need a clean way out in case of error and I have no better option than a (possibly long) timeout right now. I wonder if the test harness can be modified to process the data collected up to the point of the timeout, so that it's obvious what is getting stuck. REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8189/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8189 To: marmoute, #hg-reviewers Cc: mharbison72, durin42, mercurial-devel
marmoute added a comment. Gentle ping on this patch. REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8189/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8189 To: marmoute, #hg-reviewers Cc: mharbison72, durin42, mercurial-devel
durin42 added a comment. I'm still -0 on this: I'd rather we found an approach that didn't require sleeping for so long. Perhaps a Python script would be a better fit here? (I won't block this landing, but I won't push it.) REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8189/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8189 To: marmoute, #hg-reviewers Cc: mharbison72, durin42, mercurial-devel
marmoute added a comment. marmoute added a subscriber: durin42. In D8189#124122 <https://phab.mercurial-scm.org/D8189#124122>, @durin42 wrote: > I'm still -0 on this: I'd rather we found an approach that didn't require sleeping for so long. Perhaps a Python script would be a better fit here? > (I won't block this landing, but I won't push it.) What abotu replacing the sleep 1 by a `python -c "import time; time.sleep(0.1)" would that make you happy ? REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8189/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8189 To: marmoute, #hg-reviewers Cc: durin42, mharbison72, mercurial-devel
durin42 added a comment. In D8189#124124 <https://phab.mercurial-scm.org/D8189#124124>, @marmoute wrote: > In D8189#124122 <https://phab.mercurial-scm.org/D8189#124122>, @durin42 wrote: > >> I'm still -0 on this: I'd rather we found an approach that didn't require sleeping for so long. Perhaps a Python script would be a better fit here? >> (I won't block this landing, but I won't push it.) > > What abotu replacing the sleep 1 by a `python -c "import time; time.sleep(0.1)" would that make you happy ? Happy? No, not really. I think I'd rather it was a Python script, but what I'd _really_ rather (and what would actually make me happy) is that we didn't have sleep-required steps like this at all. They're inherently racy , and I feel like there's got to be a better solution. At this point I don't have the patience to try and work through this patch, so you'll need to find a different reviewer. REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8189/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8189 To: marmoute, #hg-reviewers Cc: durin42, mharbison72, mercurial-devel
marmoute added a comment. marmoute added a subscriber: durin42. In D8189#123522 <https://phab.mercurial-scm.org/D8189#123522>, @marmoute wrote: > In D8189#123323 <https://phab.mercurial-scm.org/D8189#123323>, @durin42 wrote: > >> […] >> I'm also not happy about the 1-second floor this puts on the step. Doesn't sleep(1) support sub-second sleeps on all platforms at this point? > > I am extremly sad too. But last time I checked, it was still not the case. We detect plateform and use small increment on better plateform (but I would rather follow up for that). Good news, even is sub-second is not expect to work on all plateforms, I found out that we already use it in our test suite. So any plateform that does not support it are already broken. I'll send an update soon. I am adding a small change to have the local timeout adjust itself according to the global time out. I am not aware of real live issues with the local timeout, but adding that logic is simple enough. REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8189/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8189 To: marmoute, #hg-reviewers Cc: durin42, mharbison72, mercurial-devel
pulkit added a comment. I wanted to help with things here but unfortunately I have ~0 experience with shell scripts and the kind of process testing going in next few patches. REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8189/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8189 To: marmoute, #hg-reviewers Cc: pulkit, durin42, mharbison72, mercurial-devel
Patch
diff --git a/tests/testlib/wait-on-file b/tests/testlib/wait-on-file new file mode 100755 --- /dev/null +++ b/tests/testlib/wait-on-file @@ -0,0 +1,32 @@ +#!/bin/bash +# +# wait up to TIMEOUT seconds until a WAIT_ON_FILE is created. +# +# In addition, this script can create CREATE_FILE once it is ready to wait. + +if [ $# -lt 2 ] || [ $# -gt 3 ]; then + echo $# + echo "USAGE: $0 TIMEOUT WAIT_ON_FILE [CREATE_FILE]" +fi + +timer="$1" +wait_on="$2" +create="" +if [ $# -eq 3 ]; then + create="$3" +fi + +if [ -n "$create" ]; +then + touch "$create" + create="" +fi +while [ "$timer" -gt 0 ] && [ ! -f "$wait_on" ]; +do + timer=$(( timer - 1)) + sleep 1 +done +if [ "$timer" -le 0 ]; then + echo "file not created after $1 seconds: $wait_on" >&2 + exit 1 +fi