GitHub Actions randomly kill a test program

certik · January 21, 2020, 6:28pm

This is my first time using Rust and participating here. We have spent several days trying to make progress on the following bug:

github.com/fortran-lang/fpm issue: "The fpm binary gets interrupted at the CI (macOS)" (opened by certik 03:02PM - 15 Jan 20 UTC, closed 05:35AM - 28 Jan 20 UTC)

There is a bug at our CI that I haven't been able to figure out yet. Here is an example of it: https://github.com/fortran-lang/fpm/runs/390475601. Here is what I know:

1. It only happens on macOS, never on Linux or Windows.
2. Restarting the build typically fixes it (sometimes it fails 2x or 3x in a row, but eventually it always passes).
3. `cargo test` runs in parallel by default, so I set `-j1` to run in serial. That seemed to reduce how often it fails (although I could be wrong on that). It still fails sometimes, however, so the actual bug is still there.
4. The error is:

```
thread 'test_2' panicked at 'Unexpected failure.
code=<interrupted>
stderr=``````
command=`"/Users/runner/runners/2.163.1/work/fpm/fpm/target/x86_64-apple-darwin/debug/fpm" "build"`
code=<interrupted>
stdout=``````
stderr=``````
', /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/assert_cmd-0.10.2/src/assert.rs:148:17
```

This panic comes from the `assert_cmd` package, which just uses Rust's [std::process::Command](https://doc.rust-lang.org/std/process/struct.Command.html). It is raised when the command did not succeed and no exit code could be retrieved, which according to the [documentation](https://doc.rust-lang.org/std/process/struct.ExitStatus.html#method.code) means the process was terminated by a signal (such as SIGKILL).

Essentially, the following program:

fn main() {
    println!("Command: --help");
}

gets randomly killed with SIGKILL (signal 9) when I run it from a Rust test suite.
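To see which signal is behind a `code=<interrupted>` report, the exit status can be inspected directly. The following is a minimal sketch using only the standard library on Unix; the helper name `describe_exit` and the `sh -c 'kill -9 $$'` stand-in for a killed child are assumptions made for illustration, not part of the fpm test suite.

```rust
use std::process::{Command, ExitStatus};

#[cfg(unix)]
use std::os::unix::process::ExitStatusExt;

/// Describe how a child process ended: by exit code, or by the signal that killed it.
/// (Hypothetical helper, named here for illustration only.)
fn describe_exit(status: ExitStatus) -> String {
    // `code()` returns None on Unix when the process was terminated by a signal.
    if let Some(code) = status.code() {
        return format!("exited with code {}", code);
    }
    #[cfg(unix)]
    {
        // `signal()` comes from the Unix-only ExitStatusExt trait.
        if let Some(sig) = status.signal() {
            return format!("killed by signal {}", sig);
        }
    }
    "terminated without an exit code".to_string()
}

fn main() {
    // Stand-in for the killed binary: a shell that sends SIGKILL to itself.
    match Command::new("sh").arg("-c").arg("kill -9 $$").status() {
        Ok(status) => println!("{}", describe_exit(status)),
        Err(e) => eprintln!("could not spawn sh: {}", e),
    }
}
```

Logging the signal number this way would at least confirm whether the child is really receiving signal 9 rather than some other signal.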

I can't reproduce it on macOS by hand, so I have been debugging it by submitting commits with a script, trying to reproduce it, for example the PR https://github.com/certik/fpm/pull/2 has 549 commits...

I also reported it to GitHub, but they can't reproduce it either. As a new user I cannot post more than 2 links, but you can find the link in the above issue. At the end of the issue I describe how we worked around it for now, but it is hackish and does not fix the underlying problem.

Has anyone seen something similar? What would you suggest I do?

aidanhs · January 22, 2020, 11:42pm

Wow, this seems like a pretty frustrating issue.

I've done some digging on your tests (in particular CI trigger · certik/fpm@c6d2185 · GitHub), and to summarise that and re-state what you said above, to make sure I understand:

1. you have a number of tests that each run a binary that just prints a line
2. one of these binaries failed with SIGKILL (and very quickly too, i.e. it's not some forcible killing of a long-running process)
3. (in this test) you have observed that re-running the same test in a CI run will eventually succeed
4. (in ???) you have observed that using a single test thread will also occasionally fail

Some comments:

- From 3 I think we can eliminate the Go bug you linked to. In that case the toolchain was generating binaries that wouldn't even start, whereas the run in 3 indicates that Cargo has not changed anything (per all the Fresh)...and the run is successful. To me, this indicates it's a transient issue that you should be able to reproduce by running the test runner binary directly in a loop, e.g. 500 times. It should be pretty quick.
- I've not seen anything like this before, and I currently doubt it's the fault of the Rust toolchain given it's transient. To my knowledge there is only a small set of things a binary can do to get itself SIGKILLed, and I think we can eliminate the binary being invalid.
- I would be very interested in seeing an example of a single test thread running your binary that just prints a line, and that failing. My top suspicion is something (be it the test runner, the filesystem, the binary (somehow!?), or the OS) tripping over itself when quickly spawning this newly created binary concurrently. Unlikely I know, but a useful data point.

Suggested next steps:

1. Build the tests with cargo, then invoke the test runner binary (with a name like target/debug/deps/fpm-8ae8e63dbf52e46d) directly in a loop 500 times (with a break out on failure). Hopefully that should let you consistently reproduce on each push, without taking too long.
2. If you can reproduce, start logging the end of the OSX equivalent of the syslog (apple system log I suppose? I'm not familiar with OSX), just in case.
3. If you can reproduce, do the same but invoke the test runner with a single test thread.
4. If you can reproduce, do the same but under an strace equivalent (dtruss I guess?), outputting to a file each time in the loop and only catting the one that failed.
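The loop-until-failure idea from the first step could be sketched roughly as follows, using only the standard library. The helper name `run_until_failure` is hypothetical, and the runner path is the example name from the step above (its hash suffix changes between builds), so the real path would need to be passed as the first argument.

```rust
use std::io;
use std::process::{Command, ExitStatus};

/// Run `program` up to `max_runs` times and stop at the first failing run.
/// Returns the 1-based iteration and exit status of that run, or None if every run passed.
/// (Hypothetical helper, named here for illustration only.)
fn run_until_failure(
    program: &str,
    args: &[&str],
    max_runs: usize,
) -> io::Result<Option<(usize, ExitStatus)>> {
    for i in 1..=max_runs {
        let status = Command::new(program).args(args).status()?;
        if !status.success() {
            // Break out on the first failure so its logs are the last thing printed.
            return Ok(Some((i, status)));
        }
    }
    Ok(None)
}

fn main() {
    // Assumed default path; the hash suffix differs per build, so pass the real one as argv[1].
    let runner = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "target/debug/deps/fpm-8ae8e63dbf52e46d".to_string());
    match run_until_failure(&runner, &[], 500) {
        Ok(None) => println!("all 500 runs passed"),
        Ok(Some((i, status))) => println!("run {} failed: {:?}", i, status),
        Err(e) => eprintln!("could not spawn {}: {}", runner, e),
    }
}
```

For the single-thread variant in step 3, the test runner binary accepts `--test-threads=1`, which could be passed through the `args` slice.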

certik · January 23, 2020, 1:24am

Thanks a lot @aidanhs for the feedback.

A few days ago I tried to run the test binary in a loop in Rust, and that didn't trigger it reliably either. This time I did what you suggested, actually calling cargo test many times.

And it still does not trigger it reliably. I had to push in many commits with my trigger script, and eventually at least one failed: CI trigger · certik/fpm@2282a5d · GitHub with the usual error.

Regarding your summary, yes, that is correct. It is very hard to say with certainty what the facts are, because it is so hard to reliably reproduce.

With what I have seen, there are several different theories that seem to satisfy the facts so far:

1. There might be a faulty machine at GitHub that runs the Actions, so you only trigger the bug when you are unlucky enough to land on that particular machine (that would explain why in all other cases the tests pass, even if you rerun them). But the fact that rerunning cargo test on this very machine fixes the issue suggests that even on this faulty machine the failure only happens sometimes.
2. The error only happens the first time you run the tests, so there might be something special about that. I haven't seen a case where the first cargo test run succeeds and a later one fails. It's always the first run that fails.
3. The error seems to happen if I invoke the binary via Rust. If I execute it from a Bash script, it does not seem to happen. So perhaps there is something that Rust does that triggers it sometimes. The failure does not seem to be caused by the binary that gets killed, although so far I have failed to reproduce it when running another binary, such as /bin/ls.
4. I also observed that if I do not install Python (and cmake) and GFortran, then the failure rate drops to (almost) zero. So I suspect the installation of those things somehow messes up the macOS image.

aidanhs · January 24, 2020, 11:50am

Ok, this seems very interesting. I have a vague memory of an issue somewhere in Rust CI where something thought a binary had been written to disk, but accessing it didn't work, with similarly inscrutable failures (though a rather different failure mode). Unfortunately my memory is too hazy for any details.

Can you try inserting the commands to sync the filesystem after your build step? That is, change cargo test --verbose --no-run to cargo test --verbose --no-run && sync && sudo purge.

One other thing to do in parallel might be to log some details about the machine, e.g. machine id (unsure about OSX equivalent), end of syslog on failure, information about disk/partitions (it's a bit annoying none of this information is logged by default on job startup).

certik · January 27, 2020, 6:54pm

Thanks @aidanhs. First I tested sync and purge: Repr11 by certik · Pull Request #8 · certik/fpm · GitHub, that still failed with the usual error: CI trigger · certik/fpm@36da12c · GitHub.

I then tested first running the executable (that sometimes gets killed) by hand, and only then via cargo test: Repr12 by certik · Pull Request #9 · certik/fpm · GitHub. That failed a few times such as in CI trigger · certik/fpm@98ce685 · GitHub, but the weird thing is that when I run it by hand, it succeeds, but it fails when run via cargo test. This is actually consistent with my initial experience --- I used to run the tests via a bash script and I never observed a failure. Then I switched to cargo test and that's when failures started to occur.

So I think the bug only happens when the executable is invoked via cargo test.

certik · January 28, 2020, 1:35am

I made a breakthrough in Repr15 by certik · Pull Request #13 · certik/fpm · GitHub. What I found out there is that the following patch:

--- a/tests/cli.rs
+++ b/tests/cli.rs
@@ -45,7 +45,7 @@ impl Success2 for Assert {
 
 #[test]
 fn test_help() {
-    let mut cmd = Command::cargo_bin("fpm").unwrap();
+    let mut cmd = Command::new("./target/debug/fpm");
     cmd.arg("--help");
     cmd.assert()
         .success2()

fixes the problem.
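Stripped of the assert_cmd helpers, the patched approach amounts to spawning an already-built executable by path, with no Cargo involvement at test time. Here is a minimal sketch of that idea using only the standard library; the helper name `run_prebuilt` is hypothetical, and the path `./target/debug/fpm` is the one from the patch and must already exist for the call to succeed.

```rust
use std::io;
use std::process::{Command, Output};

/// Spawn an already-built executable directly and capture its output.
/// (Hypothetical helper, named here for illustration only.)
fn run_prebuilt(path: &str, args: &[&str]) -> io::Result<Output> {
    Command::new(path).args(args).output()
}

fn main() {
    // Path taken from the patch above; it is relative to the directory the test runs in.
    match run_prebuilt("./target/debug/fpm", &["--help"]) {
        Ok(out) if out.status.success() => {
            print!("{}", String::from_utf8_lossy(&out.stdout));
        }
        Ok(out) => eprintln!("fpm failed: {:?}", out.status),
        Err(e) => eprintln!("could not spawn fpm: {}", e),
    }
}
```

A hard-coded relative path depends on the working directory and the build profile. In newer toolchains (Rust 1.43 and later), Cargo also exposes the path of each built binary to integration tests at compile time through `env!("CARGO_BIN_EXE_<name>")`, which may be a more robust way to locate it.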

Here is what I found. I first introduced this patch, and let it run for 30 commits (right above the comment Repr15 by certik · Pull Request #13 · certik/fpm · GitHub). When I saw 20 straight successes, I assumed they would all pass, so I reverted this patch (to use cargo_bin("fpm") again) and submitted another 30 commits. In those new 30 commits, 11 failed, as expected. However, in those first 30 commits that should have all passed, somehow two failed at the very end.

My theory is that there are about 5 macOS jobs running at the same time, so once the bad commits (with cargo_bin("fpm")) started running, somehow they corrupted the macOS (virtual?) machines and that caused even the good commits (with new("./target/debug/fpm")) to also fail.

To be sure, I reverted again to new("./target/debug/fpm") and submitted 30 commits; they all passed. After that I submitted another 50 commits and they also all passed.

certik · January 28, 2020, 1:47am

Let's brainstorm what cargo_bin could be doing to cause this bug. It is defined in the assert_cmd package at the following location (for version 0.10.2, which I've been using as recommended by Testing - Command Line Applications in Rust):

github.com/assert-rs/assert_cmd/blob/73d82c243a94886b2bbc50677ea46bc03d501f7d/src/cargo.rs#L159

impl CommandCargoExt for process::Command {
    fn main_binary() -> Result<Self, CargoError> {
        let runner = escargot::CargoBuild::new()
            .current_release()
            .current_target()
            .run()
            .map_err(CargoError::with_cause)?;
        Ok(runner.command())
    }

    fn cargo_bin<S: AsRef<ffi::OsStr>>(name: S) -> Result<Self, CargoError> {
        let runner = escargot::CargoBuild::new()
            .bin(name)
            .current_release()
            .current_target()
            .run()
            .map_err(CargoError::with_cause)?;
        Ok(runner.command())
    }

    fn cargo_example<S: AsRef<ffi::OsStr>>(name: S) -> Result<Self, CargoError> {

I noticed that the latest assert_cmd package reworked this cargo_bin function. The implementation that I've been using, as shown in this comment, seems to rebuild the package using Cargo sometimes, if I understand it correctly. That seems like a very complex thing to do, so I can easily see tons of opportunities for a bug like we have been seeing. The solution is to simply call the already pre-built binary. So if the build itself fails for some random reason, one can restart the test. But once it builds, the test should 100% work.

Unless the macOS machine itself gets corrupted by another test, as shown above.

certik · January 28, 2020, 5:38am

This PR seems to fix this issue: Use Command::new() instead of Command::cargo_new() by certik · Pull Request #29 · fortran-lang/fpm · GitHub, and I tested it with 30 commits (Repr16 by certik · Pull Request #14 · certik/fpm · GitHub) and it seems to work robustly. I don't know exactly why that is, but everything seems to suggest that cargo_bin is not robust on macOS and should not be used.

aidanhs · January 29, 2020, 5:56pm

Wow, great find! Still two things that confuse me, but they may remain forever unsolved. First:

On the idea that the bad commits corrupted other macOS machines: this is mad - it's a struggle for me to hypothesise what could cause interference in another VM. I'm more inclined to believe that it's just become even more intermittent - I suppose you'll see if it comes back!

Second: I'm still not sure what could be causing a SIGKILL - it's a very unusual signal to be killed with (though I'm not familiar with OSX). I suppose it's cargo dying, rather than the test in that case. The cargo command should not touch the files to rebuild because it's already done... I reckon a full strace-equivalent log might still reveal something.

Let us know if the issue returns! It's a really interesting problem.

certik · January 29, 2020, 7:52pm

@aidanhs, indeed, I am puzzled by it too. The good news is that now that our master has the "fix" that we think resolves the problem, if the problem comes back, we'll investigate more. If it does not come back, then that would very strongly suggest that somehow the VM itself got corrupted. Yes, it's wild speculation, but it's the only idea that makes sense to me given the facts so far.

aidanhs · February 12, 2020, 3:19pm

Cargo is encountering a similar problem with OSX builds at Update CI for upcoming macOS changes · Issue #7821 · rust-lang/cargo · GitHub, and they have come up with a non-cargo reproduction.

system · May 12, 2020, 3:19pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
