GitHub Actions randomly kill a test program

This is my first time using Rust and participating here. We have spent several days trying to make progress on the following bug:

Essentially, the following program:

fn main() {
    println!("Command: --help");
}

gets randomly killed with SIGKILL (signal 9) when I run it from a Rust testsuite.

I can't reproduce it on macOS by hand, so I have been debugging it by submitting commits with a script, trying to reproduce it, for example the PR https://github.com/certik/fpm/pull/2 has 549 commits...

I also reported it to GitHub, but they can't reproduce it either. As a new user I cannot post more than 2 links, but you can find the link in the above issue. At the end of the issue I describe how we worked around it for now, but it is hackish and does not fix the underlying problem.

Has anyone seen something similar? What would you suggest me to do?

Wow, this seems like a pretty frustrating issue.

I've done some digging on your tests (in particular https://github.com/certik/fpm/runs/396004690), and to summarise that and re-state what you said above, to make sure I understand:

  1. you have a number of tests that each run a binary that just prints a line
  2. one of these binaries failed with SIGKILL (and very quickly too, i.e. it's not some forcible killing of a long-running process)
  3. (in this test) you have observed that re-running the same test in a CI run will eventually succeed
  4. (in ???) you have observed that using a single test threads will also occasionally fail

Some comments:

  • From 3 I think we can eliminate the go bug you linked to. In that case the toolchain was generating binaries that wouldn't even start, whereas the run in 3 indicates that Cargo has not changed anything (per all the Fresh)...and the run is successful. To me, this indicates it's a transient issue that you should be able to reproduce by running the test runner binary directly in a loop e.g. 500 times - it should be pretty quick :slight_smile:
  • I've not seen anything like this before, and I currently doubt it's the fault of the Rust toolchain given it's transient - to my knowledge there is a small set of things that a binary can do to cause it to be SIGKILLed I think we can eliminate the binary being invalid.
  • I would be very interested in seeing an example of using a single test thread with your binary that just prints a line and that failing. My top suspicion is something (be it the test runner, filesystem, the binary (somehow!?), the OS) tripping over itself in quickly spawning this newly created binary concurrently. Unlikely I know, but a useful data point.

Suggested next steps:

  1. Build the tests with cargo, then invoke the test runner binary (with a name like target/debug/deps/fpm-8ae8e63dbf52e46d) directly in a loop 500 times (with a break out on failure). Hopefully that should let you consistently reproduce on each push, without taking too long.
  2. If you can reproduce, start logging the end of the OSX equivalent of the syslog (apple system log I suppose? I'm not familiar with OSX), just in case
  3. If you can reproduce, do the same but invoke the test runner with a single test thread.
  4. If you can reproduce, do the same but under an strace equivalent (dtruss I guess?), outputting to a file each time in the loop and only catting the one that failed.
3 Likes

Thanks a lot @aidanhs for the feedback.

A few days ago I tried to run the test binary in a loop in Rust, and that didn't trigger it reliably either. This time I did what you suggested --- actually calling cargo test many times:

And it still does not trigger it reliably. I had to push in many commits with my trigger script, and eventually at least one failed: https://github.com/certik/fpm/runs/404241856 with the usual error.

Regarding your summary, yes, that is correct. It is very hard to say with certainty what the facts are, because it is so hard to reliably reproduce.

With what I have seen, there are several different theories that seem to satisfy the facts so far:

  • There might be a faulty machine at GitHub that runs the Actions, and so you only trigger it when you get unlucky to run on that particular machine (that would explain why in all other cases the tests pass, even if you rerun them). But the fact that rerunning cargo test on this very machine fixes the issue suggests that even on this faulty machine the failure only happens sometimes.

  • The error only happens the first time you run the tests, so there might be something special about that. I haven't seen a case that the first time cargo test would succeed, but it would fail later. It's always the first time that it fails.

  • The error seems to happen if I invoke the binary via Rust. If I execute it from a Bash script, it does not seem to happen. So perhaps there is something that Rust does that triggers it sometimes. The failure does not seem to be caused by the binary that gets killed --- but so far I failed to reproduce if I run another binary, such as /bin/ls.

  • I also observed that if I do not install Python (and cmake) and GFortran, then the failure rate goes quickly to (almost) zero. So I suspect the installation of those things somehow messes up the macOS image.

1 Like

Ok this seems very interesting. I have a vague memory of in Rust CI somewhere, there being an issue where something thought a binary was written to disk, but accessing it didn't work and had similarly inscrutable failures (though rather different in failure mode). Unfortunately my memory is too hazy for any details.

Can you try inserting the commands to sync the filesystem after your build step? i.e. cargo test --verbose --no-run -> cargo test --verbose --no-run && sync && sudo purge.

One other thing to do in parallel might be to log some details about the machine, e.g. machine id (unsure about OSX equivalent), end of syslog on failure, information about disk/partitions (it's a bit annoying none of this information is logged by default on job startup).

2 Likes

Thanks @aidanhs. First I tested sync and purge: https://github.com/certik/fpm/pull/8, that still failed with the usual error: https://github.com/certik/fpm/runs/407921739.

I then tested first running the executable (that sometimes gets killed) by hand, and only then via cargo test: https://github.com/certik/fpm/pull/9. That failed a few times such as in https://github.com/certik/fpm/runs/407975246, but the weird thing is that when I run it by hand, it succeeds, but it fails when run via cargo test. This is actually consistent with my initial experience --- I used to run the tests via a bash script and I never observed a failure. Then I switched to cargo test and that's when failures started to occur.

So I think the bug only happens when the executable is invoked via cargo test.

I made some breakthrough in https://github.com/certik/fpm/pull/13. What I found out there is that the following patch:

--- a/tests/cli.rs
+++ b/tests/cli.rs
@@ -45,7 +45,7 @@ impl Success2 for Assert {
 
 #[test]
 fn test_help() {
-    let mut cmd = Command::cargo_bin("fpm").unwrap();
+    let mut cmd = Command::new("./target/debug/fpm");
     cmd.arg("--help");
     cmd.assert()
         .success2()

Fixes the problem.

Here is what I found. I first introduced this patch, and let it run for 30 commits (right above the comment https://github.com/certik/fpm/pull/13#issuecomment-579004402). When I saw 20 straight successes, I assumed this will all pass, so I reverted this patch (to use cargo_bin("fpm") again) and submitted another 30 commits. In those new 30 commits, 11 failed, as expected. However, in those first 30 commits that should have all passed, somehow two failed at the very end:


My theory is that there are about 5 macOS jobs running at the same time, so once the bad commits (with cargo_bin("fpm")) started running, somehow they corrupted the macOS (virtual?) machines and that caused even the good commits (with new("./target/debug/fpm)) to also fail.

To be sure, I reverted again to new("./target/debug/fpm) and submitted 30 commits, they all passed. After that I submitted another 50 commits and they also all passed.

Let's brainstorm what could the cargo_bin be doing to cause this bug. It is defined in the assert_cmd package at (for the version 0.10.2 that I've been using, as recommended by https://rust-cli.github.io/book/tutorial/testing.html#testing-cli-applications-by-running-them):

I noticed that the latest assert_cmd package reworked this cargo_bin function. The implementation that I've been using, as shown in this comment, seems to rebuild the package using Cargo sometimes, if I understand it correctly. That seems like a very complex thing to do, so I can easily see tons of opportunities for a bug like we have been seeing. The solution is to simply call the already pre-built binary. So if the build itself fails for some random reason, one can restart the test. But once it builds, the test should 100% work.

Unless the macOS machine itself gets corrupted by another test, as shown above.

This PR seems to fix this issue: https://github.com/fortran-lang/fpm/pull/29, and I tested it with 30 commits (https://github.com/certik/fpm/pull/14) and it seems to work robustly. I don't know exactly why that is, but everything seems to suggest that cargo_bin is not robust on macOS and should not be used.

Wow, great find! Still two things that confuse me, but they may remain forever unsolved. First:

This is mad - it's a struggle for me to hypothesise what could cause interference in another VM. I'm more inclined to believe that it's just become even more intermittent - I suppose you'll see if it comes back!

Second: I'm still not sure what could be causing a SIGKILL - it's a very unusual signal to be killed with (though I'm not familiar with OSX). I suppose it's cargo dying, rather than the test in that case. The cargo command should not touch the files to rebuild because it's already done... I reckon a full strace-equivalent log might still reveal something.

Let us know if the issue returns! It's a really interesting problem.

2 Likes

@aidanhs, indeed, I am puzzled by it too. The good news is that now when our master has the "fix" that we think fixes the problem, if the problem comes back, then we'll investigate more. If it does not come back, then that would very strongly suggest that somehow the VM itself got corrupted. Yes, it's a wild speculation, but it's the only idea that makes sense to me given the facts so far.

Cargo is encountering a similiar problem with OSX builds at https://github.com/rust-lang/cargo/issues/7821, and have come up with a non-cargo reproduction.

1 Like