Thanks a lot @aidanhs for the feedback.
A few days ago I tried running the test binary in a loop from Rust, and that didn't trigger it reliably either. This time I did what you suggested and actually called `cargo test` many times.
And it still does not trigger it reliably. I had to push many commits with my trigger script, and eventually at least one failed with the usual error: CI trigger · certik/fpm@2282a5d · GitHub.
Regarding your summary, yes, that is correct. It is very hard to say with certainty what the facts are, because it is so hard to reliably reproduce.
From what I have seen, several different theories seem to fit the facts so far:
- There might be a faulty machine at GitHub that runs the Actions, so you only trigger the failure when you are unlucky enough to land on that particular machine (that would explain why the tests pass in all other cases, even if you rerun them). But the fact that rerunning `cargo test` on that very machine fixes the issue suggests that even on the faulty machine the failure only happens sometimes.
- The error only happens the first time you run the tests, so there might be something special about that. I haven't seen a case where the first `cargo test` succeeded and a later one failed. It's always the first run that fails.
- The error seems to happen if I invoke the binary via Rust. If I execute it from a Bash script, it does not seem to happen. So perhaps there is something that Rust does that triggers it sometimes. The failure does not seem to be caused by the binary that gets killed, but so far I have failed to reproduce it when running another binary, such as `/bin/ls`.
- I also observed that if I do not install Python (and cmake) and GFortran, the failure rate quickly drops to (almost) zero. So I suspect that installing those things somehow messes up the macOS image.
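To compare the third theory across different binaries, something like the following could be used. This is only a minimal sketch, not the script from my runs: the names `Outcome`, `classify` and `run_many` are made up for illustration, and `/bin/ls /` with 20 runs is just a placeholder command. It spawns the program repeatedly through `std::process::Command` and records whether each run succeeded, exited with a non-zero code, or was killed by a signal (Unix only).

```rust
use std::process::{Command, ExitStatus, Stdio};

/// How a single run ended (hypothetical helper type for this sketch).
#[derive(Debug, PartialEq)]
enum Outcome {
    Success,
    Failed(i32), // non-zero exit code
    Killed(i32), // terminated by this signal number (Unix only)
    Unknown,
}

/// Turn an ExitStatus into an Outcome.
fn classify(status: ExitStatus) -> Outcome {
    if status.success() {
        return Outcome::Success;
    }
    if let Some(code) = status.code() {
        return Outcome::Failed(code);
    }
    // On Unix, code() is None when the process was terminated by a signal.
    #[cfg(unix)]
    {
        use std::os::unix::process::ExitStatusExt;
        if let Some(sig) = status.signal() {
            return Outcome::Killed(sig);
        }
    }
    Outcome::Unknown
}

/// Run `program` with `args` `runs` times and collect the outcome of each run.
fn run_many(program: &str, args: &[&str], runs: usize) -> std::io::Result<Vec<Outcome>> {
    let mut outcomes = Vec::with_capacity(runs);
    for _ in 0..runs {
        let status = Command::new(program)
            .args(args)
            .stdout(Stdio::null()) // discard the child's output to keep logs short
            .stderr(Stdio::null())
            .status()?;
        outcomes.push(classify(status));
    }
    Ok(outcomes)
}

fn main() -> std::io::Result<()> {
    // Placeholder command and run count; substitute the binary under test.
    let outcomes = run_many("/bin/ls", &["/"], 20)?;
    let bad = outcomes.iter().filter(|o| **o != Outcome::Success).count();
    for (i, outcome) in outcomes.iter().enumerate() {
        if *outcome != Outcome::Success {
            println!("run {}: {:?}", i + 1, outcome);
        }
    }
    println!("{} of {} runs did not succeed", bad, outcomes.len());
    Ok(())
}
```

Since the failure seems to hit only the first run, the per-run index in the output is the interesting part: if the theory holds, only run 1 should ever show up as `Killed`.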