Mysterious panic in macOS cargo test

I'm working to get the CI of GitHub - nix-community/lorri: Your project’s nix-env [maintainer=@Profpatsch,@nyarly] back to green, and I've come to what I'm hoping is the last blocker there.

cargo test panics consistently, only on Mac runners, and all the error it produces is:

# fatal runtime error: failed to initiate panic, error 5
# error: test failed, to rerun pass '--lib'
#
# Caused by:
#   process didn't exit successfully: `/Users/runner/work/lorri/lorri/target/debug/deps/lorri-f086cde207eb53b5` (signal: 6, SIGABRT: process abort signal)

(c.f. CI: update nixpkgs, add workflow dispatch · nix-community/lorri@ed765a9 · GitHub)

The context doesn't include the test that was running, which makes it hard to start tracking down the issue.

I'll admit, I'm at a loss even to understand what "failed to initiate panic" means.

Any advice as to how to investigate this would be appreciated.

If you have direct access to a Mac that cargo test fails on, I would run it under LLDB and get a backtrace when the SIGABRT signal is raised.

The "failed to initiate panic" means some code triggered a panic and for whatever reason, the __rust_start_panic() function didn't unwind the stack.

If you haven't already, try setting the RUST_BACKTRACE environment variable to 1. The code might run long enough to call the default panic hook (which prints a backtrace to stderr) before starting the unwinding process.

Is this all pure Rust code, or could there be something non-standard going on (modifying the panic machinery, Rust is calling C which calls back into Rust, wierd hardware/libc, etc.)? I don't think I've ever seen the "failed to initiate panic" before.

2 Likes

The way I find things I can not find is to only look at half of the problem.
Maybe you can bisect? Run half the tests and see if the failure still happens. Repeat halving tests untill you narrow it down to one test.

Y'all are making good arguments for borrowing a Mac and setting up a dev environment on it to test.

It's worse than that; while the test fails consistently on Github's macos-latest runner, it runs fine on my borrowed macbook. I'm not relishing the prospect of running a test-bisection on CI, but what can you do?

You say "Screw you, Github!" and try a different CI provider. Probably Gitlab, or something that's just CI, like Circle CI.

Result of todays work:
The current state is is that macOS fails the unit tests in a mysterious way:

# fatal runtime error: failed to initiate panic, error 5
#      Running unittests (target/debug/deps/lorri-9519bc6214935066)
...
# Caused by:
655
#   process didn't exit successfully: `/Users/runner/work/lorri/lorri/target/debug/deps/lorri-f086cde207eb53b5` (signal: 6, SIGABRT: process abort signal)

This happens when running the full CI suite, or all the unit tests (either with cargo test --lib or cargo test -- ... with all the unit tests named. Running unit tests individually or in groups of half the set both pass.

The bisected run is especially confusing; the suite failing, but individual tests passing leads me to the hypothesis that there's a pair of tests that are somehow interacting. But the bisection runs all 6 pairs of two quarters of the suite (c.f. nix/ci/default.nix), so every pair should be running together there.

I'm starting to be inclined to configure a Mac test run that runs everything but unit tests, and then runs the units individually, shameful that that solution would be.

c.f. CI: update nixpkgs, add workflow dispatch · nix-community/lorri@4ba8b64 · GitHub where "Rust individual tests" succeeds, but "Rust unit tests" and "Rust and CI tests" fail.

1 Like

I'm not sure how likely that is to fix the issue, and the process of converting the CI over to gitlab or Circle isn't trivial.

Reupping a question from earlier: Does this test call into the C++ libraries? Is there any Rust-C++-Rust FFI happening?

I don't believe there's any FFI anywhere. I think I've narrowed down the cause to here:

That seems like a fairly straightforward bit of code, so why it would trigger this weird panic handler issue, but only in GHA macos-latest is unclear to me.

I think I see the source of your question though: if a panic is crossing FFI boundaries, that might cause this? I've seen reference elsewhere to using catch_unwind to protect against this issue.

Looks like expect_bash_can_fail should be your minimum reproduction set?

That's an interesting point, because running each unit test individually succeeds, including that one. c.f. CI: update nixpkgs, add workflow dispatch · nix-community/lorri@3ce75f6 · GitHub

That test runs fine in a batch of 10 tests: CI: update nixpkgs, add workflow dispatch · nix-community/lorri@3ce75f6 · GitHub (see also L268 and L270)

So there's something more going on to trigger the issue.

There is a chance that the Mac version of bash behaves badly with those flags. It depends on the specific version of bash installed on the CI bash.
I remember an issue like this on a different project.

If this is the case, then it's possible that the "borrowed" MacBook had a custom bash installed on it that fixed the bug. The default /bin/bash that is installed on MacBooks is old (mine says 3.2.57(1)-release); it's common to use e.g. Homebrew to install a more recent version and put that first in path (I have 5.1.16(1)-release installed on mine through Homebrew).

I wonder if the GHA runner only has the default bash; that could explain why it only failed there, and not on the one you borrowed.

Two things make me skeptical about that:

  1. run on its own, this test runs fine, so isolating the issue to bash on Mac strikes me as unlikely. Might be a contributing factor.

  2. the test script runs with a Nix provided bash, not the default Mac one.

c.f.

#!/nix/store/3p9daf8ic481vj4nrpkf20br0p43wbpg-execline-2.8.3.0-bin/bin/execlineb -Ws0
"importas" "OLDPATH" "PATH" "/nix/store/zay6x4zdrvm2xypdf8q561q8qcmrg959-s6-2.11.1.0/bin/s6-envdir" "/nix/store/nprmiqspd87y73h09rq4rnn5b9dwdyjc-dumped-env" "/nix/store/fcpwlfxy0z9nk9q8x5mv2vwb46iiy902-PATH_prepend" "$OLDPATH" "/nix/store/fcpwlfxy0z9nk9q8x5mv2vwb46iiy902-PATH_prepend" "/nix/store/j6wxwdvm7iry6f600njlsr5vbdhmjyyw-rustc-1.60.0/bin:/nix/store/ag2bpk0lzjvj409znklrz5krkpc5imzs-gcc-wrapper-11.3.0/bin:/nix/store/8mgbymr21ic0wic7829zz7lvjvzyrdal-rustfmt-1.60.0/bin:/nix/store/k3581svqxwzr2xhwkvv2v1qq2ii3prh2-clippy-1.60.0/bin:/nix/store/h8mvg8484hhnzw552gh47hhhdfaj0zkm-cargo-1.60.0/bin" "export" "RUST_BACKTRACE" "full" "export" "BUILD_REV_COUNT" "1" "export" "RUN_TIME_CLOSURE" "/nix/store/vrmd0v3grr4ggc2p32cdjc6igzpa4707-runtime-closure.nix" "/nix/store/fcpwlfxy0z9nk9q8x5mv2vwb46iiy902-PATH_prepend" "/nix/store/7jr7pr4c6yb85xpzay5xafs5zlcadkhz-coreutils-9.0/bin:/nix/store/iffl6dlplhv22i2xy7n1w51a5r631kmi-bash-5.1-p16/bin:/nix/store/i9wqpzj09mi14rrqylyqsp44ciznp5mc-nix-2.8.1/bin:/nix/store/cnbcjcwha08v4fi3gf89jcwiyp37jvyw-direnv-2.31.0/bin" "/nix/store/h8mvg8484hhnzw552gh47hhhdfaj0zkm-cargo-1.60.0/bin/cargo" "test" "--no-fail-fast"

where path_PREPEND is

#!/nix/store/3p9daf8ic481vj4nrpkf20br0p43wbpg-execline-2.8.3.0-bin/bin/execlineb -Ws1
"importas" "PATH" "PATH" "export" "PATH" "${1}:${PATH}" "exec" "$@"

(apologies, these are both generated scripts)

Because this is a Nix-provided Bash, it should be the same Bash in both places.

1 Like

The closest thing I came to a solution here was running with --threads 1 so that I could identify a failing test case, and then using a cfg_attr to exclude that test from macOS; hardly ideal.

Doubly so because running the tests on a different branch produces a panic in a different test. Iterating the process would mean slowly removing every test from macOS - basically we'd have to stop supporting that platform.