Help wanted debugging failing async tests in an open source project

I'm unable to reproduce on my machine these failing async tests:

Especially for the tests which are only failing on macOS, since I don't have access to any Macs.

Any help in debugging is very much appreciated.

As you have commented yourself, it is probably a race condition. The unfortunate thing about race conditions is that they are Heisenbugs - which means that there is no good way of directly debugging it. You can try looking at the panic location and work your way backwards from it (like why it panicked at all).

If it's a race condition, it doesn't depend which platform (aka OS) you are on. It so happens that the MacOS vms on Github (probably Azure, but I am not sure) trigger the race condition, but that's where the significance ends.

1 Like

If anyone could reliably reproduce the failure on their machine, that would make it easier for debugging, right?

That's the thing with Heisenbugs - it cannot be reliably reproduced. It is a matter of probability - there is no telling when it will be trigerred. In a way, I think that it is lucky that this got caught early on in a CI pipeline - bugs of this nature are a nightmare to detect.
I'd recommend considering the debug dump to be correct and use it for debugging.

2 Likes

Right... I need to set RUST_BACKTRACE=full, yes? The current stack trace seems to be incomplete. I had the misguided impression that setting it to short wouldn't truncate the stack trace.

Yes that should give you a better trace.

1 Like

Looks like it's possible to SSH into the runner: GitHub - mxschmitt/action-tmate: Debug your GitHub Actions via SSH by using tmate to get access to the runner system itself.

Turns out the stack trace is pretty much useless, as it doesn't cover the async tasks.

welp... I guess a debugger is not going to be much help either:

Debugging a crashdump of an async Rust program requires both intimiate knowledge of executor runtime internals and a certain level of expertise in using debuggers.

My best bet seems to be to switch to tokio and use tokio-console?

Considering that async-std seems to be semi-abandoned according to this thread:

I guess I'm misunderstanding something. Turning on the tracing features of the relevant crates (and setting up tracing before each test) should help, right?

Is that a question for how to use tokio-console?

It's not. Just trying to figure out how I could debug these test failures.

It might be better to read the code which is invoked by the failing test. Conventional debugging often falls very flat on its face when you have race conditions.

1 Like

Hmm... That's the thing, I've read the code over and over again, but don't see where it could go wrong. I think tracing would really help. Or having a full stack trace + a sequence diagram :joy:

So I've switched to using tracing instead of println! in the tests.

This particular race condition seems to be caused by the RwLock<VecDeque<_>> where the expected values are stored:

We can see from the log output that the Message::Handshake(_) is received here:

but then the assert happens too late as it's blocked on the RwLock::write call here:

i.e. the asserts can happen in the wrong order because of how RwLock works.

Switched to using crossbeam_queue::SegQueue, but now I have a problem with access to the 2 queues not being synchronized:

i.e. downwards_expected_types.pop() and downwards_expected.pop() may get interleaved between threads.

Unfortunately, I've tried but I cannot easily combine the 2 queues.

So, same problem? :disappointed:

Not sure what else I can try anymore:

But will try to add tracing logs for all the operators in my library.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.