Help wanted debugging failing async tests in an open source project

teohhanhui · January 5, 2022, 1:32pm

I'm unable to reproduce on my machine these failing async tests:

https://github.com/teohhanhui/callbag-rs/issues/2

Especially for the tests which are only failing on macOS, since I don't have access to any Macs.

Any help in debugging is very much appreciated.

RedDocMD · January 5, 2022, 1:37pm

As you have commented yourself, it is probably a race condition. The unfortunate thing about race conditions is that they are Heisenbugs, which means there is no good way of debugging them directly. You can try starting from the panic location and working your way backwards from it (for example, asking why it panicked at all).

If it's a race condition, it doesn't depend on which platform (i.e. OS) you are on. It just so happens that the macOS VMs on GitHub (probably Azure, but I am not sure) trigger the race condition, but that's where the significance ends.

teohhanhui · January 5, 2022, 1:38pm

If anyone could reliably reproduce the failure on their machine, that would make it easier for debugging, right?

RedDocMD · January 5, 2022, 1:41pm

That's the thing with Heisenbugs: they cannot be reliably reproduced. It is a matter of probability, and there is no telling when one will be triggered. In a way, I think it is lucky that this got caught early on in a CI pipeline, because bugs of this nature are a nightmare to detect.
I'd recommend treating the debug dump as correct and using it for debugging.
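A race cannot be made deterministic, but running the suspect scenario many times on several threads at once usually raises the odds of seeing it locally. Below is a minimal std-only sketch of that idea. The `stress` helper and its parameters are hypothetical names made up for illustration; they are not part of callbag-rs or any test framework.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical helper: runs `scenario` `rounds` times on each of `threads`
/// OS threads concurrently and returns how many runs panicked.
/// The closure receives a unique run index so failures can be identified.
pub fn stress<F>(threads: usize, rounds: usize, scenario: F) -> usize
where
    F: Fn(usize) + Sync,
{
    let failures = AtomicUsize::new(0);
    // Scoped threads may borrow `scenario` and `failures` from this frame.
    thread::scope(|s| {
        for t in 0..threads {
            let scenario = &scenario;
            let failures = &failures;
            s.spawn(move || {
                for r in 0..rounds {
                    let run = t * rounds + r;
                    // Catch the panic so one failing run does not stop the loop.
                    let outcome = panic::catch_unwind(AssertUnwindSafe(|| scenario(run)));
                    if outcome.is_err() {
                        failures.fetch_add(1, Ordering::Relaxed);
                    }
                }
            });
        }
    });
    failures.load(Ordering::Relaxed)
}
```

Calling something like `stress(8, 1_000, |_| run_the_flaky_scenario())` (where `run_the_flaky_scenario` is a placeholder for the body of the failing test) gives a rough failure rate. A result of zero does not prove the race is absent, only that it did not fire in those runs.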

teohhanhui · January 5, 2022, 1:48pm

Right... I need to set RUST_BACKTRACE=full, yes? The current stack trace seems to be incomplete. I had the misguided impression that setting it to short wouldn't truncate the stack trace.

RedDocMD · January 5, 2022, 1:48pm

Yes that should give you a better trace.
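If setting the environment variable on the CI runner is awkward, a panic hook can force a backtrace from inside the program. This is only a sketch using the standard library's `std::backtrace::Backtrace::force_capture`, which captures regardless of `RUST_BACKTRACE`. The function names `full_trace` and `install_verbose_panic_hook` are made up for illustration.

```rust
use std::backtrace::Backtrace;
use std::panic;
use std::thread;

/// Captures a backtrace of the calling thread, ignoring RUST_BACKTRACE.
pub fn full_trace() -> String {
    Backtrace::force_capture().to_string()
}

/// Hypothetical helper: installs a process-wide panic hook that prints the
/// panicking thread's name, the panic message and location, and a forced
/// backtrace to stderr.
pub fn install_verbose_panic_hook() {
    panic::set_hook(Box::new(|info| {
        let current = thread::current();
        let name = current.name().unwrap_or("<unnamed>");
        eprintln!("thread '{name}' panicked: {info}");
        eprintln!("{}", full_trace());
    }));
}
```

The hook only shows the OS-level stack of the thread that panicked, so it has the same blind spot as any other stack trace when the interesting history lives in other threads or tasks.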

teohhanhui · January 5, 2022, 5:26pm

Looks like it's possible to SSH into the runner: GitHub - mxschmitt/action-tmate: Debug your GitHub Actions via SSH by using tmate to get access to the runner system itself.

teohhanhui · January 6, 2022, 9:20am

Turns out the stack trace is pretty much useless, as it doesn't cover the async tasks.

welp... I guess a debugger is not going to be much help either:

Debugging a crashdump of an async Rust program requires both intimate knowledge of executor runtime internals and a certain level of expertise in using debuggers.

github.com

rust-lang/async-crashdump-debugging-initiative/blob/master/CHARTER.md

# 📜 async-crashdump-debugging Charter

The goal of this initiative is to provide tools that simplify debugging crashdumps of async Rust programs.

## Proposal

In particular the initiative strives to provide:

- Debugger plugins for GDB and WinDbg/CDB that help with
  - finding executors from various runtimes (future-rs, Tokio, smol)
  - listing tasks currently being owned by a given executor
  - mapping a task object to the source definition it represents
  - creating logical stack traces that shows dependencies between tasks and resources
- A test framework for debugger plugins
- A guide for writing plugins for other debuggers
- A guide for forking and extending debugging plugins in order to support custom, proprietary executor runtimes
- Integration with tokio-console (?)

Debugging a crashdump of an async Rust program requires both intimate knowledge of
executor runtime internals and a certain level of expertise in using debuggers.


github.com

rust-lang/wg-async/blob/master/src/vision/submitted_stories/shiny_future/grace_debugs_a_crash_dump_again.md

# ✨ Shiny future stories: Grace debugs a crash dump again

## 🚧 Warning: Draft status 🚧

This is a draft "shiny future" story submitted as part of the brainstorming period. It is derived from what actual Rust users wish async Rust should be, and is meant to deal with some of the challenges that Async Rust programmers face today.

If you would like to expand on this story, or adjust the answers to the FAQ, feel free to open a PR making edits (but keep in mind that, as people's needs and desires for async Rust may differ greatly, shiny future stories [cannot be wrong]. At worst they are only useful for a small set of people or their problems might be better solved with alternative solutions). Alternatively, you may wish to [add your own shiny vision story][htvsq]!

## The story

It's been a few years since the new [DistriData] database has shipped. For the most part things have gone smoothly. The whole team is confident in trusting the compiler, and they have far fewer bugs in production than they had in the old system. The downside is that now when a bug does make it to production, it tends to be really subtle and take a lot of time to get right.

Today when Grace opens her e-mail, she discovers she's been assigned to investigate a dump from a crash that has been occurring in production lately. The crash happens rarely, so it's important to glean as much information as possible. They need to get this fixed soon!

Even though there's a lot of pressure around this situation, Grace is grateful that she won't have to fight her tools to make progress. A lot has changed in Async Rust over the years. The async community got together and defined the Async Debugging Protocol, which provides a standard way for tools to inspect the state of an asynchronous Rust program. Many of the most popular runtimes like Tokio and async-std follow this protocol, and a number of tools have been written to use the protocol as well. Even though Grace's team has opted to build a custom runtime to address their own unique needs, it was not too much work to implement the Async Debugging Protocol and it was well worth it due to the increase in developer productivity. This has truly revolutionized async debugging in much the same way the [Language Server Protocol] did for IDEs.

Upon opening the crash dump, her favorite debugger immediately gives an overview of the state of the program at the point it crashed. It shows what executors are running, how many OS-level threads each executor is using, what tasks are there, and what the state of each task is. For each thread, Grace can see a stack trace and the debugger provides a logical stack trace for each task as well. Many of the resources that the blocked tasks are waiting on are visible too, particularly those provided by the runtime like timers, mutexes, and I/O.

This high level, generic view provides a good start, but the team's custom executor provides additional functionality that the Async Debugging Protocol does not support. Still, using the features already provided as a starting point, Grace was able to write some additional debugging macros to recover the additional state. These macros are used by the whole team and are now a standard part of their debugging toolkit.


teohhanhui · January 6, 2022, 9:23am

My best bet seems to be to switch to tokio and use tokio-console?

Considering that async-std seems to be semi-abandoned according to this thread:

teohhanhui · January 6, 2022, 8:33pm

I guess I'm misunderstanding something. Turning on the tracing features of the relevant crates (and setting up tracing before each test) should help, right?

alice · January 6, 2022, 8:40pm

Is that a question for how to use tokio-console?

teohhanhui · January 6, 2022, 9:05pm

It's not. Just trying to figure out how I could debug these test failures.

RedDocMD · January 7, 2022, 3:32am

It might be better to read the code which is invoked by the failing test. Conventional debugging often falls very flat on its face when you have race conditions.

teohhanhui · January 7, 2022, 4:07am

Hmm... That's the thing, I've read the code over and over again, but don't see where it could go wrong. I think tracing would really help. Or having a full stack trace + a sequence diagram
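For the "sequence diagram" part, one low-tech option is a shared event log that every callback appends to, which can then be printed after the test as an ordered list of who did what. The sketch below is a std-only stand-in, not the `tracing` crate, and `EventLog`, `Event`, `record`, `snapshot`, and `render` are hypothetical names.

```rust
use std::sync::Mutex;
use std::thread::{self, ThreadId};

/// One recorded step: its position in the log, the thread that recorded it,
/// and a free-form label.
#[derive(Debug, Clone)]
pub struct Event {
    pub seq: usize,
    pub thread: ThreadId,
    pub label: String,
}

/// Hypothetical append-only log shared between threads.
#[derive(Debug, Default)]
pub struct EventLog {
    events: Mutex<Vec<Event>>,
}

impl EventLog {
    /// Appends an event. The sequence number is assigned while the lock is
    /// held, so it matches the order in which threads entered `record`.
    pub fn record(&self, label: impl Into<String>) {
        let mut events = self.events.lock().unwrap();
        let seq = events.len();
        events.push(Event {
            seq,
            thread: thread::current().id(),
            label: label.into(),
        });
    }

    /// Returns a copy of everything recorded so far.
    pub fn snapshot(&self) -> Vec<Event> {
        self.events.lock().unwrap().clone()
    }

    /// Renders one line per event: sequence number, thread id, label.
    pub fn render(&self) -> String {
        self.snapshot()
            .iter()
            .map(|e| format!("{:>4} {:?} {}", e.seq, e.thread, e.label))
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

The log shows the order in which threads reached `record`, which is not necessarily the order in which the underlying events happened, so it narrows down an interleaving rather than proving one.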

teohhanhui · January 9, 2022, 5:18pm

So I've switched to using tracing instead of println! in the tests.

This particular race condition seems to be caused by the RwLock<VecDeque<_>> where the expected values are stored:

We can see from the log output that the Message::Handshake(_) is received here:

github.com

teohhanhui/callbag-rs/blob/d8d6a50ac5f100c7aa001ab57b42bbfc76d01342/tests/share.rs#L375


      
            )
        }
    };

    let make_sink_b = {
        let nursery = nursery.clone();
        move || {
            let talkback = Arc::new(ArcSwapOption::from(None));
            Arc::new(
                (move |message| {
                    info!("down (b): {:?}", message);
                    {
                        let downwards_expected_types_b =
                            &mut *downwards_expected_types_b.write().unwrap();
                        let et = downwards_expected_types_b.pop_front().unwrap();
                        assert!(et.0(&message), "downwards B type is expected: {}", et.1);
                    }
                    if let Message::Handshake(source) = message {
                        talkback.store(Some(source));
                    } else if let Message::Data(data) = message {
                        {

but then the assert happens too late as it's blocked on the RwLock::write call here:

github.com

teohhanhui/callbag-rs/blob/d8d6a50ac5f100c7aa001ab57b42bbfc76d01342/tests/share.rs#L378


      
          
          
    let make_sink_b = {
        let nursery = nursery.clone();
        move || {
            let talkback = Arc::new(ArcSwapOption::from(None));
            Arc::new(
                (move |message| {
                    info!("down (b): {:?}", message);
                    {
                        let downwards_expected_types_b =
                            &mut *downwards_expected_types_b.write().unwrap();
                        let et = downwards_expected_types_b.pop_front().unwrap();
                        assert!(et.0(&message), "downwards B type is expected: {}", et.1);
                    }
                    if let Message::Handshake(source) = message {
                        talkback.store(Some(source));
                    } else if let Message::Data(data) = message {
                        {
                            let downwards_expected_b = &mut *downwards_expected_b.write().unwrap();
                            let e = downwards_expected_b.pop_front().unwrap();
                            assert_eq!(data, e, "downwards B data is expected: {}", e);

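i.e. the asserts can run in a different order from the one in which the messages were logged, because the message is logged before the lock is taken and RwLock makes no guarantee about which waiting thread acquires it next.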
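The ordering point can be seen in isolation with a small std-only sketch: a lock makes the writes mutually exclusive, but it does not make them happen in spawn order or arrival order. `acquisition_order` is a hypothetical name used only for this illustration.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Spawns `n` threads; thread `i` pushes `i` into a shared Vec while holding
/// the write lock. Returns the order in which the writes actually happened.
pub fn acquisition_order(n: usize) -> Vec<usize> {
    let order = Arc::new(RwLock::new(Vec::with_capacity(n)));
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let order = Arc::clone(&order);
            thread::spawn(move || {
                // Exclusive access is guaranteed here; position in line is not.
                order.write().unwrap().push(i);
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // Bind the clone to a local so the read guard is dropped before `order`.
    let result = order.read().unwrap().clone();
    result
}
```

Every index shows up exactly once in the result, but the sequence can differ from run to run, so a test that pops expected values from a queue in a fixed order is sensitive to this scheduling.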

teohhanhui · January 11, 2022, 9:13am

Switched to using crossbeam_queue::SegQueue, but now I have a problem with access to the 2 queues not being synchronized:

i.e. downwards_expected_types.pop() and downwards_expected.pop() may get interleaved between threads.

Unfortunately, I've tried but I cannot easily combine the 2 queues.

So, same problem?
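One way to keep the two pops from interleaving without merging the queues into one is to put both of them behind a single lock, so the type check and the data check happen in one critical section. This is a sketch under assumptions: `Kind`, `Expectations`, and `check` are hypothetical stand-ins, the payload is simplified to `i32`, and it is not how the callbag-rs tests are written.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Hypothetical stand-in for the message type being checked.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Handshake,
    Data,
    Terminate,
}

/// Both queues live in one struct so a single lock covers them.
struct Queues {
    kinds: VecDeque<Kind>,
    data: VecDeque<i32>,
}

pub struct Expectations {
    inner: Mutex<Queues>,
}

impl Expectations {
    pub fn new(
        kinds: impl IntoIterator<Item = Kind>,
        data: impl IntoIterator<Item = i32>,
    ) -> Self {
        let queues = Queues {
            kinds: kinds.into_iter().collect(),
            data: data.into_iter().collect(),
        };
        Self { inner: Mutex::new(queues) }
    }

    /// Pops the expected kind and, when a payload is given, the expected
    /// value under the same lock. Returns an error instead of panicking so
    /// a mismatch does not poison the mutex for other threads.
    pub fn check(&self, kind: Kind, payload: Option<i32>) -> Result<(), String> {
        let mut q = self.inner.lock().unwrap();
        let expected_kind = q
            .kinds
            .pop_front()
            .ok_or_else(|| "no more message types expected".to_string())?;
        if expected_kind != kind {
            return Err(format!("expected {expected_kind:?}, got {kind:?}"));
        }
        if let Some(value) = payload {
            let expected = q
                .data
                .pop_front()
                .ok_or_else(|| "no more data expected".to_string())?;
            if expected != value {
                return Err(format!("expected data {expected}, got {value}"));
            }
        }
        Ok(())
    }

    /// True once every expected kind and value has been consumed.
    pub fn is_exhausted(&self) -> bool {
        let q = self.inner.lock().unwrap();
        q.kinds.is_empty() && q.data.is_empty()
    }
}
```

This only makes the pair of pops atomic. If two threads can still reach `check` in either order, the expected sequence itself has to tolerate both orders, or the test has to establish the order some other way.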

teohhanhui · January 16, 2022, 9:18am

Not sure what else I can try anymore:

https://github.com/teohhanhui/callbag-rs/issues/2#issuecomment-1011922051

But will try to add tracing logs for all the operators in my library.

system · April 16, 2022, 9:19am

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.
