[Solved] Debugging advice for race condition in release target

I've posted a rather long-winded explanation in a short, self-contained, and correct example (SSCCE) on github.

The gist of it is, my program ain't talking right with my database. It sporadically fails to query the database due to a 'Packet out of order'. AFAICT, this is a MySQL packet that's out of order, not a TCP packet. And it's wily. Put one too many debug! statements in the code, and it acts right again. Appears timing is a factor, which means stepping through the program with a debugger ain't likely to find anything out of place.

I pulled Wireshark out of my toolbox and gave it go. Maybe problem lies on my network or the database server itself. I captured the network traffic on the same computer I ran my SSCCE, and Wireshark said there be Malformd Packets here! Well, long story short, the MySQL dissector in Wireshark twasn't reliable, so I straightened that out, and all the MySQL packets report 8-bit wrapping sequence numbers in proper order. Plus I'm testing against two databases: MySQL 5.5 and MariaDB 10.1.21. Tweren't no network or server issue.

So I've cobbled together this SSCCE hoping to appeal to the community's well of wisdom. How does one go about tracking down a race condition such as this?

Usual disclaimer. Testing with Rust stable (rustc 1.17.0 56124baa9 2017-04-24) and Rust nightly (rustc 1.19.0-nightly f89d8d184 2017-05-30) on ArchLinux. All other details are provided in the SSCCE.

Maybe this is a job for rr? It does involve gdb, which you say you're not comfortable with, but only after the run ends.

1 Like

I'll give rr a shot. I like that it records a run first, then I can step through the save file after. Thanks!

So I installed rr v4.5.0 and gave it a whirl. First run, my SSCCE ran fine. No worries, it has at least a 20% failure rate for me, so I just need to run it a few more times. But wait! rr says it doesn't perform as well with my kernel cpufreq governor, schedutil, and tells me how to switch to performance... done. Let's get back to debugging.

DSN=mysql://testbot:testbot@db.local/fake_data RUST_LOG=bug_test_01=debug rr record ./target/debug/bug-test-01

Ten attempts later, and no "Packet out of order" error. Doesn't seem like rr is going to work. Maybe it just interferes with the timing. I may be jumping to a conclusion here, so I run the SSCCE by itself a couple dozen times or so. No errors. I usually see four by now.

Seriously? My kernel cpufreq governor causes this glitch?

I switched back to schedutil, ran the SSCCE 355 more times, and not a single error. Ok, one last look through my command history to see if I'm missing something.

I am. My earlier commands that exhibited failure ran cargo run --release --bin bug-test-01. I inadvertently switched back to the debug binary when I started using rr.

Oh, the trials and tribulations of my so-called programming career. Let's begin again.

DSN=mysql://testbot:testbot@db.local/fake_data RUST_LOG=bug_test_01=debug rr record ./target/release/bug-test-01

Two runs later, I have a recording exhibiting the 'Packet out of order' error. This is starting to look promising again. Though, I get the feeling by the time I finish this project, I'll be running the debug build in production.

1 Like

Update. This still isn't resolved but I'm making progress at a glacial pace.

The bug only presents itself in the release build, and the release build is optimized (and there was much rejoicing :tada:) which means most of the variables I want to peek at are <optimized out>, which further means that gdb's print var won't work most of the time.

Instead, I step through the assembly, correlate the assembly with the Rust code (which is not-trivially difficult since the release build reordered the machine code), and identify a register expression (e.g. $rbp-0x148) that allows me to see the value otherwise stored in my <optimized out> variable.

Thanks goes out to breeden4 and kmc on #rust irc who introduced me to test::blackbox(&x). I use it to access a variable of interest in my code, recompile, rr record, and track down the asm for the call. It's fairly trivial to break on the test::black_box() line and extract the register expression from the disassembly.

Maybe I can feed it a Rust block expression, and force the optimizer to ignore the code under scrutiny. I might try that later.

One other detour I got snagged on is setting a conditional breakpoint in GDB for my register expression. break <file>:<line> if *($rbp-0x148) > 1000 produced a "generic pointer dereference" error. So I'm thinking C casting is in order (gdb is written in C), but break <file>:<line> if *((unsigned long*)($rbp-0x148)) > 1000 produced a "syntax error".

The correct way to cast in GDB is to use the syntax of the language your debugging. GDB provides a show language command to let you know what language it thinks it's working with. The correct break expression is break <file>:<line> if *(($rbp-0x148) as &usize) > 1000.

When the SSCCE runs correctly, the MySQL packet size never exceeds 1000. The "Packet out of order" error occurs when the program glitches and finds a packet size upwards of 32kB in the MySQL packet header, and goes off to the races. The conditional breakpoint above will stop at this exact moment in the program.

From here, I hope to extract the MySQL packets from a Wireshark capture so I know what is supposed to end up in my program's buffer, and identify whether my program is looking at garbage in the buffer or looking at the wrong bytes in the buffer.

The mysql_async library my program uses has a multiple buffer approach to processing the data from TcpStream. First it reads 4kB chunks (or less) into a local variable let buf = [0u8: 4096];, after each read it copies this local buffer into a std::collections::vec_deque::VecDeque<u8> using its .extend method, then repeats until zero bytes are read from the stream. Then it's onward to parsing the MySQL packet.

I'm trying to step through the source code for mysql_async and vec_deque, but the latter compiled with a source path of /checkout/src/libcollections/vec_deque/veq_deque.rs, which causes problems with the rustup's rust-src component, which doesn't have /checkout in the path. Now I'm off to sort that matter out.

Cheers

You can set source mappings to properly point GDB towards rust-src's copy of the source files: Source Path (Debugging with GDB)

1 Like

I think you're talking about set substitute-path from to. I overlooked that when I learned about directory. Thanks for setting me back on track!

The bug I was pursuing was solved awhile back with the help of @inejge and @Mark_Simulacrum. @inejge asked me to write about the experience and I just posted the article [1].

Be warned. It's a long read, but goes into a fair amount of detail. I hope it will serve to provide others with the information I lacked when I started this topic.

Cheers!

[1]: Debugging a Race Condition in a Release Target

5 Likes

Loved the blog. Thank you for writing it up!

1 Like