How to debug a randomly segfaulting Rust program?

I have a little Rust program that spawns a few scoped threads that normally should run forever. In order to shutdown the threads nicely I catch SIGTERM with the ctrlc. Which then sends a message via crossbeam channel which is tested by the treads as they loop with ctrl_c_events.is_empty().

The problem is that on rare occasions the signal causes the threads to shut down and return nicely but then either the program hangs forever with 100% CPU usage or fails with a setfault.

I do not have any unsafe in the code I have written. There is no normal Rust panic.

I finally figured out how to get a core dump when this fails and use rust-gdb on it to get a stack trace. Which does not show much except that it ends up in somewhere in libc or libpthread. I'd like to at least get a stack trace that shows the last line of Rust code executed before diving into libc or whatever.

Here is my stack trace:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x0000007f7bd3b974 in __GI_abort () at abort.c:79
#2  0x0000007f7bd7472c in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f7be357e0 "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x0000007f7bd7aaa4 in malloc_printerr (str=str@entry=0x7f7be30ee0 "malloc_consolidate(): invalid chunk size") at malloc.c:5342
#4  0x0000007f7bd7ada8 in malloc_consolidate (av=av@entry=0x7f64000020) at malloc.c:4471
#5  0x0000007f7bd7c73c in _int_free (av=0x7f64000020, p=0x7f64002c20, have_lock=<optimized out>) at malloc.c:4392
#6  0x0000007f7be19738 in tcache_thread_shutdown () at malloc.c:2979
#7  arena_thread_freeres () at arena.c:950
#8  0x0000007f7be19874 in __libc_thread_freeres () at thread-freeres.c:29
#9  0x0000007f7bf36098 in start_thread (arg=0x7fe665d0cf) at pthread_create.c:476
#10 0x0000007f7bdd80cc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

My whole code is a bit big but basically I run threads like so:

    let (t1, t2, t3) = thread::scope(|scope| {
        let t1 = scope.spawn(|| {
            sig_status_task(&session, ctrl_c_events.clone(), nats_connection.clone())
        });

        let t2 = scope.spawn(|| {
            det_status_task(
                &session,
                ctrl_c_events.clone(),
                nats_connection.clone(),
            )
        });

        let t3 = scope.spawn(|| {
            control_task(&session, &args.pin1, &args.pin2, &ctrl_c_events)
        });

        // Get results of all threads and return
        (t1.join().unwrap(), t2.join().unwrap(), t3.join().unwrap())
    });

    info!(
        "Tasks ended with:\n    t1: {:?}\n    t2: {:?}\n    t3: {:?}\n",
        t1, t2, t3
    );

Signals are caught and messaged to threads like so:

fn ctrl_channel() -> Receiver<()> {
    const SIGNALS: &[c_int] = &[SIGTERM, SIGQUIT, SIGINT, SIGTSTP, SIGHUP, SIGCHLD, SIGCONT];

    let mut sigs = match Signals::new(SIGNALS) {
        Ok(sigs) => sigs,
        Err(e) => {
            error!("creating Signals: {e}");
            std::process::exit(0);
        }
    };
    let (sender, receiver) = bounded(100);

    spawn(move || loop {
        if let Some(signal) = sigs.into_iter().next() {
            info!("Got signal {}", signal);
            if let Err(e) = sender.send(()) {
                error!("Signal handler thread failed {}", e);
                std::process::exit(0);
            }
        }
    });
    receiver
}

The thread functions get those messages and shut down nicely like so:

                loop {
                    // Lots of stuff...
                    if !ctrl_c_events.is_empty() {
                        warn!("Got Signal.");

                        // Clean up nicely here...

                        std::thread::sleep(Duration::from_secs(1));
                        info!("waking and returning ...");
                        return Err(e.into());
                    }
                }

My log messages show that all treads make it to that info!"waking and returning...") output but never make it through the following return

Any advice would be appreciated.

The stack trace points to one of two things:

  1. Heap corruption somewhere (presumably caused by a dependency of yours with unsafe code).
  2. A bug in your libc's malloc implementation.

You could try jemallocator to switch out the malloc implementation for jemalloc. Otherwise, it's time to run under valgrind and see if you can identify the piece of code that's corrupting the heap.

1 Like

Ah yes, valgrind, it's quite a few years since I used that last.

Meanwhile running on my MacBook Pro M1 things seem to get a bit further, it returns to my main() and issues the last log messages in the program, but then doe this on leaving main():

[2024-08-07T13:30:53Z INFO  rust_ssh_test] Tasks ended with:
        t1: Err(Signal recieved)
        t2: Err(Signal recieved)
        t3: Err(Signal recieved)
    
rust_ssh_test(48424,0x1d70a1c40) malloc: *** error for object 0x60000257b180: pointer being freed was not allocated
rust_ssh_test(48424,0x1d70a1c40) malloc: *** set a breakpoint in malloc_error_break to debug
./run.sh: line 11: 48424 Abort trap: 6           RUST_LOG=info ./target/debug/rust_ssh_test -s jetson-itc:2222 -n jetson-itc -u ... other args

I have no idea how to debug on the Mac yet.

1 Like

Using jemallocator did not seem to help much.

Running under valgrind like so:

RUST_LOG=debug valgrind --leak-check=yes ./target/debug/rust_ssh_test -s "192.168.0.75:22" -u secret -p secret

does not produce any segfault having done it 50 or more times.

However there are a ton of other errors coming out of valgrind like so:

==28795== Memcheck, a memory error detector
==28795== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==28795== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==28795== Command: ./target/debug/rust_ssh_test  <secrets deleted here>
==28795== 
[2024-08-07T14:07:40Z INFO  rust_ssh_test] rust_ssh_test
[2024-08-07T14:07:42Z INFO  rust_ssh_test] Using: SSH address: 192.168.0.75:22
[2024-08-07T14:07:42Z INFO  rust_ssh_test]        NATS address: localhost
[2024-08-07T14:07:42Z INFO  rust_ssh_test]        controlling:  false
==28795== Invalid write of size 8
==28795==    at 0x759E68: core::str::pattern::StrSearcher::new (pattern.rs:1052)
==28795==    by 0x231C8B: into_searcher (pattern.rs:957)
==28795==    by 0x231C8B: <&str as core::str::pattern::Pattern>::is_contained_in (pattern.rs:986)
==28795==    by 0x2019BB: core::str::<impl str>::contains (mod.rs:1150)
==28795==    by 0x22382B: <nats::connector::ServerAddress as core::str::traits::FromStr>::from_str (connector.rs:589)
==28795==    by 0x201973: core::str::<impl str>::parse (mod.rs:2425)
==28795==    by 0x224097: <&str as nats::connector::IntoServerList>::into_server_list::{{closure}} (connector.rs:694)
==28795==    by 0x21CFDF: core::iter::adapters::map::map_try_fold::{{closure}} (map.rs:96)
==28795==    by 0x215E6B: core::iter::traits::iterator::Iterator::try_fold (iterator.rs:2411)
==28795==    by 0x21B32B: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold (map.rs:122)
==28795==    by 0x248283: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::try_fold (mod.rs:204)
==28795==    by 0x247FCF: try_for_each<core::iter::adapters::GenericShunt<core::iter::adapters::map::Map<core::str::iter::Split<char>, nats::connector::{impl#8}::into_server_list::{closure_env#0}>, core::result::Result<core::convert::Infallible, std::io::error::Error>>, fn(nats::connector::ServerAddress) -> core::ops::control_flow::ControlFlow<nats::connector::ServerAddress, ()>, core::ops::control_flow::ControlFlow<nats::connector::ServerAddress, ()>> (iterator.rs:2473)
==28795==    by 0x247FCF: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::next (mod.rs:187)
==28795==    by 0x2134CB: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter (spec_from_iter_nested.rs:26)
==28795==  Address 0x1ffeffdec0 is on thread 1's stack
==28795==  112 bytes below stack pointer
==28795== 
--28795-- WARNING: unhandled arm64-linux syscall: 291
--28795-- You may be able to write your own handler.
--28795-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--28795-- Nevertheless we consider this a bug.  Please report
--28795-- it at http://valgrind.org/support/bug_reports.html.
--28795-- WARNING: unhandled arm64-linux syscall: 291
--28795-- You may be able to write your own handler.
--28795-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--28795-- Nevertheless we consider this a bug.  Please report
--28795-- it at http://valgrind.org/support/bug_reports.html.
    ...
    Loads of program output here....
    ...
==28795== Thread 7:
==28795== Invalid write of size 8
==28795==    at 0x759E68: core::str::pattern::StrSearcher::new (pattern.rs:1052)
==28795==    by 0x6020C3: <&str as core::str::pattern::Pattern>::into_searcher (pattern.rs:957)
==28795==    by 0x5FD3A3: core::str::<impl str>::match_indices (mod.rs:1880)
==28795==    by 0x600E1F: colored::ColoredString::escape_inner_reset_sequences (lib.rs:437)
==28795==    by 0x601337: <colored::ColoredString as core::fmt::Display>::fmt (lib.rs:619)
==28795==    by 0x75701F: fmt (rt.rs:165)
==28795==    by 0x75701F: core::fmt::write (mod.rs:1168)
==28795==    by 0x19260F: <&T as core::fmt::Display>::fmt (mod.rs:2354)
==28795==    by 0x75701F: fmt (rt.rs:165)
==28795==    by 0x75701F: core::fmt::write (mod.rs:1168)
==28795==    by 0x191913: std::io::Write::write_fmt (mod.rs:1835)
==28795==    by 0x19812F: env_logger::fmt::DefaultFormat::write_args (mod.rs:485)
==28795==    by 0x1974F7: env_logger::fmt::DefaultFormat::write (mod.rs:325)
==28795==    by 0x196EAB: env_logger::fmt::Builder::build::{{closure}} (mod.rs:249)
==28795==  Address 0x663a350 is on thread 7's stack
==28795==  112 bytes below stack pointer
==28795== 
[2024-08-07T14:09:53Z INFO  rust_ssh_test] 

[2024-08-07T14:09:55Z INFO  rust_ssh_test] SSH Task ended OK.
==28795== 
==28795== HEAP SUMMARY:
==28795==     in use at exit: 8,569,774 bytes in 367 blocks
==28795==   total heap usage: 9,538 allocs, 9,171 frees, 11,687,483 bytes allocated
==28795== 
==28795== Thread 1:
==28795== 108 bytes in 1 blocks are definitely lost in loss record 1 of 7
==28795==    at 0x4845BFC: malloc (in /usr/lib/valgrind/vgpreload_memcheck-arm64-linux.so)
==28795== 
==28795== 864 bytes in 3 blocks are possibly lost in loss record 2 of 7
==28795==    at 0x4847B0C: calloc (in /usr/lib/valgrind/vgpreload_memcheck-arm64-linux.so)
==28795== 
==28795== 18,772 bytes in 18 blocks are possibly lost in loss record 5 of 7
==28795==    at 0x4845BFC: malloc (in /usr/lib/valgrind/vgpreload_memcheck-arm64-linux.so)
==28795== 
==28795== LEAK SUMMARY:
==28795==    definitely lost: 108 bytes in 1 blocks
==28795==    indirectly lost: 0 bytes in 0 blocks
==28795==      possibly lost: 19,636 bytes in 21 blocks
==28795==    still reachable: 8,550,030 bytes in 345 blocks
==28795==         suppressed: 0 bytes in 0 blocks
==28795== Reachable blocks (those to which a pointer was found) are not shown.
==28795== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==28795== 
==28795== For counts of detected and suppressed errors, rerun with: -v
==28795== ERROR SUMMARY: 19 errors from 5 contexts (suppressed: 0 from 0)

Suspicious there is a couple of " Invalid write of size 8" and "WARNING: unhandled arm64-linux syscall: 291". I don't know what to make of it all.

291 is epoll_create1 and the report claims that Valgrind doesn't support it. But it's impossible to use epoll_create1 in a way that will cause memory unsafety, so you can ignore that message.

(It's weird though: support for this syscall appears to have been added to Valgrind in 2009, so even though you're running a very old Valgrind, I don't understand why it doesn't support this.)

1 Like

Advice for running Rust cost under Valgrind is to use at least 3.19, because that's the first version that had Rust-specific bugfixes in it.

That said, those "invalid writes" are worth looking into. They look like cases where a parameter has been corrupted such that the return of "normal" string searching functions is put in the wrong place.

1 Like

My plan for tomorrow is to remove everything from that little program except the threads and signal handling and see how it goes. If that is OK then start adding back logging, my use of ssh and then NATS and so on.

I have a "chaos monkey" program that can run this thing and then kill it with SIGTERM after random time intervals. Which currently produces that segfault occasionally at random.

1 Like

That valgrind is whatever came with the Ubuntu 18.04.6 that Nvidia provided for the Jetson Nano a couple of years back. Perhaps they have a newer release.

It's good to know I can ignore that syscall: 291, one less thing to worry about.

I noticed some "Invalid write" messages that referred to the coloured crate. So I removed use of that and set my test running....

1 Like

If you're looking for the dep using unsafe, cargo-geiger might help a bit.

1 Like