I'm quite lost after trying to debug this issue for the best part of a day and making no real progress. The code for this is on GitHub. The core issue is that I'm exposing a struct like the following to a C caller (which happens to be a Go program):
use std::collections::HashMap;
use std::sync::{Mutex, RwLock};
use std::thread::ThreadId;

pub struct CPointer<T> {
    inner: RwLock<Option<T>>,
    last_error: Mutex<HashMap<ThreadId, Option<Error>>>, // Error is this crate's error type
}
(last_error is effectively an errno-like value associated with each object, with one error per thread.)
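
For context, the error stash is written and read along these lines (a minimal sketch; the helper names set_error and take_error are hypothetical, not necessarily what's in the repo):

impl<T> CPointer<T> {
    // Stash an error for the calling thread (hypothetical helper).
    fn set_error(&self, err: Error) {
        self.last_error
            .lock()
            .expect("last_error lock")
            .insert(std::thread::current().id(), Some(err));
    }

    // Take (and clear) the calling thread's last error (hypothetical helper).
    fn take_error(&self) -> Option<Error> {
        self.last_error
            .lock()
            .expect("last_error lock")
            .remove(&std::thread::current().id())
            .and_then(|err| err) // collapse Option<Option<Error>>
    }
}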
The issue is that (with enough goroutines) I can trigger a segfault when the code attempts to do self.last_error.lock(). This Rust code is exposed and used through a C FFI API, so there's plenty of unsafe code, but I can't figure out which Rust safety rule I've violated to trigger the segfault (I'm really not doing anything too crazy). If you run the program under a debugger you get some very odd results:
(lldb) bt
* thread #1, name = 'cat', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
* frame #0: 0x00007ffff7e96840 libpthread.so.0`__GI___pthread_mutex_lock
frame #1: 0x00007ffff7eeafaa libpathrs.so.0`std::sys::unix::mutex::Mutex::lock::h90bdeb011d87ab3f(self=0xf871090953639369) at mutex.rs:55:16
frame #2: 0x00007ffff7f133a1 libpathrs.so.0`std::sys_common::mutex::Mutex::raw_lock::h3c4c580d9c7b8a98(self=0xf871090953639369) at mutex.rs:36:36
frame #3: 0x00007ffff7ef75a5 libpathrs.so.0`std::sync::mutex::Mutex$LT$T$GT$::lock::h8d609f61f2692a57(self=0x00000000005a9088) at mutex.rs:220:12
frame #4: 0x00007ffff7ee3f6f libpathrs.so.0`pathrs::capi::utils::CPointer$LT$T$GT$::do_wrap_err::hcdb26d4b0b6604d9(self=0x00000000005a9060, c_error=-1, func=closure-0 @ 0x00007fffffffd7b0) at utils.rs:195:8
frame #5: 0x00007ffff7ee75da libpathrs.so.0`pathrs::capi::utils::CPointer$LT$T$GT$::take_wrap_err::hf79fe63c5a065fa7(self=0x00000000005a9060, c_error=-1, func=closure-1 @ 0x00007fffffffdb68) at utils.rs:225:8
frame #6: 0x00007ffff7f20258 libpathrs.so.0`pathrs_into_fd(ptr_type=PATHRS_HANDLE, ptr=0x00000000005a9060) at transmute.rs:133:12
frame #7: 0x00000000004a29c4 cat`_cgo_84a3de375210_Cfunc_pathrs_into_fd + 53
frame #8: 0x0000000000452ed0 cat`runtime.asmcgocall at asm_amd64.s:637
frame #9: 0x0000000000450153 cat`runtime.newdefer.func2 at panic.go:242
frame #10: 0x00000000004516e6 cat`runtime.systemstack at asm_amd64.s:351
frame #11: 0x000000000042e290 cat at proc.go:1146
frame #12: 0x000000000049f27d cat`github.com/openSUSE/libpathrs/contrib/bindings/go/pathrs._Cfunc_pathrs_into_fd at _cgo_gotypes.go:232
frame #13: 0x00000000004a1091 cat`github.com/openSUSE/libpathrs/contrib/bindings/go/pathrs.(*Handle).IntoRaw at pathrs.go:403
frame #14: 0x00000000004a237b cat`main.main.func1 at cat.go:118
frame #15: 0x0000000000453751 cat`runtime.goexit at asm_amd64.s:1333
Huh, 0xf871090953639369 is a very odd pointer for a structure (and it changes on each run). Looking at the last frame where the self pointer still makes sense leads to even stranger results:
(lldb) up 3
frame #3: 0x00007ffff7ef75a5 libpathrs.so.0`std::sync::mutex::Mutex$LT$T$GT$::lock::h8d609f61f2692a57(self=0x00000000005a9088) at mutex.rs:220:12
(lldb) print *self
(std::sync::mutex::Mutex<std::collections::hash::map::HashMap<std::thread::ThreadId, core::option::Option<pathrs::error::Error>, std::collections::hash::map::RandomState> >) $0 = {
inner = 0xf871090953639369
poison = {
failed = {
v = (value = '\x1a')
}
}
data = {
value = {
base = {
hash_builder = (k0 = 0, k1 = 140737353725968)
table = {
bucket_mask = 8
ctrl = (pointer = 0x0000000000000000)
data = {
pointer = 0x0000000000000000
}
growth_left = 0
items = 449
marker = {}
}
}
}
}
}
And this is where I've been stuck for a few hours at least. The inner pointer (which is a Box) has clearly been corrupted by something, and on top of that the AtomicBool contains a strange value each time I run this example (it's a "boolean", yet it holds some random byte value). Note that while this might technically count as a poisoned mutex, my understanding of the Mutex code in the standard library is that the inner Box of the Mutex isn't modified when the lock gets poisoned -- so something else is triggering this corruption. (It's not related to poisoning -- I checked by wrapping every self.last_error.lock() in catch_unwind() and nothing hit them.)
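
To sanity-check that assumption, here's a standalone snippet (not from the repo) showing that poisoning only flips the poison flag -- a poisoned Mutex still locks fine at the pthread level and just reports the poisoning through Err:

use std::panic::catch_unwind;
use std::sync::Mutex;

fn main() {
    let m = Mutex::new(0_i32);

    // Poison the mutex by panicking while holding the guard.
    let _ = catch_unwind(|| {
        let _guard = m.lock().unwrap();
        panic!("poison the lock");
    });

    // The underlying pthread mutex (the inner Box) is untouched, so
    // locking still succeeds -- we just get an Err carrying the guard.
    match m.lock() {
        Ok(_) => unreachable!("lock was poisoned"),
        Err(poisoned) => println!("poisoned but intact: {}", *poisoned.into_inner()),
    }
}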
Any hints? For the Go program, all C calls are done with runtime.LockOSThread in effect, to stop thread migrations from confusing everything, but it's quite concerning that with a few hundred threads I start hitting pretty bad-looking corruption bugs.