I'm quite lost after trying to debug this issue for the best part of a day and making no real progress. The code for this is on GitHub. The core issue is that I'm exposing a struct like the following to a C caller (which happens to be a Go program):
use std::collections::HashMap;
use std::sync::{Mutex, RwLock};
use std::thread::ThreadId;

pub struct CPointer<T> {
    inner: RwLock<Option<T>>,
    last_error: Mutex<HashMap<ThreadId, Option<Error>>>, // Error is this crate's error type
}
(last_error is effectively an errno-like value associated with each object, with one error per thread.)
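
For context, the error stash is written and read along these lines (a minimal sketch; the helper names set_error and take_error are hypothetical, not necessarily what's in the repo):

impl<T> CPointer<T> {
    // Stash an error for the calling thread (hypothetical helper).
    fn set_error(&self, err: Error) {
        self.last_error
            .lock()
            .expect("last_error lock")
            .insert(std::thread::current().id(), Some(err));
    }

    // Take (and clear) the calling thread's last error (hypothetical helper).
    fn take_error(&self) -> Option<Error> {
        self.last_error
            .lock()
            .expect("last_error lock")
            .remove(&std::thread::current().id())
            .and_then(|err| err) // collapse Option<Option<Error>>
    }
}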
The issue is that (with enough goroutines) I can trigger a segfault when the code attempts to do self.last_error.lock(). This Rust code is exposed and used through a C FFI API, so there's plenty of unsafe code, but I can't figure out which Rust safety rule I've violated to trigger the segfault (I'm really not doing anything too crazy). If you run the program under a debugger you get some very odd results:
(lldb) bt
* thread #1, name = 'cat', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
* frame #0: 0x00007ffff7e96840 libpthread.so.0`__GI___pthread_mutex_lock
frame #1: 0x00007ffff7eeafaa libpathrs.so.0`std::sys::unix::mutex::Mutex::lock::h90bdeb011d87ab3f(self=0xf871090953639369) at mutex.rs:55:16
frame #2: 0x00007ffff7f133a1 libpathrs.so.0`std::sys_common::mutex::Mutex::raw_lock::h3c4c580d9c7b8a98(self=0xf871090953639369) at mutex.rs:36:36
frame #3: 0x00007ffff7ef75a5 libpathrs.so.0`std::sync::mutex::Mutex$LT$T$GT$::lock::h8d609f61f2692a57(self=0x00000000005a9088) at mutex.rs:220:12
frame #4: 0x00007ffff7ee3f6f libpathrs.so.0`pathrs::capi::utils::CPointer$LT$T$GT$::do_wrap_err::hcdb26d4b0b6604d9(self=0x00000000005a9060, c_error=-1, func=closure-0 @ 0x00007fffffffd7b0) at utils.rs:195:8
frame #5: 0x00007ffff7ee75da libpathrs.so.0`pathrs::capi::utils::CPointer$LT$T$GT$::take_wrap_err::hf79fe63c5a065fa7(self=0x00000000005a9060, c_error=-1, func=closure-1 @ 0x00007fffffffdb68) at utils.rs:225:8
frame #6: 0x00007ffff7f20258 libpathrs.so.0`pathrs_into_fd(ptr_type=PATHRS_HANDLE, ptr=0x00000000005a9060) at transmute.rs:133:12
frame #7: 0x00000000004a29c4 cat`_cgo_84a3de375210_Cfunc_pathrs_into_fd + 53
frame #8: 0x0000000000452ed0 cat`runtime.asmcgocall at asm_amd64.s:637
frame #9: 0x0000000000450153 cat`runtime.newdefer.func2 at panic.go:242
frame #10: 0x00000000004516e6 cat`runtime.systemstack at asm_amd64.s:351
frame #11: 0x000000000042e290 cat at proc.go:1146
frame #12: 0x000000000049f27d cat`github.com/openSUSE/libpathrs/contrib/bindings/go/pathrs._Cfunc_pathrs_into_fd at _cgo_gotypes.go:232
frame #13: 0x00000000004a1091 cat`github.com/openSUSE/libpathrs/contrib/bindings/go/pathrs.(*Handle).IntoRaw at pathrs.go:403
frame #14: 0x00000000004a237b cat`main.main.func1 at cat.go:118
frame #15: 0x0000000000453751 cat`runtime.goexit at asm_amd64.s:1333
Huh, 0xf871090953639369 is a very odd pointer for a structure (and it changes on each run). Looking at the last frame where the self pointer still makes sense leads to even stranger results:
(lldb) up 3
frame #3: 0x00007ffff7ef75a5 libpathrs.so.0`std::sync::mutex::Mutex$LT$T$GT$::lock::h8d609f61f2692a57(self=0x00000000005a9088) at mutex.rs:220:12
(lldb) print *self
(std::sync::mutex::Mutex<std::collections::hash::map::HashMap<std::thread::ThreadId, core::option::Option<pathrs::error::Error>, std::collections::hash::map::RandomState> >) $0 = {
inner = 0xf871090953639369
poison = {
failed = {
v = (value = '\x1a')
}
}
data = {
value = {
base = {
hash_builder = (k0 = 0, k1 = 140737353725968)
table = {
bucket_mask = 8
ctrl = (pointer = 0x0000000000000000)
data = {
pointer = 0x0000000000000000
}
growth_left = 0
items = 449
marker = {}
}
}
}
}
}
And this is where I've been stuck for a few hours at least. The inner pointer (which is a Box) has clearly been corrupted by something, and on top of that the AtomicBool contains a strange value each time I run this example (it's a "boolean", yet it holds some random byte value). Note that while this might technically count as a poisoned mutex, my understanding of the Mutex code in the standard library is that the inner Box of the Mutex isn't modified when the lock gets poisoned -- so something else is triggering this corruption. (It's not related to poisoning -- I checked by wrapping every self.last_error.lock() in catch_unwind() and nothing hit them.)
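
To sanity-check that assumption, here's a standalone snippet (not from the repo) showing that poisoning only flips the poison flag -- a poisoned Mutex still locks fine at the pthread level and just reports the poisoning through Err:

use std::panic::catch_unwind;
use std::sync::Mutex;

fn main() {
    let m = Mutex::new(0_i32);

    // Poison the mutex by panicking while holding the guard.
    let _ = catch_unwind(|| {
        let _guard = m.lock().unwrap();
        panic!("poison the lock");
    });

    // The underlying pthread mutex (the inner Box) is untouched, so
    // locking still succeeds -- we just get an Err carrying the guard.
    match m.lock() {
        Ok(_) => unreachable!("lock was poisoned"),
        Err(poisoned) => println!("poisoned but intact: {}", *poisoned.into_inner()),
    }
}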
Any hints? For the Go program, all C calls are done with runtime.LockOSThread in effect, to stop thread migrations from confusing everything, but it's quite concerning that with a few hundred threads I start hitting pretty bad-looking corruption bugs.