Thread::park_timeout deadlock caused by reentring call to thread mutex?


#1

I have some simple multithreaded code in which newly spawned threads are parked using thread::park_timeout (simplified version below):

while !self.shared_state.lock().unwrap().ready {
  thread::park_timeout(Duration::from_millis(100)) ;
}

The application will typically spawn several threads, all of which will end up parked in this loop. More often than not, these threads will never leave thread::park_timeout.

When looking at the thread’s stack traces, it looks like the threads are in a deadlock caused by thread::park_timeout's call to ConvVar::wait_timeout, which in turn calls sync::Once when fetching duration data. My understanding is as follows, for two threads A & B:

  1. A & B’s wait_timeout calls run in parallel and eventually reach Once::call_once
  2. The “selected” thread – let’s say A – gets to run its closure, while B gets parked. This latter call to thread::park is blocked on B’s inner mutex lock since this is a reentring call.
  3. Once A finishes running its code, the cleanup process tries to unpark waiting threads, thereby trying to access B’s inner mutex lock, which is still blocked, thus resulting in a deadlock.

Unless I misunderstood something, this looks like a standard lib bug. What do you guys think?

You’ll find below the typical stack traces I’m getting:

Thread locked in unpark

#0	0x000000011189bc22 in __psynch_mutexwait ()
#1	0x00000001118d0dfa in _pthread_mutex_lock_wait ()
#2	0x000000010f7f9d59 in std::sys::imp::mutex::{{impl}}::lock [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sys/unix/mutex.rs:67
#3	0x000000010f7f9d54 in std::sys_common::mutex::{{impl}}::lock [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sys_common/mutex.rs:40
#4	0x000000010f7f9d54 in std::sync::mutex::{{impl}}::lock<bool> [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sync/mutex.rs:222
#5	0x000000010f7f9d50 in std::thread::{{impl}}::unpark at /Users/travis/build/rust-lang/rust/src/libstd/thread/mod.rs:992
#6	0x000000010f816528 in std::sync::once::{{impl}}::drop at /Users/travis/build/rust-lang/rust/src/libstd/sync/once.rs:380
#7	0x000000010f8163fa in core::ptr::drop_in_place<std::sync::once::Finish> [inlined] at /Users/travis/build/rust-lang/rust/src/libcore/ptr.rs:61
#8	0x000000010f8163f5 in std::sync::once::{{impl}}::call_inner at /Users/travis/build/rust-lang/rust/src/libstd/sync/once.rs:310
#9	0x000000010f816aa3 in std::sync::once::{{impl}}::call_once<closure> [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sync/once.rs:227
#10	0x000000010f816a7e in std::sys::imp::time::inner::info [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sys/unix/time.rs:233
#11	0x000000010f816a7e in std::sys::imp::time::inner::{{impl}}::sub_instant [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sys/unix/time.rs:143
#12	0x000000010f816a7e in std::time::{{impl}}::duration_since [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/time/mod.rs:181
#13	0x000000010f816a7e in std::time::{{impl}}::sub [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/time/mod.rs:246
#14	0x000000010f816a7e in std::time::{{impl}}::elapsed at /Users/travis/build/rust-lang/rust/src/libstd/time/mod.rs:205
#15	0x000000010f81a025 in std::sys::imp::condvar::{{impl}}::wait_timeout at /Users/travis/build/rust-lang/rust/src/libstd/sys/unix/condvar.rs:160
#16	0x000000010f7f980b in std::sys_common::condvar::{{impl}}::wait_timeout [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sys_common/condvar.rs:61
#17	0x000000010f7f97fa in std::sync::condvar::{{impl}}::wait_timeout<bool> [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sync/condvar.rs:346
#18	0x000000010f7f97ca in std::thread::park_timeout at /Users/travis/build/rust-lang/rust/src/libstd/thread/mod.rs:839

Thread locked in park

#0	0x000000011189bc22 in __psynch_mutexwait ()
#1	0x00000001118d0dfa in _pthread_mutex_lock_wait ()
#2	0x000000010f7f948d in std::sys::imp::mutex::{{impl}}::lock [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sys/unix/mutex.rs:67
#3	0x000000010f7f9488 in std::sys_common::mutex::{{impl}}::lock [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sys_common/mutex.rs:40
#4	0x000000010f7f9488 in std::sync::mutex::{{impl}}::lock<bool> [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sync/mutex.rs:222
#5	0x000000010f7f9484 in std::thread::park at /Users/travis/build/rust-lang/rust/src/libstd/thread/mod.rs:766
#6	0x000000010f8162e5 in std::sync::once::{{impl}}::call_inner at /Users/travis/build/rust-lang/rust/src/libstd/sync/once.rs:341
#7	0x000000010f816aa3 in std::sync::once::{{impl}}::call_once<closure> [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sync/once.rs:227
#8	0x000000010f816a7e in std::sys::imp::time::inner::info [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sys/unix/time.rs:233
#9	0x000000010f816a7e in std::sys::imp::time::inner::{{impl}}::sub_instant [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sys/unix/time.rs:143
#10	0x000000010f816a7e in std::time::{{impl}}::duration_since [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/time/mod.rs:181
#11	0x000000010f816a7e in std::time::{{impl}}::sub [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/time/mod.rs:246
#12	0x000000010f816a7e in std::time::{{impl}}::elapsed at /Users/travis/build/rust-lang/rust/src/libstd/time/mod.rs:205
#13	0x000000010f81a025 in std::sys::imp::condvar::{{impl}}::wait_timeout at /Users/travis/build/rust-lang/rust/src/libstd/sys/unix/condvar.rs:160
#14	0x000000010f7f980b in std::sys_common::condvar::{{impl}}::wait_timeout [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sys_common/condvar.rs:61
#15	0x000000010f7f97fa in std::sync::condvar::{{impl}}::wait_timeout<bool> [inlined] at /Users/travis/build/rust-lang/rust/src/libstd/sync/condvar.rs:346
#16	0x000000010f7f97ca in std::thread::park_timeout at /Users/travis/build/rust-lang/rust/src/libstd/thread/mod.rs:839

#2

The two stacks you show; neither owns the lock and both are waiting to acquire it. You would need to show a thread that owns the lock and does not release it for there to be deadlock.


#3

@jonh, the stacks don’t show the owned locks, but why would they? There’s nothing in Rust that translates this notion in terms of method calls, as far as I understand the language.

As mentioned in my post, my assumption about the deadlock has more to do with the way the sync::Once code handles parking/unparking in conjunction with thread::park_timeout.


#4

I’ve tried to reproduce the problem: strangely enough, the following code works in the playground (i.e., threads get unmarked), not on my setup (MBP):

https://play.rust-lang.org/?gist=3e2eef0e9a60901cd42277d8b3989254&version=undefined

I’ve tested this code on yesterday’s nightly an old nightly from back in March, as well as on a stable from July 17th, which all display the following:

About to park thread
About to park thread
Done

… apparently confirming that they never get pulled out of park_timeout.


#5

Perhaps your MBP kernel scheduler doesn’t schedule the threads in the same (timely) manner as whatever the playground uses. The code is inherently racy by virtue of using sleeps to “synchronize” the threads.

Curious why you’re using park_timeout without the unpark calls? Also unclear why a Mutex<bool> is used instead of AtomicBool? If you’re going to use a Mutex then a Condvar for signaling (instead of timed parking) seems better. Or maybe I misunderstood something.

What are you trying to achieve in general?


#6

I’ve tested the code on a recent(ish) Linux distro and the bug couldn’t be reproduced, so the OSX scheduler is indeed probably involved in the problem. For what it’s worth, the bug also appears on iOS 9.x & 10.x, but that shouldn’t be a surprise to anyone.

Regarding the unpark calls, I’ve left them out of the sample but the actual code effectively includes a thread unparking the threads once the ready state is set. Likewise, the Mutex<bool> also comes from the actual code, in which the mutex is gating the access to a struct state with a boolean attribute for the ready state. The calls to sleep are just a hack to reproduce more systematically the conditions in which the bug appears.

I’ve tested a variant using a Condvar bound to the existing mutex instead of park_timeout, which does solve the problem: thanks @vitalyd for the tip! Looking again at park_timeout's implementation, it really looks like the deadlock was coming from the fact that this method uses the thread’s inner mutex and condvar, the former being also used by sync::Once to gate the execution of its closure.

To give some additional context: I’ve come across this bug while implementing a Rust client for NSLogger. Basically the the logger delegates all the heavy lifting (e.g., building a binary log message and sending it to the log viewer through a tcp connection) to a dedicated worker thread. This worker is initialised upon the first call to log: while the worker thread is being set up, all the logging threads are put to sleep using park_timeout, and the worker thread wakes them up (using unpark) once it’s in a state to handle the calls.


#7

Glad the Condvar solution works for you. That’s probably the more appropriate mechanism for your case.

As for the deadlock, I’ve not looked at the code you pointed out in detail. Are you able to reproduce that with some relatively minimal example code? You can file a github issue against rust and have someone from the stdlib team look at it. Although unless the bug is obvious from looking at the code, they may want to see a repro as well. But, it may not hurt to have a discussion around it anyway. If you do create a github issue, please post the link here :slight_smile:.