We received a bug report in the bytes repo about thread sanitizer detecting a race. Initially this was just because we used fences which TSAN didn't support, but replacing the fence with an aquire load (like what Arc does) did not make TSAN happy. Using AcqRel for the fetch_sub
does make TSAN happy.
That is, TSAN does not like when we do this:
unsafe fn release_shared(ptr: *mut Shared) {
// Drop ref-count.
if (*ptr).ref_cnt.fetch_sub(1, Ordering::Release) != 1 {
return;
}
// TSAN replacement for fence.
(*ptr).ref_count.load(Ordering::Acquire);
// Deallocate memory.
Box::from_raw(ptr);
}
But TSAN is happy with this:
unsafe fn release_shared(ptr: *mut Shared) {
// Drop ref-count.
if (*ptr).ref_cnt.fetch_sub(1, Ordering::AcqRel) != 1 {
return;
}
// Deallocate memory.
Box::from_raw(ptr);
}
I tried thinking about what could possibly cause this, but I can only think of situations where both are correct, and situations where both are wrong. I don't see how only one of them can be correct.
I wrote a small test application to compare the two approaches, which you can find in the collapsed block below. However, TSAN is perfectly happy with both approaches in the example. (And is not happy if you remove the fence without making the fetch_sub
use AcqRel
.)
Click to expand!
use std::cell::UnsafeCell;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
struct RefcountTest {
counter: AtomicUsize,
value: UnsafeCell<u64>,
}
unsafe impl Send for RefcountTest {}
unsafe impl Sync for RefcountTest {}
impl RefcountTest {
fn read_value(&self) -> u64 {
let ptr = self.value.get();
unsafe { *ptr }
}
fn do_test(&self) {
if self.counter.fetch_sub(1, Ordering::Release) == 1 {
// TSAN fence
self.counter.load(Ordering::Acquire);
// Ref-count has dropped to zero, so all other users of the value
// are gone.
//
// If thread-sanitizer doesn't understand the fence, then this
// write should be a data race.
let ptr = self.value.get();
unsafe {
*ptr = 20;
}
}
}
}
const THREADS: usize = 10;
fn main() {
let test = Arc::new(RefcountTest {
counter: AtomicUsize::new(THREADS),
value: UnsafeCell::new(10),
});
let mut handles = Vec::with_capacity(THREADS);
for _ in 0..THREADS {
let test = test.clone();
let join = std::thread::spawn(move || {
let value = test.read_value();
test.do_test();
value
});
handles.push(join);
}
let mut values = Vec::with_capacity(THREADS);
for handle in handles {
values.push(handle.join().unwrap());
}
println!("Thread values are: {:?}", values);
println!("At the end, value is {}", test.read_value());
}
You can use TSAN like this:
export RUSTFLAGS=-Zsanitizer=thread RUSTDOCFLAGS=-Zsanitizer=thread
cargo +nightly run -Zbuild-std --target x86_64-unknown-linux-gnu
Do you have any idea what situation could cause one of the above snippets to be a data race, without making both be a data race?
The bytes repo issue is #540 and my testing PR is #541. See also the boost documentation on implementing reference counting.