Ref-counting and thread sanitizer

We received a bug report in the bytes repo about thread sanitizer detecting a race. Initially this was just because we used fences which TSAN didn't support, but replacing the fence with an aquire load (like what Arc does) did not make TSAN happy. Using AcqRel for the fetch_sub does make TSAN happy.

That is, TSAN does not like when we do this:

unsafe fn release_shared(ptr: *mut Shared) {
    // Drop ref-count.
    if (*ptr).ref_cnt.fetch_sub(1, Ordering::Release) != 1 {
        return;
    }

    // TSAN replacement for fence.
    (*ptr).ref_count.load(Ordering::Acquire);

    // Deallocate memory.
    Box::from_raw(ptr);
}

But TSAN is happy with this:

unsafe fn release_shared(ptr: *mut Shared) {
    // Drop ref-count.
    if (*ptr).ref_cnt.fetch_sub(1, Ordering::AcqRel) != 1 {
        return;
    }

    // Deallocate memory.
    Box::from_raw(ptr);
}

I tried thinking about what could possibly cause this, but I can only think of situations where both are correct, and situations where both are wrong. I don't see how only one of them can be correct.

I wrote a small test application to compare the two approaches, which you can find in the collapsed block below. However, TSAN is perfectly happy with both approaches in the example. (And is not happy if you remove the fence without making the fetch_sub use AcqRel.)

Click to expand!
use std::cell::UnsafeCell;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

struct RefcountTest {
    counter: AtomicUsize,
    value: UnsafeCell<u64>,
}

unsafe impl Send for RefcountTest {}
unsafe impl Sync for RefcountTest {}

impl RefcountTest {
    fn read_value(&self) -> u64 {
        let ptr = self.value.get();
        unsafe { *ptr }
    }

    fn do_test(&self) {
        if self.counter.fetch_sub(1, Ordering::Release) == 1 {

            // TSAN fence
            self.counter.load(Ordering::Acquire);

            // Ref-count has dropped to zero, so all other users of the value
            // are gone.
            //
            // If thread-sanitizer doesn't understand the fence, then this
            // write should be a data race.
            let ptr = self.value.get();
            unsafe {
                *ptr = 20;
            }
        }
    }
}

const THREADS: usize = 10;

fn main() {
    let test = Arc::new(RefcountTest {
        counter: AtomicUsize::new(THREADS),
        value: UnsafeCell::new(10),
    });

    let mut handles = Vec::with_capacity(THREADS);
    for _ in 0..THREADS {
        let test = test.clone();
        let join = std::thread::spawn(move || {
            let value = test.read_value();
            test.do_test();
            value
        });
        handles.push(join);
    }

    let mut values = Vec::with_capacity(THREADS);
    for handle in handles {
        values.push(handle.join().unwrap());
    }

    println!("Thread values are: {:?}", values);
    println!("At the end, value is {}", test.read_value());
}

You can use TSAN like this:

export RUSTFLAGS=-Zsanitizer=thread RUSTDOCFLAGS=-Zsanitizer=thread
cargo +nightly run -Zbuild-std --target x86_64-unknown-linux-gnu

Do you have any idea what situation could cause one of the above snippets to be a data race, without making both be a data race?

The bytes repo issue is #540 and my testing PR is #541. See also the boost documentation on implementing reference counting.

1 Like

Maybe just TSAN getting it wrong.
Giving a small test that triggers it using just bytes crate (without newsletter) would be better to reason about.
(I assume ref_count is typo from old code.)

No.. apparently not. Does #[cfg(thread = "sanitize")] not actually work?

Aha. It's #[cfg(sanitize = "thread")]. And that does trigger the unstable warning when I used it.

I guess I never actually ran the code I thought I ran.

2 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.