Are Weaker Memory Ordering Constraints Possible?

I made a custom Arc which is not cloned but copied (via atomic loads), i.e. I have to manually increment and decrement the count. This led me to a design that drops the inner Box when the count reaches exactly 0 (the count can go negative). All atomic operations on the count are sequentially consistent, because I think using any weaker memory ordering could lead either to memory leaks or, worse, use-after-free. If I'm making wrong assumptions about the memory ordering requirements, or the design could be changed to allow weaker orderings, please let me know!

(Github)

use crate::std;

#[repr(align(128))]
struct Align128<T>(T);

impl<T> std::Deref for Align128<T> {
    type Target = T;

    fn deref(&self) -> &Self::Target {
        &self.0
    }
}

#[repr(align(128))]
struct RawArc<T> {
    count: Align128<std::AtomicIsize>,
    data: T,
}

pub struct Arc<T> {
    ptr: std::NonNull<RawArc<T>>,
    phantom: std::PhantomData<RawArc<T>>,
}

impl<T> Arc<T> {
    pub fn raw(data: T) -> *const () {
        std::Box::into_raw(std::Box::new(RawArc {
            count: Align128(std::AtomicIsize::new(0)),
            data,
        })) as _
    }

    /// Constructs an `Arc<T>` from a raw pointer.
    ///
    /// The raw pointer must have been previously returned by a call to
    /// [`Arc<U>::raw`][raw] where `U` must have the same size and
    /// alignment as `T`. This is trivially true if `U` is `T`.
    /// Note that if `U` is not `T` but has the same size and alignment, this is
    /// basically like transmuting references of different types. See
    /// `std::mem::transmute` for more information on what
    /// restrictions apply in this case.
    ///
    /// The user of `from_raw` has to make sure a specific value of `T` is only
    /// dropped once.
    ///
    /// This function is unsafe because improper use may lead to memory unsafety,
    /// even if the returned `Arc<T>` is never accessed.
    ///
    /// [raw]: struct.Arc.html#method.raw
    pub unsafe fn from_raw(ptr: *const ()) -> Self {
        Self {
            ptr: std::NonNull::new_unchecked(ptr as _),
            phantom: std::PhantomData,
        }
    }

    /// Initializes the reference count.
    ///
    /// Used for synchronization purposes and freeing the underlying memory.
    ///
    /// # Safety
    ///
    /// This method **must not** be called more than once. The provided count
    /// **must** be at least 1.
    pub unsafe fn init_count(&self, count: isize) {
        {
            use std::panic;
            std::debug_assert!(count >= 1);
        }

        self.ptr.as_ref().count.fetch_add(count, std::SeqCst);
    }
}

impl<T> Arc<T>
where
    T: std::Copy,
{
    pub unsafe fn data_from_raw(ptr: *const ()) -> T {
        (*(ptr as *const RawArc<T>)).data
    }
}

impl<T> std::Drop for Arc<T> {
    /// Decrements the reference count of the inner `RawArc`. If the count
    /// reaches 0, the boxed `RawArc` is dropped.
    fn drop(&mut self) {
        let count = unsafe {
            self.ptr
                .as_ref()
                .count
                .fetch_sub(1, std::SeqCst)
                .checked_sub(1)
                .unwrap_or_else(|| std::unreachable_unchecked())
        };

        if 0 == count {
            // A count of **exactly** 0 implies exclusive access to the boxed
            // `RawArc` and ensures safe construction of the `Box` to drop it.
            std::drop(unsafe { std::Box::from_raw(self.ptr.as_ptr()) });
        }
    }
}

In terms of ordering, I believe you only need Relaxed when incrementing the reference count and Release when decrementing it.

To increment the reference count you're already holding a copy of the Arc, so there's no chance of another thread decrementing the count and freeing your object; you just need to make sure the increment happens atomically (i.e. ordering doesn't matter, but we don't want things like torn updates), which is exactly what Relaxed gives you.

You may also want to check out the comments around impl Drop for Arc<T>. They do a really good job of explaining the ordering logic.

Getting the atomics correct in Arc is extremely subtle -- they were even incorrect in std for many years, until work was done to try to formally prove them, which found the problem:

What platform are you running on? Are you sure that a weaker one would even help performance?


I don't know a single platform where SeqCst is not slower than any of the weaker memory orderings.

The operations themselves may take the same amount of time, but choosing different orderings can affect performance by preventing the compiler (or a processor doing out-of-order execution) from rearranging instructions.

I think you need Release and Acquire operations in pairs for them to have any effect beyond using Relaxed everywhere.

The Acquire/Release ordering is completely independent of SeqCst operations on the same memory location. If I remember correctly, Acquire only guarantees ordering wrt. Release operations and vice versa; Relaxed provides no ordering guarantees at all.


On the other hand, I’ve been very wrong about atomic behavior before; that could be the case here as well.

The cited Boost documentation in the drop code comments talks about release and acquire operations, but an acquire fence, as used in Rust's Arc, is not an acquire operation. :thinking:

If you go to the top of the sync.rs file, you'll see that acquire!() is just sugar for .load(Acquire), so it's definitely an atomic load operation. I'd wondered about that too, so I had to find the macro def.

It only uses load under some weird configuration (no idea what ThreadSanitizer is, maybe some testing framework?). Without that configuration, it just uses an acquire fence. I'm not entirely sure if it makes a difference in this specific case, though.


For now, I replaced both std::SeqCst with std::AcqRel, because it makes sense that, with a single atomic, acquire and release semantics can completely replace sequentially consistent semantics.

I'm still struggling with weakening the constraint of the fetch_add operation. It wouldn't synchronize-with a fetch_sub operation from a different thread, would it?

I imagine increasing the count in thread 1 might not be visible to thread 2, even if it happened earlier in time, i.e. the load part of fetch_sub could see the old value and store an updated value based on that old value, basically losing the update from thread 1.

Yes, my read of the Arc code is consistent with what's been said so far: the decrement needs a Release, and the actual drop of the inner value needs an Acquire. A single AcqRel decrement would take care of both of these and is likely faster; this is a potential improvement that might be suggested for Arc as well.

I did some experiments on Godbolt. On x64, all combinations of ordering constraints generate the same code: a lock sub qword ptr [rdi], 1. On aarch64, however, AcqRel generates ldaxr / subs / stlxr, while changing that to Release generates ldxr / subs / stlxr, and an Acquire fence after it generates a dmb ishld. I found a benchmarking paper that suggests the former code sequence is faster, though as always these decisions should be guided by empirical measurements. Incidentally, SeqCst generates the same code as AcqRel for this particular code.

I'll also take this opportunity to point out that many of the implementations I've seen of the Release method in custom COM objects get this wrong. I find it useful to think of COM objects as being very similar to Arc<UnsafeCell<T>>. I should probably file some bugs, but I consider the ecosystem around COM to still be evolving.

So my recommendation would be to use AcqRel, and I'll be doing this for my own COM implementation work going forward.


I don't think the increment really needs to synchronize-with anything. When modeling this, it might be more helpful to consider an increment/decrement pair, where the decrement by definition happens-after the increment. What's really important is that the decrements synchronize-with each other, or more precisely, that the deallocation that happens after a decrement is separated from any use of the contents by a release/acquire pair.

I believe it's also the case that if the contents are truly and deeply immutable, then the Acquire before deallocation is not actually necessary, but I haven't verified this. (Incidentally, a good middle ground between trying to language-lawyer the spec and using formal proofs is using a tool such as loom, or CDSChecker in the C/C++ world)

I'm very confused by your initial comment that the counts can go negative. I think that's absolutely not the case, and potentially a source of your confusion. The sequence of values held by the atomic itself is effectively sequentially consistent, even when Relaxed ordering is used. The ordering constraints are all about interactions between that value and the other parts of your code.


If you take a quick look at the linked repository in the arc_handle module, you'll see how it's possible to go into the negatives.

  1. The raw Arc pointer is binary-shifted by 7 to the right (the 128-byte alignment of RawArc means that the 7 least significant bits are always 0, so shifting by that amount retains all information) and transferred to an atomic. The 7 bits freed this way represent a ref count.
  2. Thread 1 loads the raw Arc handle and tries to increase the handle count by 1. Let's assume it succeeds.
  3. Thread 2 swaps out the raw Arc handle with another one.
  4. Thread 1 copies the data of the Arc via Arc::data_from_raw and tries to decrement the count in the raw handle, but fails, because the handle was swapped out and the raw Arc part has also changed. It proceeds to call Arc::from_raw on its pointer and then drops the Arc. The drop code decrements the count by 1. It is -1 now.
  5. Thread 2 calls Arc::from_raw, followed by Arc::init_count with the value 2 (the count from the handle plus 1 for itself). The count is 1 now. The Arc is dropped. The count is reduced to 0 and the Box with the RawArc is created and dropped.

Loom seems like a great tool. I'm just not sure how I can detect a possible memory leak or double-free, the latter without causing UB (which would invalidate the tests). Do I have to conditionally remove the Box::from_raw in Arc::drop when testing with Loom? If so, what would I have to replace it with to make it testable?


EDIT: It seems Loom has its own versions of alloc, dealloc and Layout. I'll try using those instead of Box. It's slightly less convenient to use, but nothing that'd stop me from using Loom to properly test my code. It's an easier approach to "proving" correctness than a formal proof and will hopefully help me gain some confidence in my concurrent code.

I'll report back once I have added the tests. It's probably going to be of interest to those who are also not absolutely certain about which memory ordering constraints are sound to use.

Loom just runs your code with all the possible atomic permutations. It's up to your code to panic if something has gone wrong. This applies to double-free and use-after-free too; it's up to your code to detect that situation.

The only tool that can properly detect memory leaks, double-frees and use-after-frees is Miri, but I haven't had much success with using it in multi-threaded environments. Although I think there have been some improvements in this area lately. I don't know if you can use both loom and Miri.

What are the custom alloc/dealloc functions for, if not to detect memory leaks and double-free?

…

Perhaps it's only useful for custom global allocators? Then I'm back where I was before: figuring out how to detect memory leaks and double-frees without having to let test code invade my Arc implementation. :slightly_frowning_face:

Loom's alloc module can help you detect memory leaks, but not the others.

I just looked at the source code:

(loom::alloc)

/// Allocate memory with the global allocator.
pub unsafe fn alloc(layout: Layout) -> *mut u8 {
    let ptr = std::alloc::alloc(layout);
    rt::alloc(ptr);
    ptr
}

/// Deallocate memory with the global allocator.
pub unsafe fn dealloc(ptr: *mut u8, layout: Layout) {
    rt::dealloc(ptr);
    std::alloc::dealloc(ptr, layout)
}

(loom::rt::alloc)

/// Track a raw allocation
pub(crate) fn alloc(ptr: *mut u8) {
    rt::execution(|execution| {
        let state = execution.objects.insert(State { is_dropped: false });

        let allocation = Allocation { state };

        let prev = execution.raw_allocations.insert(ptr as usize, allocation);
        assert!(prev.is_none(), "pointer already tracked");
    });
}

/// Track a raw deallocation
pub(crate) fn dealloc(ptr: *mut u8) {
    let allocation =
        rt::execution(
            |execution| match execution.raw_allocations.remove(&(ptr as usize)) {
                Some(allocation) => allocation,
                None => panic!("pointer not tracked"),
            },
        );

    // Drop outside of the `rt::execution` block
    drop(allocation);
}

This does look like it tracks both memory leaks and double-frees correctly, at first sight. :thinking:

It's possible I'm wrong of course.

To be fair, the documentation for those functions is non-existent. It's a literal copy of the standard documentation, without any additional information about what loom adds to the functionality.

Status report: Sadly, I got errors when trying to run the test.

  1. Loom doesn't have AtomicIsize (fixing that was a matter of adding one line and changing one other line)
  2. It doesn't support Atomic*::get_mut (this is the tough one; trying to fix it half-asleep doesn't really work; next try, tomorrow)