Lock-free sharing of data using AtomicPtr and Arc


#1

I have a lookup table that is currently being shared between many threads: lots of worker threads that read the table at high frequency, and a single thread that periodically updates the table. Currently I use an Arc<RwLock<HashMap<String, String>>> data structure, but I found that the use of RwLock has a significant performance impact. I would like to replace this data structure with a lock-free one.

The approach I am trying to implement is to use an AtomicPtr combined with Arc to store the lookup table. To update the table, I would clone the HashMap, update the copy, and then atomically swap the old copy for the new one. The use of Arc ensures that any workers which might still hold a reference to the old copy can continue using it.

I’m trying to put together a proof-of-concept implementation but am having trouble getting the AtomicPtr to work as expected. The code (with lots of debug statements interspersed) can be found here: https://gist.github.com/jhecking/996f6c3d6edc7da2b18ab044a4a1f687.

I have a couple of specific questions but am also looking for feedback about the approach in general.

  1. I noticed that my app behaves differently depending on whether I compile it in debug or release mode. In debug mode the program just terminates without an error message as soon as I try to dereference the atomic pointer for the first time. In release mode it runs almost all the way through but eventually also terminates silently (again without an error) once the writer thread is finished. What is different between debug mode and release mode that could cause this?

  2. When I call AtomicPtr::swap I always seem to get the same pointer back that I passed into the method. E.g. in my replace method, ptr and old_ptr always have the same value:

    pub fn replace(&self, t: T) {
        let mut arc = Arc::new(t);
        let ptr: *mut Arc<T> = &mut arc;
        mem::forget(arc);
        let old_ptr: *mut Arc<T> = self.data.swap(ptr, Ordering::SeqCst);
        let old_arc: &Arc<T> = unsafe { &*old_ptr };
        drop(old_arc);
    }

  3. Even though I can see that the old_arc has a strong ref count of 1 and I explicitly drop it, I never see the old values get dropped. Shouldn’t drop(Arc<T>) drop the value T if it was the last strong reference to the Arc? What am I missing?


#2

I added something similar to this to crossbeam a while back: https://docs.rs/crossbeam/0.2.10/crossbeam/sync/struct.ArcCell.html

Even if it doesn’t work for your use case, it should provide some help with the implementation.
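
For reference, usage looks roughly like this; a sketch against the 0.2 sync::ArcCell API linked above, with a placeholder table type and keys:

    extern crate crossbeam;

    use crossbeam::sync::ArcCell;
    use std::collections::HashMap;
    use std::sync::Arc;

    fn main() {
        // Shared cell holding the current version of the lookup table;
        // the outer Arc is what each worker thread would clone.
        let table = Arc::new(ArcCell::new(Arc::new(HashMap::<String, String>::new())));

        // Reader: grab a snapshot of the current table and use it freely.
        let snapshot = table.get();
        println!("entries: {}", snapshot.len());

        // Writer: clone the current table, update the copy, swap it in.
        // Old snapshots stay alive until the last reader drops them.
        let mut updated = (*table.get()).clone();
        updated.insert("key".to_string(), "value".to_string());
        let _old = table.set(Arc::new(updated));
    }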


#3

Seems pretty expensive - 2 atomic stores for each get()?


#4

One issue with the code is that you’re calling drop() on a reference to an Arc, which is basically a no-op; you need to call drop on the value itself. If you look at drop(), it just consumes its argument and does nothing else, so dropping a reference only throws away the reference, not the Arc behind it.
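
A tiny illustration of the difference:

    use std::sync::Arc;

    fn main() {
        let a = Arc::new(5);
        let r = &a;
        drop(r); // drops the &Arc<i32> reference only: a no-op, `a` is still alive
        println!("{}", a);
        drop(a); // consumes the Arc<i32> value: the strong count actually decreases here
    }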


#5

There is no such thing as an atomic swap for a structure such as HashMap.
Your code does swap the pointer, so the update is visible across threads, but that pointer can end up referring to effectively random memory, because the unsafe blocks remove the safety checks and there is no locking.

> any workers which might still hold a reference to the old copy can continue using it.

This is a general method for getting around locks slowing down code, so long as the old copy is good enough for the task your code is performing and making the copy does not take too long. The choice then is between cloning the data out from under the lock and transferring updates using MPSC.
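
The MPSC variant could look roughly like this (a sketch with made-up names; mpsc has a single consumer, so each worker would need its own channel, and a real worker would poll with try_recv() in its hot loop rather than blocking on recv()):

    use std::collections::HashMap;
    use std::sync::Arc;
    use std::sync::mpsc;
    use std::thread;

    type Table = HashMap<String, String>;

    fn main() {
        let (tx, rx) = mpsc::channel::<Arc<Table>>();

        let worker = thread::spawn(move || {
            // Keep a local snapshot and replace it whenever the updater sends a new one.
            let mut table: Arc<Table> = Arc::new(Table::new());
            while let Ok(update) = rx.recv() {
                table = update;
                let _ = table.get("some-key"); // reads never touch a lock
            }
            // recv() errors once the sender is gone, which ends the worker.
        });

        // Updater: build a fresh table and push it to the worker.
        let mut updated = Table::new();
        updated.insert("key".to_string(), "value".to_string());
        tx.send(Arc::new(updated)).unwrap();

        drop(tx); // close the channel so the worker exits
        worker.join().unwrap();
    }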

The crossbeam ArcCell is also unsafe code internally. It relies on the behaviour of particular hardware and compiler internals to access the Arc safely, and it does not give safe access to the Arc’s contents.


#6

In your code there are many places where you do something like:


        let mut arc = Arc::new(t);
        let ptr: *mut Arc<T> = &mut arc;
        // then store the ptr somewhere

What you’re doing here is taking a pointer to a stack-allocated temporary Arc! It’s definitely UB, because that pointer becomes dangling really fast, so I’m actually really surprised it works in release mode at all!

You should remember that the Arc itself is a pointer (or to be more precise, it contains something like *mut ArcInner<T>). I think what you were planning to do was to make this pointer atomic. You can achieve that by converting the Arc to a raw pointer and back (e.g. with Arc::into_raw/Arc::from_raw) and storing that pointer in the AtomicPtr, instead of putting the Arc itself behind a pointer. Although that would be hard to implement correctly too, because the ArcInner could be dropped before you clone the Arc.
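
To illustrate that direction (not something to copy as-is): a sketch with made-up names, using Arc::into_raw/Arc::from_raw with an AtomicPtr. It still has exactly the race described above: a reader that has loaded the old pointer but not yet cloned the Arc can be left dangling when the writer drops the last reference.

    use std::mem;
    use std::sync::Arc;
    use std::sync::atomic::{AtomicPtr, Ordering};

    struct AtomicTable<T> {
        // Really holds an Arc<T> that has been turned into a raw pointer.
        data: AtomicPtr<T>,
    }

    impl<T> AtomicTable<T> {
        fn new(t: T) -> Self {
            AtomicTable { data: AtomicPtr::new(Arc::into_raw(Arc::new(t)) as *mut T) }
        }

        fn replace(&self, t: T) {
            let new_ptr = Arc::into_raw(Arc::new(t)) as *mut T;
            let old_ptr = self.data.swap(new_ptr, Ordering::SeqCst);
            // Rebuild the old Arc so its strong count is decremented; the value
            // is freed once the last reader drops its clone. This is where the
            // race lives: a reader between load() and clone() can be left with
            // a pointer to a freed ArcInner.
            unsafe { drop(Arc::from_raw(old_ptr)) };
        }

        fn get(&self) -> Arc<T> {
            let ptr = self.data.load(Ordering::SeqCst);
            unsafe {
                // Rebuild the stored Arc, clone it for the caller, then forget
                // the rebuilt one so the stored strong count is unchanged.
                let stored = Arc::from_raw(ptr);
                let result = stored.clone();
                mem::forget(stored);
                result
            }
        }
    }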

When you said that RwLock slows you down, is it because the RwLock's locking is itself slow, or because you hold the write borrow for too long? Anyway, I’ve heard that an uncontended Mutex is faster than an RwLock, so maybe you should try Mutex<Arc<HashMap>>; just remember to lock the mutex only for cloning or replacing the Arc. Maybe RwLock<Arc<HashMap>> would also be faster? (To clarify, that’ll actually be Arc<Mutex<Arc<HashMap>>> or Arc<RwLock<Arc<HashMap>>>, since you’re sharing it between threads.)
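
A sketch of the Mutex<Arc<HashMap>> variant (made-up names), holding the lock only long enough to clone or replace the inner Arc:

    use std::collections::HashMap;
    use std::sync::{Arc, Mutex};

    type Table = HashMap<String, String>;

    fn main() {
        let shared: Arc<Mutex<Arc<Table>>> = Arc::new(Mutex::new(Arc::new(Table::new())));

        // Reader: lock just long enough to clone the Arc (cheap), then read freely.
        let snapshot: Arc<Table> = shared.lock().unwrap().clone();
        let _value = snapshot.get("some-key");

        // Writer: clone the current table under a short lock, modify the copy
        // outside the lock, then swap the new version in.
        let mut updated = (**shared.lock().unwrap()).clone();
        updated.insert("key".to_string(), "value".to_string());
        *shared.lock().unwrap() = Arc::new(updated);
    }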

It seems that @sfackler’s ArcCell is exactly what you’re trying to do as your AtomicValue. You should definitely check it out!


#7

In what way is it providing an unsafe API?


#8

With the right computer, a program as simple as this could trigger the assert.
The code allocates 5 in new memory, but there is nothing to sync that memory, so the 5 exists in the core's cache while main memory remains uninitialised and therefore random. ArcCell then makes the memory address accessible from the other thread (using an atomic).
Alternatively, the write of 5 could have made it to main memory, but the second core could already have stored the contents of that memory address in its cache prior to it being set.

    extern crate crossbeam;

    use crossbeam::sync::ArcCell;
    use std::sync::Arc;
    use std::thread;

    fn main() {
        let a1 = Arc::new(ArcCell::new(Arc::new(Box::new(5))));
        let a2 = a1.clone();

        thread::spawn(move || {
            loop {
                assert_eq!(**a1.set(Arc::new(Box::new(5))), 5);
            }
        });

        loop {
            assert_eq!(**a2.set(Arc::new(Box::new(5))), 5);
        }
    }

#9

Thanks, Steven! I will give ArcCell a try and see whether it performs better than a RwLock in my benchmarks.

Just a note: You might want to update the documentation link in the crossbeam crate, as https://crates.io/crates/crossbeam links to an older version of the documentation at http://aturon.github.io/crossbeam-doc/crossbeam/. I am already using your MsQueue for another use case but wasn’t even aware of ArcCell.


#10

How would things be different if this were an Arc<Mutex<i32>>? The atomic operations involved in cloning the outer Arc and adjusting the contents of the inner ArcCell should impose memory ordering constraints that the CPU has to respect.


#11

I expect Arc<AtomicUsize> would be safe. I’d need someone with better knowledge to confirm that Arc<Mutex<i32>> is safe to use.

The ordering constraints don’t force a regular variable that was used just before an atomic to get written to main memory, as opposed to being kept correct only in the core's cache.

Using volatile variables can be a safe alternative to atomics:
std::ptr::read_volatile, std::ptr::write_volatile.
There is no way for Rust to check that a program uses them safely, though.


#12

Can you expand on this? The ordering constraints on atomic ops are precisely there to specify how prior/future loads and stores are ordered with respect to the atomic operation; they don’t order just the memory behind the atomic (unless we’re talking about Relaxed ordering, but that’s not used in ArcCell). If you perform a releasing store, all prior stores are made visible (it needs to be paired up with an acquiring load). A SeqCst ordering is a full bidirectional fence around the atomic (i.e. no loads/stores can be reordered from before the fence to after it, nor from after it to before it).
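
To make that concrete, here is the standard release/acquire pairing (made-up names; the data is a Relaxed atomic only so the example stays in safe Rust, and it gets no ordering of its own):

    use std::sync::Arc;
    use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
    use std::thread;

    fn main() {
        let data = Arc::new(AtomicUsize::new(0));
        let ready = Arc::new(AtomicBool::new(false));

        let (data2, ready2) = (data.clone(), ready.clone());
        let producer = thread::spawn(move || {
            data2.store(42, Ordering::Relaxed);    // ordinary store, no ordering of its own
            ready2.store(true, Ordering::Release); // publishes everything written before it
        });

        while !ready.load(Ordering::Acquire) {}    // pairs with the Release store above
        // Once the Acquire load observes `true`, the earlier store to `data`
        // is guaranteed to be visible as well.
        assert_eq!(data.load(Ordering::Relaxed), 42);

        producer.join().unwrap();
    }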


#13

This is pushing the edges of my knowledge, so someone with more low-level experience (across multiple independent hardware platforms) could possibly give a better, more informative answer.

a = A+B
b = C+D
c = A+B+E

We write code as step 1, step 2, step 3. Both the compiler and the CPU are then free to reorder this, so long as the end result is the same. Above, b could be set first (or last), then a. There can be a speed-up from using the result of a to finish calculating c. A different computer/compiler could keep the steps as they are but remember A+B in a register to speed up c.

If a and c are regular memory and b is atomic, the choice of Ordering allows or prevents this reordering.

To program code the CPU cache is a black box: any regular load/store goes through it, and you don’t know whether the contents of the cache and the contents of main memory are the same. To synchronize, the code can issue two instructions, flush and invalidate. Locks such as Mutex and RwLock run these; they are also used when creating new threads and in mpsc (I expect, without checking).

An atomic does not go through the core's cache. The second core can access the atomic, but it can also have a cache independent from the first core's. The code will honour the instruction ordering but does nothing to synchronize anything that goes through the cache.


#14

I think that’s the misunderstanding: atomic operations, depending on their Ordering, are not only compiler fences but also CPU fences. The type of fence emitted, if any, depends on the target CPU (e.g. some CPUs may not need any fence for some orderings, such as x86, which doesn’t reorder loads with other loads nor stores with other stores, but can make a store/load appear to reorder due to store buffers). Anyway, the overarching point is that the Ordering imposes a memory-order model on the code, and the compiler then both respects it itself and ensures it’s respected by the target CPU it’s compiling for.


#15

I assume that’s std::sync::RwLock? parking_lot has another RwLock implementation. Its README claims:

> The numbers for RwLock vary depending on the number of reader and writer threads, but are almost always faster than the standard library RwLock, and even up to 50x faster in some cases.
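
For what it’s worth, the swap is mostly mechanical; something like this, assuming parking_lot has been added as a dependency (its lock methods don’t return a Result, so there is no unwrap()):

    extern crate parking_lot;

    use parking_lot::RwLock;
    use std::collections::HashMap;
    use std::sync::Arc;

    fn main() {
        let table = Arc::new(RwLock::new(HashMap::<String, String>::new()));

        // Writer: exclusive lock; no poisoning, so no unwrap().
        table.write().insert("key".to_string(), "value".to_string());

        // Reader: shared lock.
        let map = table.read();
        if let Some(v) = map.get("key") {
            println!("{}", v);
        }
    }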


#16

Thanks for the tip, @SimonSapin! I tried replacing the std::sync::{Mutex, RwLock} primitives with the equivalent implementations in the parking_lot library and saw up to a 10% performance improvement in my benchmarks for certain workloads.