We know that because Rc does not need to be thread safe and Arc needs to be thread safe, Arc needs to use concurrent primitives and thus is in theory more expensive.
However, I believe most modern CPUs have atomic cmp/swap, atomic inc, or atomic dec.
Thus, my question: in practice, how much more expensive is using Arc instead of Rc (in situations where Rc would suffice).
Another insight is that the atomics preventany instruction level parallelism, significantly limiting thebandwidth (up to 30x in comparison with simple writes),even if there are no dependencies between the successive operations
It seems that the primary slowdown is likely to be when attempting to perform several atomic operations in quick succession. If there is significant non-atomic work to be done between the atomic operations, the effect should be minimized. As with all things related to the performance of modern computers, however, it's important to test your particular application-- heuristics and rules of thumb will only get you so far.
Does it make sense to use the structure Rc<Arc> the n? Point being: within a thread, we have a shared Arc, when the thread no longer needs it, it drops to 0, calling the dec on Arc.
I guess one could quickly knock up a hundred line program that has function doing the same thing but with Rc and Arc. And cargo bench would very soon tell you something.
But what would it tell you? Would it have any relevance to any application you want write?
It is pretty much impossible to predict the performance of modern day processors, even without threads, what with their mutiple cache levels, pipelines, instruction reordering, parallel dispatch, branch predictors etc. Gone are the days of looking at your assembler code, looking up the number of clock each instruction took in the data sheet and hence knowing how long things would take.
Throw threads into the mix and it gets a thousand times worse.
My naive take on all this is that an atomic operation is akin to saying that the data in question is not in cache and at least requires a trip to main memory and back. So that is already massively slower than cache friendly code. That paper indicates a 30 times performance hit, which sounds pretty good to me, I presume a lot of that memory round trip is short-circuited in the cache coherence hardware or whatever they have in there.
The XY problem is that I am working on interning strings, which requires keeping maps
Vec<u8> -> usize
usize -> &Vec<u8>
Now, to free unused strings, we need to wrap the usize and have a drop handler which removes the string from the maps when no longer in use.
Question is: should this mapping be a global singleton (requiring using Arc for the 'StringHandle's) or some locally created object (requiring only using Rc for the 'StringHandle's).
This is the motivation behind the Arc vs Rc issue.
No idea but my gut tells me that if one is building a library/crate to implement some data structure it had better not impose threads on the user and hence Arc is not appropriate.
Rather depends what your intentions for all this are.
Just remember that "mutex" should more correctly be called "bottleneck". As someone famous in the C++ world put it, sorry I don't recall who.
Yes, both are free, because dereferencing the pointer does not require checking the reference count. The difference between Arc and Rc only comes into play when the reference count is updated (when cloned or dropped).
With a bit of generics you can have Rc and Arc working with the same code.
However I would advise against trying to remove entries when their refcount goes to zero. It gets quite complex and there is no substantial gain compared to providing a .clean() function that scans for unused entries. (Or if you are using a custom map, insert could overwrite unused slots.)
pub struct StringManager { // only construct one of these, globally
// ...
map: Arc<Mutex<HashMap<Vec<u8>, usize>>>
...
}
pub struct StringData(usize);
pub type StringId = Arc<StringData>
impl Drop for StringData {
fn drop(&mut self) {
... stuff to remove self from StringManager
}
}
So in this case, the StringManager (only invoked when constructing / dropping StringData, uses Arc<Mutex<...>>) but the StringId (which we clone / drop all the time) only uses Arc<StringData>
I have half tempted to make StringId be a Rc<Arc<StringData>> so that even most clone/drops are cheap.
It would be possible to write a smart pointer with similar behavior to Rc<Arc<T>> but without the double indirection. @TyOverby wrote a basic implementation of this here, though it is not really maintained or tested:
One way to look at Mutex<T> is that it provides interior mutability. If you don't need to mutate the value but still need thread-safe clones and drops, Arc<T> can be used alone.
From memory, the Rust compiler uses a thread-local cache when interning identifiers. If you are okay with StringHandle: !Send, then a thread-local would give you the ergonomics of global variables with the performance of non-atomic reference counting.
Also, depending on how frequently you clone the StringHandle the cost of bumping a reference count may not matter. Most of the time you'll pass the underlying &str to functions and only need a StringHandle when storing the string long-term.
As long as you avoid unnecessary cloning (which is good practice in general) I'm guessing you wouldn't notice a performance difference between Rc and Arc, anywat.
To answer the OP, I found a paper with survey about the reference counting overhead of the Swift language. It says on popular client side benchmarks, the atomic counter update itself takes 42% of the total execution time on average.
Keep in mind that in swift virtually everything is Arc-ed, so it puts heavy optimization on it like merging several counter updates within a scope into single atomic operation. Also the paper itself is about to optimize the reference counting so the researchers may have some bias that their domain is more important than it really is. But, after all, the number 42% is truly impressive.