Rc/RefCell vs Arc/Mutex performance

Playground: Rust Playground

Related topic: "How much more expensive is Arc vs Rc?" Lots of discussion there, but no answer.

So I wrote a simple test - lock, add 1 to a value, unlock.
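The test loop is roughly this shape (a sketch reconstructed from the description above, not the exact Playground code; names like `lock_test` and `rc_test` are my own):

```rust
use std::cell::RefCell;
use std::rc::Rc;
use std::sync::{Arc, Mutex};
use std::time::Instant;

/// Increment every counter once through its Mutex; returns usecs per lock.
fn lock_test(items: &[Arc<Mutex<u64>>]) -> f64 {
    let start = Instant::now();
    for item in items {
        *item.lock().unwrap() += 1; // lock, add 1, unlock when the guard drops
    }
    start.elapsed().as_secs_f64() * 1e6 / items.len() as f64
}

/// Increment every counter once through its RefCell; returns usecs per borrow.
fn rc_test(items: &[Rc<RefCell<u64>>]) -> f64 {
    let start = Instant::now();
    for item in items {
        *item.borrow_mut() += 1; // runtime borrow-flag check, add 1
    }
    start.elapsed().as_secs_f64() * 1e6 / items.len() as f64
}

fn main() {
    for n in [1_000, 10_000, 100_000, 1_000_000] {
        let locked: Vec<_> = (0..n).map(|_| Arc::new(Mutex::new(0u64))).collect();
        let counted: Vec<_> = (0..n).map(|_| Rc::new(RefCell::new(0u64))).collect();
        println!("Lock test for {n} items: {:.4} usecs per lock.", lock_test(&locked));
        println!("Rc test   for {n} items: {:.4} usecs per lock.", rc_test(&counted));
    }
}
```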

On Linux (Intel 64-bit), I get numbers like this when that's compiled in Release mode:

./locktiming
Lock test for 1000 items: 0.0001 secs.  0.0700 usecs per lock.
Rc test   for 1000 items: 0.0000 secs.  0.0070 usecs per lock.

Lock test for 10000 items: 0.0007 secs.  0.0690 usecs per lock.
Rc test   for 10000 items: 0.0000 secs.  0.0020 usecs per lock.

Lock test for 100000 items: 0.0018 secs.  0.0170 usecs per lock.
Rc test   for 100000 items: 0.0003 secs.  0.0020 usecs per lock.

Lock test for 1000000 items: 0.0163 secs.  0.0160 usecs per lock.
Rc test   for 1000000 items: 0.0058 secs.  0.0050 usecs per lock.

This is unexpected. Rc is about 10x faster for small amounts of data, but only 3x faster for large amounts too big for the cache. No wonder this is a controversial question.

I'd appreciate it if some people would try the example in Playground above compiled on their own machines, in Release mode, and post their numbers. I'd like to see Windows and Mac results. I've heard that Windows locking is much slower than Linux, but don't know.

Windows

Lock test for 1000 items: 0.0000 secs.  0.0300 usecs per lock.
Rc test   for 1000 items: 0.0000 secs.  0.0020 usecs per lock.

Lock test for 10000 items: 0.0004 secs.  0.0350 usecs per lock.
Rc test   for 10000 items: 0.0001 secs.  0.0080 usecs per lock.

Lock test for 100000 items: 0.0035 secs.  0.0350 usecs per lock.
Rc test   for 100000 items: 0.0012 secs.  0.0110 usecs per lock.

Lock test for 1000000 items: 0.0358 secs.  0.0350 usecs per lock.
Rc test   for 1000000 items: 0.0096 secs.  0.0090 usecs per lock.

The two tests are both single-threaded, but the gains in multi-threaded code will pay for the cost of the Mutex. :grinning: And there is no cost for Arc if you do not clone it.

Not to sound discouraging, but I didn't quite understand what you're trying to measure here.
AFAIK, the only difference between Arc and Rc is that Arc uses an atomic integer to keep the ref-count. That as such has nothing to do with the OS, but with the processor and its ISA: atomic instructions are a feature of the processor.
On the other hand, you are using a Mutex, and that involves locking, which is OS-dependent. Now, the way you have tested your Arc<Mutex>, there is never any mutex contention, since you take a lock, release it, and then move on to the next mutex. So I am guessing you are trying to measure how much time it takes to take a lock in the best case (zero contention).
Also, the ref-count of either Arc or Rc is never incremented after creation - for that you'd need to call Arc::clone or Rc::clone.
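To illustrate that point: the count is only touched by clone and drop, which you can observe with `Arc::strong_count` (a small sketch, not part of the original benchmark):

```rust
use std::sync::Arc;

fn main() {
    let a = Arc::new(42u32);
    assert_eq!(Arc::strong_count(&a), 1); // one owner after creation

    // Merely reading through the Arc never touches the reference count.
    let _value = *a;
    assert_eq!(Arc::strong_count(&a), 1);

    // Only clone (atomic increment) and drop (atomic decrement) change it.
    let b = Arc::clone(&a);
    assert_eq!(Arc::strong_count(&a), 2);
    drop(b);
    assert_eq!(Arc::strong_count(&a), 1);
}
```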
So I don't think you are actually measuring Rc vs Arc. Rather you seem to be measuring RefCell borrow time vs Mutex lock time.
Could you please clarify?

PS: As a side note, Jon Gjengset has a wonderful series on YouTube on the exact things you are trying to investigate (and other stuff). It is called "The Crust of Rust".
To be more specific: two of the videos in that series cover exactly the things you are looking into.

This is measuring exactly what I want to measure - how fast is it to lock a mutex in the no-contention case. I'm making something multi-threaded, and it involves a large number of objects with individual locks. Contention is possible but unlikely.

Another Windows result (Intel Core i7-6700HQ CPU @ 2.60 GHz):

Lock test for 1000 items: 0.0000 secs.  0.0170 usecs per lock.
Rc test   for 1000 items: 0.0000 secs.  0.0010 usecs per lock.

Lock test for 10000 items: 0.0002 secs.  0.0180 usecs per lock.
Rc test   for 10000 items: 0.0000 secs.  0.0030 usecs per lock.

Lock test for 100000 items: 0.0019 secs.  0.0190 usecs per lock.
Rc test   for 100000 items: 0.0007 secs.  0.0070 usecs per lock.

Lock test for 1000000 items: 0.0201 secs.  0.0200 usecs per lock.
Rc test   for 1000000 items: 0.0073 secs.  0.0070 usecs per lock.

Without thinking too carefully about it, this is what I would expect. Arc's cost is that it involves accessing main (correction: shared) memory rather than the local CPU cache. If you set the test up so that the items are not in the cache, Arc's disadvantage compared to Rc goes away. But I would need to think about it more!

So what about the "Arc vs Rc" question?

Need to think about that one. On ARM, you'd need a fence instruction to force cache coherency. Not sure if that's needed on x86. But yes, something has to force inter-processor consistency on the "locked" flag to prevent race conditions on the mutex. RefCell doesn't need that protection.
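The "locked" flag idea can be sketched as a toy spinlock (this is not how `std::sync::Mutex` is implemented, just an illustration of where the ordering constraints come in): the acquire/release orderings are what request the inter-processor consistency, and the compiler emits whatever barriers the target ISA needs.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// A toy "locked" flag, for illustration only.
struct SpinFlag {
    locked: AtomicBool,
}

impl SpinFlag {
    fn new() -> Self {
        SpinFlag { locked: AtomicBool::new(false) }
    }

    fn lock(&self) {
        // Acquire ordering: reads/writes after the lock cannot be
        // reordered before it, so we see the previous holder's writes.
        while self
            .locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
    }

    fn unlock(&self) {
        // Release ordering: our writes become visible before the flag clears.
        self.locked.store(false, Ordering::Release);
    }
}

fn main() {
    let flag = SpinFlag::new();
    flag.lock();
    assert!(flag.locked.load(Ordering::Relaxed));
    flag.unlock();
    assert!(!flag.locked.load(Ordering::Relaxed));
}
```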

Overall, I'm not too worried. 20ns for a lock is no problem. It's not like the bad old days of calling

pthread_mutex_lock()

in C. At one time I think that always made a system call on Windows.

I should have said "shared memory" rather than "main memory"; there can be (and I think usually is) a shared cache. As per this diagram: CPU cache - Wikipedia

You might also want to consider throwing parking_lot's Mutex into the comparison as another data point.

It would also be interesting whether RefCell gets optimized to the same performance as Cell in this simple example of incrementing a number. Similarly, an atomic integer would be a good comparison point, to see how well-optimized the parking_lot mutex is for this specific test case.
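The atomic-integer variant might look like this (a sketch following the same increment workload; `parking_lot` is left out since it is a third-party crate, and the function name `atomic_test` is my own):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

/// Increment each counter with one lock-free read-modify-write;
/// returns usecs per operation.
fn atomic_test(items: &[AtomicU64]) -> f64 {
    let start = Instant::now();
    for c in items {
        c.fetch_add(1, Ordering::Relaxed); // no lock, no unlock
    }
    start.elapsed().as_secs_f64() * 1e6 / items.len() as f64
}

fn main() {
    let counters: Vec<AtomicU64> =
        (0..1_000_000).map(|_| AtomicU64::new(0)).collect();
    println!("Atomic test: {:.4} usecs per fetch_add.", atomic_test(&counters));
}
```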

Finally, for a base comparison, having a simple owned Box<_> instead of Rc<RefCell<_>> could be measured, too.
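A sketch of that baseline (assumed setup, `box_test` is my own name): a plain `Box` with exclusive `&mut` access, so neither a borrow flag nor a lock word is touched at runtime.

```rust
use std::time::Instant;

/// Increment each boxed counter directly; returns usecs per increment.
fn box_test(items: &mut [Box<u64>]) -> f64 {
    let start = Instant::now();
    for item in items.iter_mut() {
        **item += 1; // exclusive access, checked at compile time, free at runtime
    }
    start.elapsed().as_secs_f64() * 1e6 / items.len() as f64
}

fn main() {
    let mut items: Vec<Box<u64>> = (0..1_000_000).map(|_| Box::new(0u64)).collect();
    println!("Box test: {:.4} usecs per increment.", box_test(&mut items));
}
```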
