Mutex overhead in realtime context

Hey everyone! I'm developing a real-time application targeting 120 FPS on iPad and Android tablets. I'm concerned about my current approach to state management and synchronization, particularly regarding performance implications.

My application is modular, with each module maintaining its own global state protected by a mutex. Every frame needs both read and write access to this state: updating it (primarily matrix calculations) and rendering based on the current state.

Here's a simplified example of my current implementation:

let mut editor_state = EDITOR_STATE.lock();
step(view, &mut editor_state);    // Writes
render(view, &editor_state);      // Reads

Additionally, user interactions (like camera movements) require mutex locks, and during gestures such as panning or zooming we lock even more frequently:

pub fn move_camera_to_home() {
    let mut state = EDITOR_STATE.lock();

    state
        .camera
        .position
        .tween(Vec2::ZERO, 2., curves::f32::smoother_step);
}

This works fine and I haven't noticed any performance issues so far, but I'm wondering if this is going to cause problems hitting my target frame rate in the future.

I've been thinking about switching to RwLock (parking_lot) instead of Mutex, and maybe taking only read access during rendering while storing the calculated matrices in a separate rendering state that doesn't require any synchronization. I'd love to hear if anyone has dealt with similar situations or has suggestions for better ways to handle this kind of state management in a real-time context.

Thanks in advance!

This might be tangential to your question, but since you don't mention threading, I took the conservative interpretation and assumed the locks have minimal contention. "Modular" doesn't describe anything about the architecture other than how you chose to organize code and state.

Broadly speaking, locking is inexpensive when not under contention. When only a single thread accesses the lock, the overhead is minimal. To show what I mean, I whipped up a silly benchmark to compare lock implementations (take this with an absolutely massive grain of salt):

Simple benchmark for Mutex overhead
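
For reference, a criterion harness in this spirit could look something like the sketch below. This is a paraphrase for illustration rather than the linked code, and the spin and RwLock variants follow the same pattern:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_locks(c: &mut Criterion) {
    // Baseline: increment a value with no lock at all.
    let mut group = c.benchmark_group("WithoutLock");
    group.bench_function("no lock", |b| {
        let mut n = 0u64;
        b.iter(|| {
            n = n.wrapping_add(1);
            black_box(n)
        });
    });
    group.finish();

    // The same increment through an uncontended std mutex.
    let mut group = c.benchmark_group("WithLock");
    group.bench_function("std::sync::Mutex", |b| {
        let m = std::sync::Mutex::new(0u64);
        b.iter(|| {
            let mut n = m.lock().unwrap();
            *n = n.wrapping_add(1);
            black_box(*n)
        });
    });
    group.finish();

    // And through an uncontended parking_lot mutex.
    let mut group = c.benchmark_group("WithPLLock");
    group.bench_function("parking_lot::Mutex", |b| {
        let m = parking_lot::Mutex::new(0u64);
        b.iter(|| {
            let mut n = m.lock();
            *n = n.wrapping_add(1);
            black_box(*n)
        });
    });
    group.finish();
}

criterion_group!(benches, bench_locks);
criterion_main!(benches);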

The initial run on my machine looks like this:

WithoutLock/no lock     time:   [617.77 ps 618.66 ps 619.71 ps]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

WithLock/std::sync::Mutex
                        time:   [3.3921 ns 3.3970 ns 3.4027 ns]
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  9 (9.00%) high severe

WithPLLock/parking_lot::Mutex
                        time:   [3.7761 ns 3.7817 ns 3.7885 ns]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

WithSpinLock/spin::Mutex  time:   [2.5141 ns 2.5176 ns 2.5215 ns]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

WithRwLock/std::sync::RwLock
                        time:   [3.7974 ns 3.8041 ns 3.8125 ns]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

WithPLRwLock/parking_lot::RwLock
                        time:   [4.1135 ns 4.1198 ns 4.1262 ns]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

WithSpinRwLock/spin::RwLock
                        time:   [3.1815 ns 3.1876 ns 3.1938 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

What we can gather, at least on this particular hardware and OS configuration and with this specific single-threaded test setup, is:

  • Going without a lock is by far the fastest: about 4x faster than the fastest lock in the comparison.
  • spin locks are consistently the fastest implementation.
  • parking_lot locks are consistently the slowest implementation.
  • RwLock is consistently slower than Mutex, regardless of implementation.

Takeaways:

With a purely single-threaded workload (on my machine), the lock overhead is between 3 and 4 ns. That is going to be immeasurable compared to the rest of your application.

Adding threads (particularly, contention between threads over the same lock) will change the results drastically! The spin locks will almost certainly be the worst in this case.

Using an RwLock when you don't benefit from multiple concurrent readers will only hurt performance.

The truth is, I don't know what your specific environment or application looks like. You will have to profile the use of locks in situ to discover whether any performance is being left on the table. If what you have now is good enough, you might not want to bother going down this path of measuring and optimizing. Presumably you have more important or urgent things to do.

5 Likes

Note that the slowest of those times is 4.1 nanoseconds. Your frame time, 1/(120 FPS), is about 8.33 milliseconds, or 8,333,333 nanoseconds. If your application acquires even hundreds of uncontended locks during its state updates and rendering, this is still very likely an insignificant cost compared to the cost of your actual simulation and drawing algorithms. If there is contention — that is to say, if your update and rendering try to overlap in time and are prevented by the mutex — then it will be slower, but still, the fact of the overlap itself is likely much more significant than the cost of the mutex.

So, for this application, you should probably not worry about the cost of the lock. Worry about the big picture — how your update and rendering are scheduled, and how much state they share.


For example, one possible strategy is to make two copies of your application state, the simulation copy and the rendering copy. Then you have 3 tasks:

  1. Simulation copy is updated.
  2. Rendering copy is overwritten with the current contents of the simulation copy.
  3. Rendering copy is rendered to screen.

Then, task 1 and task 3 are able to execute in parallel, so the only “sequential bottleneck” is task 2, which only needs to copy data, not compute anything.
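
As a rough sketch, with made-up names (a placeholder AppState, a fixed frame count, and a plain std Mutex guarding only the rendering copy), that arrangement could look like this:

use std::sync::{Arc, Mutex};
use std::thread;

// Placeholder state type; stands in for the real editor state.
#[derive(Clone, Default)]
struct AppState {
    frame: u64,
}

fn main() {
    // The rendering copy is the only shared data; the simulation copy is
    // owned exclusively by the simulation thread.
    let render_copy = Arc::new(Mutex::new(AppState::default()));

    let sim = {
        let render_copy = Arc::clone(&render_copy);
        thread::spawn(move || {
            let mut sim_state = AppState::default();
            for frame in 0..600 {
                // Task 1: update the simulation copy, no lock held.
                sim_state.frame = frame;

                // Task 2: overwrite the rendering copy. The lock is held
                // only for the duration of this assignment.
                *render_copy.lock().unwrap() = sim_state.clone();
            }
        })
    };

    let render = {
        let render_copy = Arc::clone(&render_copy);
        thread::spawn(move || {
            for _ in 0..600 {
                // Task 3: render from the rendering copy. Holding the lock
                // here blocks task 2, but never task 1.
                let state = render_copy.lock().unwrap();
                println!("rendering frame {}", state.frame);
            }
        })
    };

    sim.join().unwrap();
    render.join().unwrap();
}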

It's also possible to implement this in a lock-free fashion by using 3 copies shuffled around using channels (1 copy is owned by the simulation task, 1 copy is owned by the rendering task, and 1 copy is being sent from one task to the other via one of two channels). That may or may not be worth doing in your application, but this sort of pipeline-oriented architecture can get you close to the maximum possible throughput (though not necessarily the lowest latency).
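
A minimal sketch of that channel-based variant, with the same placeholder state type, might be:

use std::sync::mpsc;
use std::thread;

// Placeholder state type; stands in for the real editor state.
#[derive(Clone, Default)]
struct AppState {
    frame: u64,
}

fn main() {
    // One channel carries filled buffers from simulation to rendering,
    // the other returns free buffers from rendering to simulation.
    let (filled_tx, filled_rx) = mpsc::channel::<Box<AppState>>();
    let (free_tx, free_rx) = mpsc::channel::<Box<AppState>>();

    // Seed the cycle with the third copy; the other two are owned by the
    // simulation and rendering threads respectively.
    free_tx.send(Box::new(AppState::default())).unwrap();

    let sim = thread::spawn(move || {
        let mut sim_state = AppState::default(); // copy #1: owned by simulation
        for frame in 0..600 {
            sim_state.frame = frame; // pretend simulation work

            // Wait for a free buffer, fill it, and send it to the renderer.
            let mut buf = free_rx.recv().unwrap();
            *buf = sim_state.clone();
            filled_tx.send(buf).unwrap();
        }
        // Dropping filled_tx here ends the rendering loop below.
    });

    let render = thread::spawn(move || {
        let mut render_state = Box::new(AppState::default()); // copy #2: owned by renderer
        while let Ok(next) = filled_rx.recv() {
            // Swap in the new buffer and return the old one for reuse.
            let previous = std::mem::replace(&mut render_state, next);
            let _ = free_tx.send(previous); // ignore the error once the simulation exits
            println!("rendering frame {}", render_state.frame);
        }
    });

    sim.join().unwrap();
    render.join().unwrap();
}

In a real application both loops would be paced by a frame clock rather than a fixed count, but the ownership story is the same: exactly one thread touches each copy at any given moment.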


This sort of architectural decision will have much more effect on performance than your choice of mutex, unless your mutexes are locked a great many times during a single simulation step.

6 Likes

As @kpreid wrote: In practical terms, it purely depends on how often you acquire a lock per frame. To get a sense of how frequently your code actually acquires locks per frame, you can build a wrapper around Mutex and count all calls to .lock() using a single global AtomicUsize. At the end of each frame, read the lock counter, save the value, and reset the counter. Then, load your measurements into your favorite spreadsheet or Jupyter Notebook to calculate some statistics and create detailed graphs.
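
A sketch of such a wrapper (assuming parking_lot, with illustrative names) might look like:

use std::sync::atomic::{AtomicUsize, Ordering};

// Number of lock acquisitions since the counter was last reset.
static LOCK_COUNT: AtomicUsize = AtomicUsize::new(0);

// Thin wrapper that counts every call to .lock().
pub struct CountingMutex<T>(parking_lot::Mutex<T>);

impl<T> CountingMutex<T> {
    pub const fn new(value: T) -> Self {
        Self(parking_lot::Mutex::new(value))
    }

    pub fn lock(&self) -> parking_lot::MutexGuard<'_, T> {
        LOCK_COUNT.fetch_add(1, Ordering::Relaxed);
        self.0.lock()
    }
}

// Call at the end of each frame: returns and resets the per-frame count.
pub fn locks_this_frame() -> usize {
    LOCK_COUNT.swap(0, Ordering::Relaxed)
}

With that in place, EDITOR_STATE could be a CountingMutex, and each frame you'd log the value returned by locks_this_frame().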

1 Like

Thanks a lot, it seems like I shouldn't worry about it! In my case I'm only using 1-2 locks per frame, and I'm doing all the updating and rendering in one thread. I only switch to multi-threading for certain heavy calculations. I believe the iOS UI runs on its own thread, so it might lock occasionally, but that's expected.