RwLock slow to acquire read lock even when uncontended

I have some code I'm trying to parallelize using threads with shared-state concurrency. Specifically, I'm trying to simulate diffusion-limited aggregation, but understanding the physics behind it in detail should not be necessary to follow my problem.

In short, I am spawning a number of threads that each do some work to generate a particle position. Each thread then needs to check for collisions with other particles using a central list of "fixed" particles. If there is a collision, the thread needs to mutate this central list.

The collision checking is the most expensive operation in the simulation, and collisions are relatively uncommon (they don't happen every timestep), so I thought to parallelize this by wrapping this shared state in a RwLock.

The code looks like:

let shared_state = RwLock::new(SharedState {
    cell_list,
    n_fixed: 0,
});

let mut fixed_particles: Vec<FixedParticle> = Vec::new();

thread::scope(|s| {
    let mut handles: Vec<_> = (0..args.n_threads-1)
        .map(|_| s.spawn(|| {
            let mut p = new_random_particle(u, n, &mut thread_rng(), args.timestep);
            let mut fixed_particles: Vec<FixedParticle> = Vec::new();

            loop {
                // Brownian motion
                integrator.integrate_a(&mut p);

                // Block on read access to shared state
                let rg = shared_state.read();

                if rg.cell_list.check_overlap(p.position) {
                    let glue_position = p.position;

                    // Randomise particle position
                    p = new_shell_particle(n, &mut thread_rng(), r1(rg.max_r_sq.sqrt()), seed.position);

                    // Release read lock and block on write lock:
                    drop(rg);
                    let mut wg = shared_state.write();

                    wg.cell_list.insert(glue_position);
                    wg.n_fixed += 1;

                    drop(wg);

                    fixed_particles.push(FixedParticle { position: glue_position, timestep: i });
                }
            }

            fixed_particles
        }))
        .collect();

    for handle in handles {
        fixed_particles.append(&mut handle.join().unwrap());
    }
});

However, I noticed that the performance barely scales with the number of threads. Moreover, profiling revealed that the program spends almost all its time acquiring the read lock.

This was surprising to me. For testing purposes, I completely commented out the part where the code mutates the shared state, removing the write lock altogether. Strangely, the behavior now is still completely the same.

So, even when the lock is 100% uncontended (as there are no writers whatsoever), the performance of acquiring a read lock on the RwLock is still really poor!

I tried this with std::sync::RwLock as well as with parking_lot::RwLock, but the results are basically the same.

Is this expected behavior, or am I doing something wrong here?

That the lock is uncontended is only partially true. Acquiring a read lock still needs to do an atomic write, which means the memory is actually contended.
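
If it helps to see why a "read-only" lock still writes to memory, here is a toy sketch (not the actual RwLock internals, just the shape of the problem): every read acquisition does a read-modify-write on one shared counter, so each core has to take exclusive ownership of the cache line holding that counter.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn main() {
    // Stand-in for the reader count inside a RwLock.
    let readers = AtomicUsize::new(0);

    thread::scope(|s| {
        for _ in 0..8 {
            s.spawn(|| {
                for _ in 0..1_000_000 {
                    // "acquire read lock": bump the shared reader count
                    readers.fetch_add(1, Ordering::Acquire);
                    // ...the actual (cheap) read work would go here...
                    // "release read lock": drop the count again
                    readers.fetch_sub(1, Ordering::Release);
                }
            });
        }
    });
}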

4 Likes

Right. To clarify, by "uncontended" I meant that there are no write locks at any point.

I understand that acquiring a read lock still involves atomic stores to the reader counter, but could that really explain my performance problem?

If so, is there a better way to do this?

Is check_overlap really small/cheap? If so, that might explain it. You could verify this by adding some arbitrary work to that function.
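
For example, something rough like this (dummy_work is a made-up busy loop; std::hint::black_box just stops the compiler from optimizing it away), dropped into the read-locked section, should show whether scaling improves once the critical section is fatter:

use std::hint::black_box;

fn dummy_work(iters: u64) -> u64 {
    let mut acc = 0u64;
    for i in 0..iters {
        acc = acc.wrapping_add(black_box(i).wrapping_mul(2_654_435_761));
    }
    acc
}

// Inside the loop, while the read guard rg is held:
//     rg.cell_list.check_overlap(p.position);
//     black_box(dummy_work(10_000)); // artificial extra work under the read lock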

You are effectively serializing all the work and making sure that only one core can ever make forward progress at a time. What did you expect to happen?

Just read about how MOESI (the cache-coherence protocol) works and you'll understand why what you wrote doesn't make much sense.

Another possibility is that you have too many threads that are contending on the same atomic (the RwLock). If there is no IO, it is counterproductive to have more threads than cores.
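
A minimal sketch of that advice, assuming `requested` stands in for something like args.n_threads (the names here are made up):

use std::thread;

// Cap the worker count at whatever the OS reports as available parallelism,
// so a too-large CLI value doesn't oversubscribe the cores.
fn effective_threads(requested: Option<usize>) -> usize {
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    requested.unwrap_or(cores).min(cores)
}

fn main() {
    println!("using {} threads", effective_threads(None));
}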

You can still only take the RwLock on one core at a time. And if you plan to scale to 100 cores, that means you have to, somehow, ensure that you do around 10,000 operations of work for every time the RwLock is taken.

Otherwise your cores would mostly play ping-pong with that one poor cache line instead of doing productive work.

You could introduce many RwLocks to ensure that the cores don't play that ping-pong game over just one cache line.

Or do more work under one RwLock.

Or pick any other strategy which wouldn't make the whole thing contended.
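
To make "do more work under one RwLock" concrete, here is a small self-contained sketch (the Vec<u64> payload, the thread count, and the batch size are all arbitrary): each thread acquires the read guard once per batch instead of once per item. Applied to your loop, the equivalent would be holding rg across many integrate/check iterations and only giving it up when you actually need the write lock.

use std::sync::RwLock;
use std::thread;

fn main() {
    let data = RwLock::new(vec![1u64; 1024]);

    let total: u64 = thread::scope(|s| {
        let handles: Vec<_> = (0..8)
            .map(|_| {
                s.spawn(|| {
                    let mut sum = 0u64;
                    for _ in 0..1_000 {
                        // One lock acquisition is amortized over 1024 reads.
                        let guard = data.read().unwrap();
                        for x in guard.iter() {
                            sum = sum.wrapping_add(*x);
                        }
                    }
                    sum
                })
            })
            .collect();

        handles.into_iter().map(|h| h.join().unwrap()).sum()
    });

    println!("total = {total}");
}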

It is the most expensive part of the program. But I suppose it could be too cheap compared to getting a read lock on the RwLock. (I didn't expect that to be such a significant issue)

What I would think would happen, is that multiple threads could read from the RwLock without much performance impact, and would only be hindered by a thread acquiring a write lock (which happens relatively uncommonly). It was my understanding that this is the purpose of RwLock: allowing multiple threads to access data concurrently while only allowing one writer. I didn't expect to have such a significant performance hit just from multiple threads acquiring read locks alone; I expected that to be pretty much free.

1 Like

I suppose that is the issue (multiple threads constantly contending on the atomic reader counter of the RwLock).

I use the same number of threads as I have cores, though.

Unlike a Mutex<T>, the point of RwLock<T> is to allow multiple readers concurrently, right? I understand that a write lock on a RwLock is still exclusive, but that only happens (relatively) sporadically in my program, and is not the source of my performance problem (as evidenced by the issue still occurring after commenting out the write lock portion of the program altogether).

Since the check for collisions is so cheap, do you really need multiple threads checking in parallel? Maybe a single thread is all you need.

But how would that happen? By magic? A CPU core still needs to inform all other cores that it's doing the read and increment the counter.

It does that by acquiring exclusive ownership of the cache line where that counter resides.

Which effectively serializes the whole thing.

Yes, that's true.

But how? What is supposed to happen "down below", at the level of the flip-flops that computers are made from?

Yes, physics is a bitch, but we have to deal with it: our devices still follow the rules of physics; they are not magical.

The sane solution here would be to have a RwLock per particle (or maybe per group of particles) and then acquire two of these (don't forget to order the two write locks consistently to avoid deadlocks).

But to start discussing these solutions, one needs to understand that the TANSTAAFL principle applies to computers just like it applies to everything else in our lives.
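
For the per-particle idea, here is a minimal sketch of the lock-ordering part (made-up types, shown with Mutex for brevity; the same index-ordering trick works for two RwLock write guards):

use std::sync::{Mutex, MutexGuard};

struct Particle {
    position: [f64; 3],
}

// Lock two particles without risking a deadlock: always take the lock with
// the lower index first, so two threads can never grab the same pair in
// opposite orders.
fn lock_pair<'a>(
    locks: &'a [Mutex<Particle>],
    i: usize,
    j: usize,
) -> (MutexGuard<'a, Particle>, MutexGuard<'a, Particle>) {
    assert_ne!(i, j);
    let (first, second) = if i < j { (i, j) } else { (j, i) };
    let a = locks[first].lock().unwrap();
    let b = locks[second].lock().unwrap();
    // Hand the guards back in the order the caller asked for.
    if i < j { (a, b) } else { (b, a) }
}

fn main() {
    let particles: Vec<Mutex<Particle>> =
        (0..4).map(|_| Mutex::new(Particle { position: [0.0; 3] })).collect();
    let (a, b) = lock_pair(&particles, 3, 1);
    println!("{:?} {:?}", a.position, b.position);
}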


Sure, but a thread does not hold that lock on the RwLock's internal read counter for as long as the RwLock's read lock is held, right? That would make no sense, as that would essentially make a RwLock completely (functionally) equivalent to a Mutex.

In other words, if the work a thread does inside the section that requires a read lock (check_overlap() in my case) would be more expensive, it would still yield better performance than serial.

Yes. Precisely. It takes as much effort to take a RwLock<T> for reading as to take a normal Mutex, but since readers don't block each other, it may be a good thing to do if you do a lot of work under the lock. Acquisition of the lock, however, is still sequential.

Yes. Because your program creates insane contention on that poor single cache line holding the RwLock, everything else doesn't matter much.

Since a Mutex is cheaper than a RwLock, I suspect "one Mutex per particle" would be better than "one RwLock per particle".

Or, maybe even better, use generations, where the particles of step T+1 are independent from the old ones. Then you wouldn't need a per-particle lock at all.
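
If that fits the physics, a rough self-contained sketch of the generation idea (made-up types; the actual integration and collision work is elided): within a generation every thread only reads an immutable snapshot of the fixed particles, so no lock is needed at all, and the newly fixed particles are merged between generations.

use std::thread;

#[derive(Clone, Copy)]
struct FixedParticle {
    position: [f64; 3],
}

// One generation: every worker reads the same immutable snapshot (a plain
// shared slice, no synchronization needed) and collects the particles it
// managed to fix; results are merged only after all workers are done.
fn run_generation(snapshot: &[FixedParticle], n_threads: usize) -> Vec<FixedParticle> {
    thread::scope(|s| {
        let handles: Vec<_> = (0..n_threads)
            .map(|_| {
                s.spawn(|| {
                    let mut newly_fixed: Vec<FixedParticle> = Vec::new();
                    // ...integrate particles and check collisions against
                    // `snapshot` only (the real work is elided here)...
                    let _ = snapshot;
                    newly_fixed
                })
            })
            .collect();

        let mut merged = Vec::new();
        for handle in handles {
            merged.append(&mut handle.join().unwrap());
        }
        merged
    })
}

fn main() {
    let mut fixed = vec![FixedParticle { position: [0.0; 3] }];
    for _generation in 0..3 {
        let new = run_generation(&fixed, 4);
        fixed.extend(new); // the snapshot only changes between generations
    }
    println!("{} fixed particles, first at {:?}", fixed.len(), fixed[0].position);
}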

There are many ways to solve the issue, but "let's add one RwLock and suddenly our code becomes parallelizable" just doesn't work.

Why do you think CPU makers spend so much effort trying to speed up one single core? Hundreds of millions of transistors instead of tens of thousands, for a measly 25x speedup, when you could instead make 10,000 cores for a 10,000x speedup?

Precisely because of physics, and because of the effects you are observing. Information travels slowly between cores; there is no way to avoid that!

No, of course not.

Yes, but not if you assume that the read lock is free. A RwLock is more expensive than a normal Mutex even if you only do read locking (and even more expensive if you do write locking). Its only redeeming quality is precisely its ability to let multiple readers work simultaneously. And if you only do a small amount of cheap work under an insanely contended RwLock… it just won't work.

A few things I can think of:

  • change your algorithm to take the read lock on larger batches. Since you say that collisions are rare, you might not even have to do that for the write guard
    • a variant thereof: see if your problem is amenable to partitioning (tiles? volumes?), e.g. if something like a maximum particle speed bounds the distance at which things can interact per time step
  • check the literature to see if people have better algorithms for your problem
  • put all your shared state in Atomic types and make sure you can handle concurrent modifications (see the sketch after this list)
    • maybe combined with stamped locks, which are cheaper than RwLocks but much harder to use correctly
  • consider parallelizing along a different axis (e.g. experimental parameter permutations or something like that)
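
To make the Atomic-types bullet a bit more concrete, a minimal sketch (the Vec<[f64; 3]> cell list is just a stand-in for the real type): a counter like n_fixed can live in an AtomicUsize next to the lock, so bumping it never needs the write guard.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::RwLock;

struct SharedState {
    cell_list: RwLock<Vec<[f64; 3]>>, // stand-in for the real cell list
    n_fixed: AtomicUsize,             // no longer guarded by the RwLock
}

fn record_fixed(state: &SharedState, position: [f64; 3]) {
    state.cell_list.write().unwrap().push(position);
    // The counter is updated with a single atomic instruction, outside the lock.
    state.n_fixed.fetch_add(1, Ordering::Relaxed);
}

fn main() {
    let state = SharedState {
        cell_list: RwLock::new(Vec::new()),
        n_fixed: AtomicUsize::new(0),
    };
    record_fixed(&state, [1.0, 2.0, 3.0]);
    println!("n_fixed = {}", state.n_fixed.load(Ordering::Relaxed));
}
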
3 Likes

Unrelated nit: to get n_threads, don't subtract one. The upper bound is exclusive unless you use ..=
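
In other words, the half-open range already yields exactly n items:

fn main() {
    assert_eq!((0..4).count(), 4);  // 0, 1, 2, 3 - already four items
    assert_eq!((0..=4).count(), 5); // 0, 1, 2, 3, 4 - the inclusive form
}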

1 Like

Here's one idea to reduce read locks, at the cost of threads working off of somewhat stale data.

(I don't know how bad that would be for your needs, but note that in your code, arbitrary changes can happen right after you drop the read guard and before you acquire the write guard - so it already has the potential to work off of stale data, at least in principle.)

The idea is to allow the threads to refresh their list of fixed particles only once in a while.

In order to allow the old lists to remain in memory even when the list is overwritten, store the list in a Mutex<Arc<Data>> instead of an RwLock<Data>.
Every once in a while, when your thread refreshes the list, it will lock the mutex, clone the Arc, and immediately drop the mutex guard to allow others to refresh / rewrite it.

When your thread wants to write to the list, create a new Arc of the new list, and write it into the mutex, replacing the previous Arc.

You'll have to decide how to best handle the fact that you're working with stale data. Maybe if you find a collision, you should check if the data has changed, or rerun it with the new data. Maybe you don't care.
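
A minimal sketch of that pattern (Data here is a made-up stand-in for the fixed-particle list): readers lock the Mutex only long enough to clone the Arc and then work on their own, possibly stale, snapshot; writers install a whole new Arc, and old snapshots stay alive until their last reader drops them.

use std::sync::{Arc, Mutex};

type Data = Vec<[f64; 3]>; // stand-in for the list of fixed particles

// Take a snapshot: the mutex is held only long enough to bump the refcount.
fn snapshot(shared: &Mutex<Arc<Data>>) -> Arc<Data> {
    let guard = shared.lock().unwrap();
    Arc::clone(&*guard)
}

// Publish a new version: existing snapshots keep the previous Arc alive.
fn publish(shared: &Mutex<Arc<Data>>, new_data: Data) {
    *shared.lock().unwrap() = Arc::new(new_data);
}

fn main() {
    let shared = Mutex::new(Arc::new(vec![[0.0; 3]]));
    let stale = snapshot(&shared);
    publish(&shared, vec![[0.0; 3], [1.0, 2.0, 3.0]]);
    println!("stale copy still has {} entries", stale.len());
}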

2 Likes