I have some code I'm trying to parallelize using threads and shared-state concurrency. Specifically, I'm simulating diffusion-limited aggregation, but understanding the physics behind it in detail shouldn't be necessary to follow my problem.
In short, I spawn a number of threads that each do some work to generate a particle position. Each thread then needs to check its particle for collisions against a central list of "fixed" particles. If there is a collision, the thread needs to mutate this central list.
The collision check is the most expensive operation in the simulation, and collisions are relatively uncommon (they don't happen every timestep), so I thought to parallelize this by wrapping the shared state in an RwLock.
The code looks like this (setup of the helpers is omitted):
let shared_state = RwLock::new(SharedState {
    cell_list,
    n_fixed: 0,
});
let mut fixed_particles: Vec<FixedParticle> = Vec::new();
thread::scope(|s| {
    let mut handles: Vec<_> = (0..args.n_threads - 1)
        .map(|_| s.spawn(|| {
            let mut p = new_random_particle(u, n, &mut thread_rng(), args.timestep);
            let mut fixed_particles: Vec<FixedParticle> = Vec::new();
            loop {
                // Brownian motion
                integrator.integrate_a(&mut p);
                // Block on read access to shared state
                let rg = shared_state.read();
                if rg.cell_list.check_overlap(p.position) {
                    let glue_position = p.position;
                    // Randomise particle position
                    p = new_shell_particle(n, &mut thread_rng(), r1(rg.max_r_sq.sqrt()), seed.position);
                    // Release read lock and block on write lock:
                    drop(rg);
                    let mut wg = shared_state.write();
                    wg.cell_list.insert(glue_position);
                    wg.n_fixed += 1;
                    drop(wg);
                    fixed_particles.push(FixedParticle { position: glue_position, timestep: i });
                }
                // (loop exit condition elided)
            }
            fixed_particles
        }))
        .collect();
    for handle in handles {
        fixed_particles.append(&mut handle.join().unwrap());
    }
});
However, I noticed that the performance barely scales with the number of threads. Moreover, profiling revealed that the program spends almost all of its time acquiring the read lock.
This surprised me. For testing purposes, I commented out the part where the code mutates the shared state, removing the write lock altogether. Strangely, the behavior is still exactly the same.
So, even when the lock is 100% uncontended (as there are no writers whatsoever), the performance of acquiring a read lock on the RwLock is still really poor!
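In case the simulation code obscures it, the access pattern boils down to something like the following self-contained sketch (names and numbers made up): every thread repeatedly acquires and releases a read lock on shared state that nobody ever writes to. This is only an illustration of the loop shape, not my actual benchmark.

use std::sync::RwLock;
use std::thread;
use std::time::Instant;

fn main() {
    // Shared, read-mostly state; nothing ever writes to it in this sketch.
    let shared = RwLock::new(vec![1u64; 1024]);
    let n_threads = 8;

    let start = Instant::now();
    thread::scope(|s| {
        let handles: Vec<_> = (0..n_threads)
            .map(|_| {
                s.spawn(|| {
                    let mut acc = 0u64;
                    for i in 0..5_000_000usize {
                        // A read lock is taken and released on every iteration,
                        // just like in the simulation loop.
                        let guard = shared.read().unwrap();
                        acc = acc.wrapping_add(guard[i % guard.len()]);
                    }
                    acc
                })
            })
            .collect();
        let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
        println!("checksum: {total}, elapsed: {:?}", start.elapsed());
    });
}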
I tried this with std::sync::RwLock as well as with parking_lot::RwLock, but the results are basically the same.
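For reference, the two acquisitions differ only in the unwrap() that std requires because of lock poisoning, roughly like this:

use parking_lot::RwLock as PlRwLock;
use std::sync::RwLock as StdRwLock;

fn read_with_std(lock: &StdRwLock<u64>) -> u64 {
    // std::sync::RwLock: read() returns a LockResult because the lock can be poisoned
    *lock.read().unwrap()
}

fn read_with_parking_lot(lock: &PlRwLock<u64>) -> u64 {
    // parking_lot::RwLock: no poisoning, read() returns the guard directly
    *lock.read()
}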
Is this expected behavior, or am I doing something wrong here?