How can I reduce the wait time for threads

I am looking for a better way to multithread some performance-critical code. Using locks on my data was too slow, so I made something that, instead of applying operations straight away, collects the operations that are performed on it. This way only immutable access to the data is needed while threading.

(This current implementation is much faster than all of the lock-based ones I tried, including DashMap and such.)

Those operations can be applied later, when we have mutable access. The problem with the way I do it now is that I have to wait for each thread to finish fill_operations, and each call of that function doesn't take the same amount of time, which means I am wasting loads of potential speed waiting for the scoped threads to join.

```rust
let mut operations = vec![vec![]; n_cpu];
loop {
    rayon::scope(|s| {
        // Shadow `data` with a shared reference so each `move` closure
        // copies the reference instead of moving `data` itself.
        let data = &data;
        for ops in operations.iter_mut() {
            s.spawn(move |_| {
                fill_operations(ops, data);
            });
        }
    });
    // All threads have joined here; apply the collected operations
    // with exclusive access and reuse the buffers.
    for ops in operations.iter_mut() {
        data.apply_operations(ops);
        ops.clear();
    }
    if exit_condition {
        break;
    }
}
```

I have heard of the left-right crate, which I think would be perfect for my use case except that it needs to hold two copies of my data, and my data is very big.

Is left-right my only choice for making this lockless? Any other suggestions on how this could be made faster would be very helpful.

What kind of data structure is `data`?

A custom struct that holds custom HashOnlyMaps, which don't store the keys because collisions are not a problem in my use case. The struct has fields of type `HashOnlyMap<[f32; N]>` for each N from 2 to 15. There are also two other, smaller fields in there, but those are not that important.

The way you index into it is by providing N and something that can be hashed. But it doesn't just give back a simple `&[f32; N]`; there is a custom Entry API for it.
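To illustrate the shape (just a simplified sketch, not my actual implementation; the names and the single-slot layout are made up for the example):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Simplified stand-in: a map that stores only values, never keys,
/// because hash collisions are acceptable in this use case.
struct HashOnlyMap<V> {
    slots: Vec<Option<V>>,
}

impl<V> HashOnlyMap<V> {
    fn with_capacity(n: usize) -> Self {
        Self { slots: (0..n).map(|_| None).collect() }
    }

    fn slot(&self, key: impl Hash) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        h.finish() as usize % self.slots.len()
    }

    /// Entry-style access: fill the slot on first use, then hand out `&mut V`.
    fn entry_or_insert_with(&mut self, key: impl Hash, default: impl FnOnce() -> V) -> &mut V {
        let i = self.slot(key);
        self.slots[i].get_or_insert_with(default)
    }
}

// e.g. for N = 3:
// let mut m: HashOnlyMap<[f32; 3]> = HashOnlyMap::with_capacity(1024);
// let entry = m.entry_or_insert_with("some state", || [0.0; 3]);
```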

So, as far as I understand, your fill_operations does a lot of reads on data and probably some computations and generates a small number of mutating operations to be applied? And you want to repeat this a lot of times and parallelize it, but apparently each call of fill_operations doesn’t necessarily need to see the modifications of the most recent “previous” calls.

And data consists of a handful of pretty large maps, right? Presumably each fill_operations call needs to read and creates write operations for many of the maps in data?

Just trying to understand the setting better, correct me if I got something wrong.

Yes exactly right.

Yes, most of them are pretty large. And yes, they do need to read from and write to multiple of the maps. Luckily, inserts are somewhat rare after the beginning stage, which means it's mostly just writing to existing entries.

You got it perfectly right :slight_smile:

Another important point, which might somehow be used for a performance gain, is that each fill_operations call is very unlikely to modify the same entries as another one (if n_cpu = 8). So after the initial fast growth of the maps, this could be used to locklessly update those values (as long as no other thread is using them). But this is mostly something I would try after I figure out a better way to do the major fill_operations loop.
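(To sketch what I mean by that lockless update, assuming plain stores and loads are enough and no thread needs a cross-thread read-modify-write, each f32 could live behind an AtomicU32 via bit-casting:)

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// An f32 cell that one thread can write while others read it,
/// by storing the bits in an AtomicU32.
struct AtomicF32(AtomicU32);

impl AtomicF32 {
    fn new(v: f32) -> Self {
        Self(AtomicU32::new(v.to_bits()))
    }
    fn store(&self, v: f32) {
        self.0.store(v.to_bits(), Ordering::Relaxed);
    }
    fn load(&self) -> f32 {
        f32::from_bits(self.0.load(Ordering::Relaxed))
    }
}
```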

You suggested you also previously had an approach using locks. So it seems like the fill_operations code wouldn’t really care if it gets interrupted by modifications of data?

For now, I feel like you’re right in that using left_right might be a good option. Honestly, a factor of 2 in memory usage isn’t terribly uncommon anyway; a Vec also usually has a capacity of up to twice its length after it resizes (growing). One thing I also had in mind is whether immutable/persistent data structures might be a reasonable approach. If you don’t have too much mutation of your data structure, they can help avoid the space overhead of duplicating everything.

Another approach to the “wasting loads of potential speed waiting for the scoped thread to join” problem might be: you could try to break fill_operations up into multiple steps, or join multiple fill_operations calls together (depending on what gives the right granularity/scale), to even out the scheduling a bit between the “collect everything together and mutate” breaks.


Those are good suggestions. Breaking fill_operations up might not be a bad idea, but it could be very hard in my case (it’s essentially a tree traversal where we pass values up).

What persistent data structure crates would you suggest? Could they be indexed just as fast as a HashMap? The speed of getting values is very important.

The way DashMap and others did locking made getting values slower than using a normal HashMap, sometimes taking double the time, which is not too useful for me. But I have made loads of changes since I compared them, so I might try that again, though it would also require loads of changes.

I think I will try out left-right and see where that goes because it seems the simplest solution to me right now.

Thank you very much @steffahn :smiley:

I don’t have practical experience using immutable data structures in Rust myself. I know about the im crate, which has a HashMap type. I don’t know how much performance you lose with it. The crate itself claims that its HashMap is similar in speed to std’s HashMap. (Maybe that’s referring to the non-thread-safe version only, since it also claims a 20-25% increase in general performance for the non-thread-safe version over the thread-safe one.)
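For a feel of the API, the persistent-update pattern with im looks roughly like this (a minimal sketch; the key and value types are just placeholders):

```rust
use im::HashMap;

fn main() {
    let v1: HashMap<u64, [f32; 2]> = HashMap::new();
    // `update` returns a new map that shares structure with the old one;
    // `v1` is left untouched, so readers can keep using it.
    let v2 = v1.update(42, [1.0, 2.0]);
    assert_eq!(v1.get(&42), None);
    assert_eq!(v2.get(&42), Some(&[1.0, 2.0]));
}
```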


I expect par_iter_mut().for_each(|ops| ...) would perform better than scope with individual spawns, but the data structure issues are likely more important.

That’s a way to do the “join multiple fill_operations calls together” approach, if you increase the size of the operations vec to a few times bigger than n_cpu, right?

I tried both, but surprisingly it was actually a tiny bit slower. I was stunned as well, because in all other cases par_iter was better. The rayon::scope version was around 40k iter/sec while the par_iter version was 35k iter/sec.

I also tried crossbeam and scoped-thread. crossbeam was slower than running single-threaded, at about 20k iter/sec (I suspect because it started new threads with each spawn), while `scoped-thread` was around 30k iter/sec.

Running single-threaded gives 23k iter/sec, just for comparison.

The way I did it was using par_iter_mut() on a Vec whose items had all the necessary information for running a fill_operations. The length of the vector was the number of CPUs, and of course I reused the vector so I wouldn't have to allocate a new one every time. The outer loop was the same.

Weird, I would expect spawn to be strictly worse, if you keep everything else the same, because it allocates and uses dynamic dispatch. If those bits aren't in the hot path, then the par_iter code shouldn't matter either. But if you've measured this, that's fine.

So there’s a question about requirements: would it be acceptable to do more than one fill_operations call per thread before collecting and applying the operations? If it’s something like 3 or 10 iterations per thread before mutating data again, then rayon’s load balancing can reduce the problem of each call lasting a different amount of time; you’d need to use an array of size 3*n_cpu or 10*n_cpu or something like that and call par_iter_mut() on it, as sketched below.
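Roughly like this (an untested sketch reusing the names from your loop; the factor of 4 is arbitrary):

```rust
use rayon::prelude::*;

// More work items than worker threads, so rayon's work stealing can
// even out fill_operations calls that take different amounts of time.
let mut operations = vec![vec![]; 4 * n_cpu];
loop {
    operations
        .par_iter_mut()
        .for_each(|ops| fill_operations(ops, &data));
    for ops in operations.iter_mut() {
        data.apply_operations(ops);
        ops.clear();
    }
    if exit_condition {
        break;
    }
}
```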

By the way, the left-right approach and the immutable data structures approach have the additional benefit that the mutation part can run in parallel with the worker threads continuing to do useful work.

> additional benefit that the mutation part can run in parallel

That is exactly what I was thinking just now :smiley:
Not sure what percentage of the time is actually spent on the update though. But any gain is a gain.

It would take a lot longer to get through the beginning stage if I had updates less frequently. I am planning on running this on a server with more than 64 cores, so the infrequent updates might already make convergence slower.

It seems more and more likely that the correct way to go is the left-right/immutable approach. (I am reading through the docs now, and once I have implemented one I will post the speed for people who might find it useful.)
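From what I can tell so far, the wiring would look roughly like the counter example in the left-right README, if I’m reading it right (this is just that toy shape, not my actual data and operations):

```rust
use left_right::Absorb;

struct AddOp(i32);

// The data is an i32 here only to show the shape of the trait.
impl Absorb<AddOp> for i32 {
    fn absorb_first(&mut self, op: &mut AddOp, _other: &Self) {
        *self += op.0;
    }
    fn absorb_second(&mut self, op: AddOp, _other: &Self) {
        *self += op.0;
    }
    fn sync_with(&mut self, first: &Self) {
        *self = *first;
    }
}

fn main() {
    let (mut write, read) = left_right::new::<i32, AddOp>();
    write.append(AddOp(1));
    write.publish(); // readers only see the change after publish()
    assert_eq!(*read.enter().unwrap(), 1);
}
```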

Really appreciate the suggestions. Thank you.