How to make rayon's `iter.par_bridge()` as fast as `vec.into_par_iter()`?

Recently, I refactored my codebase to solve a memory issue. All the Vecs were taking up too much memory, so I decided to switch to lazy iterators instead. However, I ran into another surprise: the performance of my program was worse than the Vec version's.

I then ran cargo flamegraph. In the iterator-bridge version there are multiple wide white holes (probably because only one thread is running there). The Vec version still has holes, but they are narrower.

Is there a way to solve the memory issue without compromising speed?

If you can, make it work in chunks: instead of making Vec.iter() lazy, make Vec.chunks(n) lazy and lazily prepare n items at a time. This amortises the cost of distributing the work one item at a time.
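A minimal sketch of that idea, where `source` and `work` are placeholders for your actual lazy iterator and per-item job (not your real code):

```rust
use rayon::prelude::*;

/// Batch a serial iterator into Vecs of up to `n` items and bridge the
/// batches, so the par_bridge lock is taken once per batch rather than
/// once per item.
fn chunked_par_bridge<I, T>(source: I, n: usize, work: impl Fn(T) + Sync + Send)
where
    I: Iterator<Item = T> + Send,
    T: Send,
{
    let mut source = source;
    // Lazily turn the item iterator into an iterator of batches.
    let batches = std::iter::from_fn(move || {
        let batch: Vec<T> = source.by_ref().take(n).collect();
        (!batch.is_empty()).then_some(batch)
    });

    batches
        .par_bridge() // the distribution cost is now paid per batch
        .for_each(|batch| batch.into_iter().for_each(&work));
}
```

Then something like `chunked_par_bridge(my_lazy_iter, 128, |item| do_work(item))` takes the place of calling `.par_bridge().for_each(...)` directly on the item iterator (names here are made up for illustration).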


How do I figure out which value of n is optimal?

You have to try different values and measure the results; there are too many factors to predict.

But if the value you choose is larger than (number of individual items / number of CPU cores), you will lose some parallelism, and values near that order of magnitude will result in lopsided utilization, so that gives you an idea of the upper bound worth trying.

So, say you want the program to run well with 8 cores and typically 10,000 items to work on; then 10,000 / 8 = 1250. We want to stay well below that, and yet large enough to get significant savings from the chunking, so you might try these powers of 2 for chunk sizes: 16, 32, 64, 128, 256, 512.
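A crude way to compare candidates is just to time each run; `run_with_chunk_size` below is a placeholder for whatever drives your pipeline, and a benchmark harness such as criterion would give steadier numbers:

```rust
use std::time::Instant;

fn main() {
    for n in [16usize, 32, 64, 128, 256, 512] {
        let start = Instant::now();
        run_with_chunk_size(n); // placeholder: run your chunked pipeline with chunk size n
        println!("chunk size {n:>4}: {:?}", start.elapsed());
    }
}
```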


A big caveat with par_bridge() is that it serializes on a mutex for every call to your serial Iterator::next(). If this isn't producing items fast enough, most of the parallel threads will just be waiting for it.

Could you possibly convert more (or all!) of this lazy iterator source to a parallel iterator?
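For contrast, a toy sketch (with a made-up `inputs` slice and `do_work` function) of the two shapes:

```rust
use rayon::prelude::*;

fn compare(inputs: &[u64]) {
    // True parallel iterator: rayon splits the slice across threads directly.
    inputs.par_iter().for_each(|x| do_work(*x));

    // Bridged serial iterator: every next() call goes through a shared mutex.
    inputs.iter().par_bridge().for_each(|x| do_work(*x));
}

fn do_work(_x: u64) {
    // placeholder for the real per-item job
}
```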


std::fs::read_dir only returns a lazy iterator. If there were a way to instantly produce a random-accessible list of file names with a known length, that would be great, but I doubt even the Linux kernel's C APIs offer one (all the C API provides are null-terminated lists/arrays).

Also, I don't read the files, but I do "stat" them (i.e. symlink_metadata).

Generally bigger is better, up to n = total number of items / number of cores.

For reading a directory, where you don't already have a ready batch of items to dispatch, you can keep increasing n for every batch.
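Roughly like this sketch, where doubling up to an arbitrary cap is just one possible growth policy and the per-entry work is only hinted at:

```rust
use rayon::prelude::*;
use std::fs;
use std::path::Path;

fn stat_dir(dir: &Path) -> std::io::Result<()> {
    let mut entries = fs::read_dir(dir)?;
    let mut n = 16; // start small so the first batch is dispatched quickly
    loop {
        // Pull the next batch of directory entries off the serial iterator.
        let batch: Vec<_> = entries.by_ref().take(n).collect();
        if batch.is_empty() {
            break;
        }
        // Stat the whole batch in parallel before fetching the next one.
        batch.into_par_iter().for_each(|entry| {
            if let Ok(entry) = entry {
                let _meta = fs::symlink_metadata(entry.path());
                // ... do something with the metadata here ...
            }
        });
        n = (n * 2).min(4096); // grow the batch size, capped arbitrarily
    }
    Ok(())
}
```

Each batch is finished before the next directory read starts, which is usually fine once n is large enough to keep all the cores busy.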

Stat-ing many files has high syscall overhead. You might want to look into io_uring and see whether batching the stat calls provides a speedup.