Why par_iter() is slower than iter()

This is my first attempt at using Rust. My code compiles, but I might still have some obvious mistakes. I'm trying to perform some computations in parallel. However, when I use par_iter() from rayon, the code runs about twice as slow as with iter().

Here is my code:

use indicatif::{ParallelProgressIterator, ProgressIterator};
use rayon::prelude::*;
...


    let positive_spectra: Vec<&Spectrum> = spectra.iter().filter(|s| s.properties["Ion_mode"] == "P").collect();
    let positive_index: BTreeMap<NotNan<f32>, Vec<&Spectrum>> = create_index(&positive_spectra);
    let positive_mz_values: Vec<NotNan<f32>> = positive_index.keys().cloned().collect();

    let decoy_spectra: Vec<Spectrum> = positive_spectra.par_iter()
        .progress_count(positive_spectra.len() as u64)
        .map(|&spectrum|
            create_decoy_spectrum(spectrum, &positive_index, &positive_mz_values, 15))
        .collect();

I wonder if I'm doing something obviously wrong here and whether it's possible to improve the parallel performance.

Without the actual code it's hard to say what's slowing you down. For example, rayon has no .iter_par() (maybe you misspelled it and it actually was .par_iter()?) nor progress_count (for that one I can't find anything similar in rayon). And what does create_decoy_spectrum do? Not knowing what the methods in your code do means we can't say anything about its performance.

Is this the part where you meant par_iter()? And what is progress_count()?

One thing to be aware of is that rayon's collect::<Vec> is only optimal for indexed iterators, where it can write each item in parallel with direct indexing to their final location. Otherwise, it has to collect items into temporary buffers until it can finally aggregate them. If the rest of your computation is relatively fast, then this extra data movement could be significant overhead.
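Here's a minimal sketch of the difference, with filter() standing in for any adapter that loses indexing:

    use rayon::prelude::*;

    let input: Vec<u64> = (0..1_000_000).collect();

    // Indexed: map() keeps the length and each item's position known, so
    // collect() can write every result directly into its final slot in parallel.
    let doubled: Vec<u64> = input.par_iter().map(|x| x * 2).collect();

    // Not indexed: filter() discards position information, so collect()
    // has to gather items into intermediate buffers and splice them together.
    let evens: Vec<u64> = input.par_iter().copied().filter(|x| x % 2 == 0).collect();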

2 Likes

Also bear in mind that sending data to other threads incurs a performance cost. When trying to parallelize an iteration, you must make sure that:

  • The actual work done on each element is big enough to amortize the cost of sending it to a different thread
  • The number of elements being iterated is large enough for the parallelization to make sense

Otherwise, sticking to a single thread will be more performant.
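If the per-element work turns out to be on the small side, one knob to try is rayon's with_min_len (from IndexedParallelIterator), which puts a floor on how finely the input gets split; a rough sketch:

    use rayon::prelude::*;

    let items: Vec<u32> = (0..1_000_000).collect();

    // Each rayon task will get at least 1024 items, so the cost of
    // cross-thread coordination is amortized over many cheap elements.
    let sum: u64 = items
        .par_iter()
        .with_min_len(1024)
        .map(|&x| x as u64 + 1) // stand-in for cheap per-item work
        .sum();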

7 Likes

@SkiFire13 @cuviper Thank you for quick replies.

Yes, I misspelled par_iter()! The function progress_count() is from the package indicatif to show a progress bar. I've updated the post.

The function create_decoy_spectrum(spectrum, &positive_index, &positive_mz_values, 15) essentially reads data from spectrum, positive_index, and positive_mz_values and creates a new instance of Spectrum. It doesn't change spectrum, positive_index, or positive_mz_values in any way. Posting the entire script would probably take too much space and make it hard to read.

I'll try to debug it. I was just wondering if there are any obvious mistakes since I just started using Rust a few days ago.

In my case, create_decoy_spectrum() is pretty fast (a few milliseconds?), but I have about 500k items to run it on. I guess splitting them into batches and processing those batches in parallel would be better?

Ok, I found that here: ParallelProgressIterator in indicatif - Rust

I think you should be able to use plain progress() there, but either way ProgressBarIter does preserve the "indexed" property, so that caveat I mentioned should be fine.
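If I'm reading the indicatif docs right, progress() is available whenever the parallel iterator is indexed (length known up front), so something like this should work without the explicit count:

    use indicatif::ParallelProgressIterator;
    use rayon::prelude::*;

    let items: Vec<u32> = (0..10_000).collect();

    // progress() needs no length argument because par_iter() over a
    // Vec is indexed; indicatif can ask the iterator for its length.
    let doubled: Vec<u32> = items.par_iter().progress().map(|&x| x * 2).collect();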

The other broad question is: are you running in --release mode? Performance comparisons aren't very useful in the default debug mode, and rayon does add some structural overhead that may get optimized away.

1 Like

@cuviper I compiled the code with cargo build --release and then ran the executable from the target/release/ folder. I didn't create any custom profiles. I assume cargo build --release should produce an optimized binary?

You can do this in one command with cargo run --release.

Have you tried running it without the progress indicator?

That's pretty slow! And 500k items should be enough to notice an improvement. For a task that takes milliseconds of CPU time, I'd expect an improvement from multithreading even with just 2 tasks. And rayon will split the work into batches for you as long as the iterator is indexed.

Make sure your single-threaded code is the same except using iter instead of par_iter. Otherwise, they might be doing different things.
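Using the names from your snippet, the comparison would look like this (progress reporting left out so the two versions differ only in the iterator):

    // Sequential baseline:
    let decoy_spectra: Vec<Spectrum> = positive_spectra
        .iter()
        .map(|&spectrum| create_decoy_spectrum(spectrum, &positive_index, &positive_mz_values, 15))
        .collect();

    // Parallel version: the only change is iter() -> par_iter().
    let decoy_spectra: Vec<Spectrum> = positive_spectra
        .par_iter()
        .map(|&spectrum| create_decoy_spectrum(spectrum, &positive_index, &positive_mz_values, 15))
        .collect();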

Things that could make it slow:

  • You're dealing with global mutable state, like a mutex (see the sketch after this list).
  • You have thread-local state that is expensive to initialize.
  • It's I/O bound.
  • It's memory speed bound.
  • Sequential iteration was relying heavily on cache and branch predictability, and something in the multithreaded version has ruined that.
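On the first point, here's a minimal sketch of the pattern to avoid, with arbitrary stand-in work:

    use rayon::prelude::*;
    use std::sync::Mutex;

    // Anti-pattern: every thread queues up on one lock, so the
    // "parallel" loop runs mostly serially.
    let results = Mutex::new(Vec::new());
    (0..1_000_000).into_par_iter().for_each(|i| {
        results.lock().unwrap().push(i * 2);
    });

    // Better: let rayon gather per-thread results without a shared lock.
    let results: Vec<i32> = (0..1_000_000).into_par_iter().map(|i| i * 2).collect();
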
4 Likes

@drewtato I've done some more testing. Running the code on an array of 73,000 items (spectra) takes about 28.5 min with iter(), about 22.5 min with par_iter(), and 22 min with par_chunks(100). So par_iter and par_chunks are slightly faster than iter, but I'd expect more of an improvement from the parallelization. I'm running it on a laptop with 6 cores (AMD Ryzen).

Interestingly, on a smaller array of 24,000 items I've noticed a better improvement from the parallel methods (15.15 sec with par_chunks(100) vs. 34.38 sec with iter()). Maybe it has something to do with the large variables? Although I'm passing references to them, not cloning them (as far as I understand it).

I've read about using Arc and Sync. Do I still need to use them if I'm only reading the global variables, and not writing?

You need Sync no matter what, so you're fine there. One of the benefits of rayon is you don't need Arc.

By global variables, I mean things the function can access that aren't one of the parameters, like statics. You probably don't have those.

Are you doing any I/O? If not, it sounds like you're just memory bound. Actually, since you said you're on a laptop, it could even be power constrained, but that's typically very difficult to achieve.

I also checked, and the only AMD mobile CPU with just 6 logical cores is the 4500U. All the other mobile CPUs with 6 physical cores have SMT, so 12 logical cores, and rayon will use 12 threads by default.
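For what it's worth, you can check and pin the thread count yourself; the 6 below is just an example matching your physical core count:

    use rayon::ThreadPoolBuilder;

    fn main() {
        // Pin the global pool to the physical core count instead of
        // the default of one thread per logical core.
        ThreadPoolBuilder::new().num_threads(6).build_global().unwrap();
        println!("rayon threads: {}", rayon::current_num_threads());
    }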

Also, tripling the item count and getting a ~100x slowdown is pretty rough. As always, the best way to gain performance is to use a better algorithm.

How are you reading them?

Ah, I meant the parameters that I pass to create_decoy_spectrum().

Yes, I understand that the time complexity is pretty bad. But changing the algorithm is not an option right now. I'd like to make the current algorithm work as it is.

The program uses up to 5 GB on my laptop; the total RAM is 16 GB. Can Rust use all available RAM if needed, or do I need to specify a maximum heap size like in Java?

Rust can't automatically use more or less RAM. If you want to use more, you need to figure out something to put in it. You're already doing that by creating positive_index and positive_mz_values, which is good, but if there's any duplicated work left in create_decoy_spectrum, then it'll probably be beneficial to build more precomputed collections.

But what I meant before is that you're limited by RAM speed or latency.

Can anyone elaborate on this? I guess "indexed iterator" means IndexedParallelIterator, right? How much faster can this be, and how do I create an iterator with this trait over ordinary structures?

Are you mixing up seconds and minutes there? That indicates that tripling the size of the data set increases the run time by a factor of 90 or so. That is worse than O(n²).

That suggests to me that parts of the problem depend on other parts, that they cannot be split into independent chunks, and so cannot be parallelized very much.

This rather supports my guess.

Have a look at RwLock: RwLock in std::sync - Rust
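Here's a minimal sketch: many readers can hold the lock at once, while a writer gets exclusive access. (If your data really is read-only, plain shared references are enough and no lock is needed.)

    use std::sync::RwLock;

    let index = RwLock::new(vec![1.0_f32, 2.0, 3.0]);

    // Any number of threads can read concurrently:
    {
        let guard = index.read().unwrap();
        println!("first = {}", guard[0]);
    }

    // A write waits until all readers are done, then has exclusive access:
    index.write().unwrap().push(4.0);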

Yes. There's no VM to set a maximal heap size for. Rust compiles to a binary that runs directly on the machine and can request as much memory as the underlying OS can give.

Sync is a marker trait, so it's not something you actively use. Your types either are or aren't Sync, and there's nothing you can do about that fact; Sync is something you require as a trait bound on generics if you want your generic type/function to be usable in a concurrent context. [1]
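A tiny made-up illustration of what requiring Sync as a bound looks like in practice:

    use rayon::prelude::*;

    // T: Sync is what allows &T (here, the shared slice) to be handed
    // out to rayon's worker threads; without the bound this won't compile.
    fn lookup_all<'a, T: Sync>(items: &[usize], table: &'a [T]) -> Vec<&'a T> {
        items.par_iter().map(|&i| &table[i]).collect()
    }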

And Arc is unnecessary if you have a true read-only global. Just declare the global in a static and you can access (read) it from anywhere else.
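For example (OnceLock is in std since 1.70; the name and contents below are placeholders):

    use std::sync::OnceLock;

    // A read-only global: initialized once, then readable from any
    // thread without Arc or locks.
    static MZ_VALUES: OnceLock<Vec<f32>> = OnceLock::new();

    fn mz_values() -> &'static [f32] {
        MZ_VALUES.get_or_init(|| vec![100.0, 200.0, 300.0])
    }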


  1. You can unsafely implement Sync sometimes, but that's not something you should do in general, because you will get it wrong, and it's also irrelevant to this discussion. ↩︎

1 Like

Thank you! I'll take a look at RwLock.

The complexity is worse than O(n²), but at this point I cannot change the algorithm.

I'll try to make up a simpler example and post it here.

Knowing nothing about your problem, my gut tells me that if totally independent data sets were being processed, then running them at the same time on two cores would get the whole job done in pretty much half the time, assuming they are not saturating your memory bandwidth. The problems come when trying to split a job where working on one half requires data and/or results from the other half. Then there is a lot of coupling between the two, and they end up fighting over resources or waiting for each other's sub-results. And I suspect that with large data a lot of cache thrashing starts to happen, essentially serialising the two threads back into one.

How about a simple experiment: run your program twice at the same time with the same data. Being independent processes, we would like to see the two jobs finish in the same time as running just one. This would at least show that your machine can actually do that much work spread over two of its cores.

If that is so, I suspect your algorithm simply does not parallelise well.

1 Like