No speedup for parallel loop with rayon


#1

I’m having trouble using rayon to parallelize a workflow. This is for a toy problem for advent of code but I find rayon to be interesting and I’d like to figure out if there’s something basic I’m missing or if there’s an issue with my code that I could fix.

My code is here:

This is for day 5 of this year’s advent of code challenge.

In this particular solution I’m using a snippet like this to do a parallel loop over the letters of the alphabet:

use rayon::prelude::*; // brings `par_chars` into scope

fn best_react(contents: String) -> usize {
    // For each letter, strip both cases from the input, react what's left,
    // and take the length of the shortest result.
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ".par_chars().map(|test| {
        let this_contents = contents.replace(test, "");
        let this_contents = this_contents
            .replace(test.to_ascii_lowercase(), "");

        react(this_contents).len()
    }).min().unwrap()
}

This compiles and runs; however, I don’t see any speedup from multicore parallelism compared to replacing the par_chars call with a plain chars call.

Is there something I’m missing here? Or is there something in my code that prevents the loop from getting parallelized?


#2

In general, the biggest issue you’ll run into with pretty much any Advent of Code puzzle is that you can’t give Rayon enough work for the benefits of parallelism to outweigh the overhead that Rayon imposes.

For example, here you’re only parallelizing over a set of 26 elements, which is basically nothing. That could still yield a good speedup if there were a large amount of work per element, but from a simple reading of this code it doesn’t seem like there is much, unless contents is a pretty large string.
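You can see the effect even without Rayon. Here’s a minimal sketch using scoped std threads in place of Rayon’s pool; the `work` function is a made-up stand-in for one per-letter unit of work, not the actual `react`:

```rust
use std::time::Instant;

// Hypothetical stand-in for the per-letter work: a single cheap pass
// over the input (count the characters that survive a filter).
fn work(input: &str) -> usize {
    input.chars().filter(|c| c.is_ascii_lowercase()).count()
}

fn main() {
    let small = "ab".repeat(10);        // tiny input: thread overhead dominates
    let large = "ab".repeat(1_000_000); // large input: the work dominates

    for input in [&small, &large] {
        // Sequential baseline over 26 "letters".
        let t = Instant::now();
        let seq: usize = (0..26).map(|_| work(input)).sum();
        let seq_time = t.elapsed();

        // Parallel version: one scoped thread per "letter".
        let t = Instant::now();
        let par: usize = std::thread::scope(|s| {
            (0..26)
                .map(|_| s.spawn(|| work(input)))
                .collect::<Vec<_>>() // spawn them all before joining
                .into_iter()
                .map(|h| h.join().unwrap())
                .sum()
        });
        let par_time = t.elapsed();

        assert_eq!(seq, par);
        println!("len {:>9}: seq {:?} vs par {:?}", input.len(), seq_time, par_time);
    }
}
```

On the tiny input the parallel version is typically no faster (often slower), while on the large one it can win; Rayon’s pool has less overhead than spawning raw threads, but the same trade-off applies.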


#3

Each loop iteration takes a few tenths of a second - this_contents is about 10,000 characters long. That seems like a decent amount of work to me.


#4

Ah if I artificially increase the size of the input I do see some speedup, so I guess my problem is too small.


#5

What does rayon::current_num_threads() return? And if you tweak the RAYON_NUM_THREADS environment variable, do you see any change in the runtime?
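For reference, this is the kind of sweep I’d run; the binary name day5 here is hypothetical, substitute your own:

```shell
# RAYON_NUM_THREADS caps the size of rayon's global thread pool.
# If the runtime barely changes between 1 and 4, the hot path is
# probably outside the parallel section.
for n in 1 2 4; do
    echo "RAYON_NUM_THREADS=$n"
    RAYON_NUM_THREADS=$n time ./day5 > /dev/null
done
```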


#6

rayon::current_num_threads() evaluates to 4. time also tells me that I’m using more than 100% of my CPU when I use an artificially large input, but not when I’m using the smaller input values that I’ve checked in to the repo.


#7

At first glance… I’m stumped. The problem isn’t big, but it’s still enough that I’d expect some parallelism.

I first commented out the part1 react, and perf tells me that 91% of what remains is still in react, presumably called by best_react. And a further 7% is in memmove, probably from react shifting values around.

I even added a debug line to make sure they’re really spreading across threads:

        eprintln!("{:?} on {:?}", test, std::thread::current().id());

… which indeed shows all threads active. But overall I still only get like 1.032 CPUs utilized.


#8

Aha!

    println!("{}", best_react(react(contents)));

The majority of your time is spent in this first react call, before you ever enter the parallel best_react. If you change this to best_react(contents), then it does show some real parallelism.

(But then it takes longer overall, since each parallel iteration repeats work that the initial react had already done once. And kudos, I actually didn’t realize it would be safe to pre-react the input like this when I did the problem!)
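To make the two call shapes concrete, here’s a sketch; react and best_react are stubbed so it compiles on its own, the real ones are in the repo:

```rust
// Stubs standing in for the real functions from the thread's code.
fn react(s: String) -> String { s }
fn best_react(s: String) -> usize { s.len() }

fn main() {
    let contents = String::from("dabAcCaCBAcCcaDA");

    // Before: a long sequential react() runs before the parallel part,
    // so most of the wall-clock time is single-threaded.
    println!("{}", best_react(react(contents.clone())));

    // After: the raw input goes straight into the parallel best_react(),
    // so the bulk of the work happens inside the rayon loop.
    println!("{}", best_react(contents));
}
```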


#9

Ahh, cool! Thank you! When I said “artificially large input” I was just deleting that optimization. D’oh!

The rust community rocks, thank you for taking time to help.