Why is my multithreaded Rust solution slow compared to the same C++ solution?

Recently I was solving Project Euler problem #94 in C++ and thought of implementing the same in Rust and comparing their execution times.

SPOILER ALERT: contains a solution for Project Euler problem #94.

It turns out the same implementation in Rust is twice as slow as the C++ solution.
I would like to know the cause of this. Thank you!

#include <array>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

const uint64_t MAX = 1'000'000'000; // Set the value of MAX here
const uint64_t THREAD_COUNT = 10;

void calculatePerimeter(uint64_t thread_id, uint64_t& perimeter) {
	uint64_t start_a = 1 + thread_id;

	for (uint64_t a = start_a; 3 * a + 1 <= MAX; a += THREAD_COUNT) {
		std::array<uint64_t, 2> c_values = { a - 1, a + 1 };
		for (uint64_t c : c_values) {
			if (c > 0) {
				if (2 * a + c > MAX) {
					continue;
				}
				uint64_t second = 4 * a * a - c * c;
				uint64_t root = static_cast<uint64_t>(std::sqrt(second));
				if (root * root != second) {
					continue;
				}
				second = root;
				uint64_t numerator = c * second;
				if (numerator > 0 && numerator % 4 == 0) {
					perimeter += 2 * a + c;
				}
			}
		}
	}
}

int main() {
	std::vector<std::thread> threads;
	std::vector<uint64_t> perimeters(THREAD_COUNT);

	auto start_time = std::chrono::high_resolution_clock::now();

	for (uint64_t thread_id = 0; thread_id < THREAD_COUNT; ++thread_id) {
		threads.emplace_back(calculatePerimeter, thread_id, std::ref(perimeters[thread_id]));
	}

	uint64_t total_perimeter = 0;
	for (auto& thread : threads) {
		thread.join();
	}
	for (const auto& perimeter : perimeters) {
		total_perimeter += perimeter;
	}

	auto end_time = std::chrono::high_resolution_clock::now();
	auto elapsed_time =
	std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);

	std::cout << "Answer = " << total_perimeter << std::endl;
	std::cout << "Elapsed time: " << elapsed_time.count() << " milliseconds" << std::endl;

	return 0;
}

use std::thread;
use std::time::Instant;

fn main() {
    const MAX: u64 = 1_000_000_000; // Set the value of MAX here
    const THREAD_COUNT: u64 = 10;

    let mut handles = Vec::with_capacity(THREAD_COUNT as usize);
    let start_time = Instant::now(); // Start measuring the execution time

    for thread_id in 0..THREAD_COUNT {
        let handle = thread::spawn(move || {
            let start_a = 1 + thread_id;
            let mut perimeter = 0;

            for a in (start_a..).step_by(THREAD_COUNT as usize) {
                let c_values = [a - 1, a + 1];
                for &c in &c_values {
                    if c > 0 {
                        if 2 * a + c > MAX {
                            continue;
                        }
                        let mut second = 4 * a * a - c * c;
                        let root = (second as f64).sqrt() as u64;
                        if root * root != second {
                            continue;
                        }
                        second = root;
                        let numerator = c * second;
                        if numerator > 0 && numerator % 4 == 0 {
                            perimeter += 2 * a + c;
                        }
                    }
                }
                if 3 * a + 1 > MAX {
                    break;
                }
            }

            perimeter
        });

        handles.push(handle);
    }

    let mut total_perimeter = 0;
    for handle in handles {
        total_perimeter += handle.join().unwrap();
    }
    let elapsed_time = start_time.elapsed(); // Calculate the elapsed time

    println!("Answer = {}", total_perimeter);
    println!("Elapsed time: {:?}", elapsed_time);
}

UPD: I was using rustc directly to create the binary.

Obligatory question:

Are you running the Rust code in release mode?

cargo run --release


One difference I can find with perf is that Rust has extra comparisons on the float to implement a safe saturating conversion here:

let root = (second as f64).sqrt() as u64;

There's a fair gain by using the unchecked version:

let root = unsafe { (second as f64).sqrt().to_int_unchecked::<u64>() };
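For context, here is a small standalone sketch (not from the original code) of why `as` needs those extra comparisons: a float-to-int `as` cast in Rust is defined to saturate at the target type's bounds and to map NaN to zero, so the compiler must emit range checks, while `to_int_unchecked` omits them and makes any out-of-range or non-finite input undefined behavior.

```rust
fn main() {
    // `as` is a saturating cast: the compiler inserts comparisons so that
    // NaN and out-of-range values are handled without undefined behavior.
    assert_eq!(f64::NAN as u64, 0);
    assert_eq!(f64::INFINITY as u64, u64::MAX);
    assert_eq!((-1.5_f64) as u64, 0);
    assert_eq!(2.9_f64 as u64, 2); // in-range values truncate toward zero

    // `to_int_unchecked` skips those checks; the caller must guarantee the
    // value is finite and fits in the target type, or the behavior is undefined.
    let root = unsafe { 9.0_f64.sqrt().to_int_unchecked::<u64>() };
    assert_eq!(root, 3);
}
```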

I used rustc main.rs && ./main earlier.
As you said, --release made it really fast.
It executes in ~1s now! Yay!

For fast PE algorithms, this SO answer is almost obligatory :wink:


Agreed, I also got a nontrivial perf increase by using the unchecked float-to-int conversion in my software rasterizer. IIRC I couldn't get LLVM to optimize out the checks even though the loop ran from a: f32 to b: f32 (both asserted to be finite before the loop), incrementing by 1.0f32 per iteration.


If you do want to use rustc directly for some reason (though there aren’t many), the magic words are rustc -O (there are other optimization-related flags but that’s the most important one). You’ll also likely want to use --edition=2021 because the default is the eight-year-old 2015 edition. But mostly, just use Cargo.
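Putting that together, the direct-rustc equivalent of cargo run --release would look something like this (a sketch; main.rs stands in for whatever the source file is actually called):

```shell
# Optimized build with the 2021 edition, then run the produced binary
rustc -O --edition=2021 main.rs
./main
```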


This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.