New version of mandel-rust: uses Rayon, added benchmark

Hi everyone,

I've updated my small Rust mandelbrot example and it now uses Rayon as another way to show how to parallelize this kind of computation.

The readme file now also contains some images and benchmark results.

Rayon scales really nicely and is surprisingly easy to use for numeric computation, provided your problem can be divided into smaller chunks (which should be true for nearly all numeric problems).
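
As a rough illustration of what "dividable into smaller chunks" means here, the sketch below halves the image recursively and renders the two halves in parallel. It uses only `std::thread::scope` as a stand-in: Rayon schedules such splits on a work-stealing pool instead of spawning an OS thread per split, and the `View` struct, `render_rows`, `render_split` and `min_rows` are hypothetical names, not taken from mandel-rust.

```rust
use std::thread;

/// Escape-time iteration count for the point c = (re, im).
fn mandel_iter(max_iter: u32, re: f64, im: f64) -> u32 {
    let (mut zr, mut zi) = (0.0_f64, 0.0_f64);
    let mut iter = 0;
    while iter < max_iter && zr * zr + zi * zi <= 4.0 {
        let t = zr * zr - zi * zi + re;
        zi = 2.0 * zr * zi + im;
        zr = t;
        iter += 1;
    }
    iter
}

/// Hypothetical view parameters (illustrative names only).
#[derive(Clone, Copy)]
struct View {
    re1: f64,
    im1: f64,
    step: f64,
    size: usize,
    max_iter: u32,
}

/// Serial base case: renders the rows in `rows`, which start at `first_row`.
fn render_rows(v: View, first_row: usize, rows: &mut [u32]) {
    for (i, row) in rows.chunks_mut(v.size).enumerate() {
        let im = v.im1 + (first_row + i) as f64 * v.step;
        for (x, px) in row.iter_mut().enumerate() {
            *px = mandel_iter(v.max_iter, v.re1 + x as f64 * v.step, im);
        }
    }
}

/// Splits the rows in half until a piece has at most `min_rows` rows,
/// rendering one half on a new scoped thread and the other on this thread.
fn render_split(v: View, first_row: usize, rows: &mut [u32], min_rows: usize) {
    let n_rows = rows.len() / v.size;
    if n_rows <= min_rows.max(1) {
        render_rows(v, first_row, rows);
        return;
    }
    let half = n_rows / 2;
    let (top, bottom) = rows.split_at_mut(half * v.size);
    thread::scope(|s| {
        s.spawn(move || render_split(v, first_row, top, min_rows));
        render_split(v, first_row + half, bottom, min_rows);
    });
}
```

Calling `render_split(view, 0, &mut image, 64)` on a 1024-row image would stop splitting at pieces of 64 rows. Because `split_at_mut` hands out disjoint slices, no locking is needed.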

I'll look into other crates like ArrayFire, Collenchyma and Timely Dataflow.

If you have questions, comments, etc., just post them here.


Wow, nice to see some benchmarks.
In theory Rayon should have worked well, so it's nice to see that Rayon's principles indeed work nicely.
Could you test Rayon by manually setting thread affinity (to lower the number of available cores)? Actually, I guess I could do that myself since the code is available.

Could you also please add crossbeam in there? (Or doesn't it fit the benchmarks?)

Btw, I would expect ArrayFire to perform very well, "based on their advertising" (from the original project).

Hmm, I've put it in a virtual machine limited to 2 processors and 3 GB of RAM.
Here are my results:
Run 1:

Time taken for this run (serial_mandel): 2094.40230 ms
Time taken for this run (parallel_mandel): 1465.53542 ms
Time taken for this run (simple_parallel_mandel): 1447.04623 ms
Time taken for this run (rayon_mandel): 1556.87306 ms

Run 2:

Time taken for this run (serial_mandel): 2036.79616 ms
Time taken for this run (parallel_mandel): 1424.28748 ms
Time taken for this run (simple_parallel_mandel): 1443.81117 ms
Time taken for this run (rayon_mandel): 1546.72290 ms

Run 3 (closed all other applications I had open, all of them idle):

Time taken for this run (serial_mandel): 2065.52431 ms
Time taken for this run (parallel_mandel): 1440.78751 ms
Time taken for this run (simple_parallel_mandel): 1462.32953 ms
Time taken for this run (rayon_mandel): 1567.82734 ms

Note that I updated time, num and clap crates and compiled with rustc 1.8.0-nightly (18b851bc5 2016-01-22)

Update:
Run 4, built against the latest Rayon from Git:

Time taken for this run (serial_mandel): 2077.69119 ms
Time taken for this run (parallel_mandel): 1459.22509 ms
Time taken for this run (simple_parallel_mandel): 1439.95422 ms
Time taken for this run (rayon_mandel): 1035.93366 ms

Nice! I see similarly good results for rayon (Intel i5, 2 cores/4 threads, nightly Rust):

Time taken for this run (serial_mandel): 1273.77955 ms
Time taken for this run (parallel_mandel): 475.52886 ms
Time taken for this run (simple_parallel_mandel): 511.10182 ms
Time taken for this run (rayon_mandel): 456.43023 ms
// with rayon from git repository
Time taken for this run (rayon_mandel): 346.78605 ms

Can I suggest leaving the computation time out of the generated .ppms to allow easy consistency checks (e.g. md5sum *.ppm)?

Ok, challenge accepted.
i7-2630QM (4 cores / 8 threads) on Windows (from experience, it should be faster on Linux):

Time taken for this run (serial_mandel): 1895.10763 ms
Time taken for this run (parallel_mandel): 499.77345 ms
Time taken for this run (simple_parallel_mandel): 501.61615 ms
Time taken for this run (rayon_mandel): 361.35119 ms
// with rayon from git
Time taken for this run (rayon_mandel): 312.15449 ms

The git version of rayon always takes 312-314ms.
Next I'll have to take my RPi 2 out and cross-compile...

Btw, rayon's speed seems more predictable. While the times for parallel_mandel and simple_parallel_mandel jump around a bit, rayon stays within a small range (like 312-314 ms).

Hi LilianMoraru,

thanks for your feedback and for providing some results!

I'll try to run Rayon with fewer cores to see how it compares, and I'll also add Crossbeam to my TODO list.

Rayon from the git repository seems to have improved; I'll run some tests with the newest version.

Hi birkenfeld,

nice to see more numbers :wink:

Since ppm is a text format, you can just use diff to make sure the results are correct. (It should then show only the run time as a difference.)

But I can add a command line option to turn on the computation time in the ppm files and disable it by default.

I know. But instead of running diff several times it is still quicker to do md5sum :slight_smile:

Hi birkenfeld,

that's true - so your wish has been granted :wink:
The new version now supports the "--write_meta_data" flag which is off by default.
And I've also tested rayon git.
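
To illustrate why leaving the run time out makes checksum comparisons work, here is a minimal sketch of a plain-text (P3) PPM writer with optional metadata. The function name `to_ppm`, the grayscale mapping and the wording of the comment line are assumptions for illustration; the actual file layout written by mandel-rust and by `--write_meta_data` may differ.

```rust
use std::fmt::Write;

/// Renders iteration counts as a plain-text (P3) PPM image.
/// `elapsed_ms` is optional metadata; when it is `None`, the output
/// depends only on the pixel data, so identical images produce
/// byte-identical files.
fn to_ppm(image: &[u32], size: usize, max_iter: u32, elapsed_ms: Option<f64>) -> String {
    let mut out = String::from("P3\n");
    if let Some(ms) = elapsed_ms {
        // A '#' starts a comment in the PPM header.
        writeln!(out, "# computation time: {:.5} ms", ms).unwrap();
    }
    // Width, height and maximum color value.
    writeln!(out, "{} {}\n255", size, size).unwrap();
    for row in image.chunks(size) {
        let line: Vec<String> = row
            .iter()
            .map(|&n| {
                // Points that never escaped are black, the rest are gray.
                let g: u64 = if n >= max_iter {
                    0
                } else {
                    255 - (n as u64 * 255 / max_iter as u64)
                };
                format!("{0} {0} {0}", g)
            })
            .collect();
        out.push_str(&line.join(" "));
        out.push('\n');
    }
    out
}
```

With `elapsed_ms` set to `None`, files from the serial and parallel versions can be compared with a single `md5sum *.ppm`. With metadata enabled, they differ in the comment line even when the pixels match.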

Crossbeam doesn't have a thread pool, is that right? So I have to come up with a different solution.

My mistake, crossbeam mainly aims at lock-free data structures (it offers a few more options).
It does not do threading stuff, although it offers spawning scoped threads.
So, from my understanding, you would use one of these libraries in combination with crossbeam rather than one or the other; they do not overlap in features.

No problem :wink:
I'll then just use the scoped threads and think about a way to limit the number of threads that are spawned.
Or I'll find a way to combine them as you suggested.
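
One possible way to cap the thread count with scoped threads is to cut the image into at most `num_threads` bands of rows and spawn one thread per band. The sketch below uses `std::thread::scope` from the standard library rather than crossbeam, and `banded_mandel` is a hypothetical name; the view (re from -2.0 to 1.0, img from -1.5 to 1.5) follows the configuration printed in the benchmark output.

```rust
use std::thread;

/// Escape-time iteration count for the point c = (re, im).
fn mandel_iter(max_iter: u32, re: f64, im: f64) -> u32 {
    let (mut zr, mut zi) = (0.0_f64, 0.0_f64);
    let mut iter = 0;
    while iter < max_iter && zr * zr + zi * zi <= 4.0 {
        let t = zr * zr - zi * zi + re;
        zi = 2.0 * zr * zi + im;
        zr = t;
        iter += 1;
    }
    iter
}

/// Fills `image` (size x size pixels) using at most `num_threads`
/// scoped threads, one per contiguous band of rows.
fn banded_mandel(image: &mut [u32], size: usize, max_iter: u32, num_threads: usize) {
    if size == 0 {
        return;
    }
    let (re1, im1, step) = (-2.0, -1.5, 3.0 / size as f64);
    let threads = num_threads.max(1);
    // Rows per band, rounded up, so there are never more bands than threads.
    let rows_per_band = (size + threads - 1) / threads;
    thread::scope(|s| {
        for (band, rows) in image.chunks_mut(rows_per_band * size).enumerate() {
            s.spawn(move || {
                for (i, row) in rows.chunks_mut(size).enumerate() {
                    let y = band * rows_per_band + i;
                    let im = im1 + y as f64 * step;
                    for (x, px) in row.iter_mut().enumerate() {
                        *px = mandel_iter(max_iter, re1 + x as f64 * step, im);
                    }
                }
            });
        }
    });
}
```

Fixed bands are simple but not balanced: rows that cross the set need more iterations than rows near the edge of the view, so some threads finish earlier than others.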

Finally finished cross-compiling Rust (1.6 stable) and installing the new Raspbian.
On RPi 2 (quad-core, full desktop, num_threads=4):

Time taken for this run (serial_mandel): 13211.43625 ms
Time taken for this run (parallel_mandel): 3289.22000 ms
Time taken for this run (simple_parallel_mandel): 3311.23333 ms
Time taken for this run (rayon_mandel): 3291.76089 ms

Second run with num_threads=8 (out of curiosity):

Time taken for this run (serial_mandel): 13212.37645 ms
Time taken for this run (parallel_mandel): 3291.99578 ms
Time taken for this run (simple_parallel_mandel): 3305.92166 ms
Time taken for this run (rayon_mandel): 3300.60385 ms

In both cases it seems that parallel_mandel outperformed rayon (git) a little bit.
But I still think rayon seems the safest bet in this particular benchmark.

CPU Intel Core i7-4790 @ 3.60GHz (4 Cores | 8 Threads), 16GB RAM
Built with Rust 1.6 stable, in release mode

     Running `target/release/mandel`
Configuration: re1: -2.00, re2: 1.00, img1: -1.50, img2: 1.50, max_iter: 2048, img_size: 1024, num_threads: 2
Time taken for this run (serial_mandel): 1187.96743 ms
Time taken for this run (parallel_mandel): 835.18255 ms
Time taken for this run (simple_parallel_mandel): 850.95358 ms
Time taken for this run (rayon_mandel): 167.01803 ms

Hope this data helps.

Edit: I've recently been playing with libdispatch (from Apple) on Linux and ran some tests with the dispatch crate (right now I'm looking at what's in the ported libdispatch, what's in the crate, etc.), with some basic operations working. When I have some time, I'll try to get a dispatch version of mandel-rust working and see how it fares.

Note that you used only 2 threads (that's the default).
For the other 2 parallel implementations to have a fair fight, in your case, you have to run the mandel binary with --num_threads=8.

i.MX6Q (quad-core, 1.2 GHz per core, 2 GB of RAM); a lot of services run on it, but all are idle.
Note that the system in this case uses softfp, so it is obviously the worst case for something working mainly with floating-point numbers.

num_threads: 4
Time taken for this run (serial_mandel): 159389.71569 ms || 160336.35569 ms
Time taken for this run (parallel_mandel): 42583.61501 ms || 42044.68601 ms
Time taken for this run (simple_parallel_mandel): 42788.38400 ms || 42149.44934 ms
Time taken for this run (rayon_mandel): 42676.91767 ms || 42392.05867 ms

On ARM, parallel_mandel seems to work quite well (hf or softfp).

Ok, updated numbers on i7 machine:

     Running `target/release/mandel --num_threads 8`
Configuration: re1: -2.00, re2: 1.00, img1: -1.50, img2: 1.50, max_iter: 2048, img_size: 1024, num_threads: 8
Time taken for this run (serial_mandel): 1190.79262 ms
Time taken for this run (parallel_mandel): 242.11304 ms
Time taken for this run (simple_parallel_mandel): 229.32596 ms
Time taken for this run (rayon_mandel): 166.80914 ms

Now, I've managed to add the Linux port of libdispatch to mandel-rust by patching the existing dispatch crate, so I can quickly test the libdispatch port. With that added, the results on the i7 machine look like this:

     Running `target/release/mandel --num_threads 8`
Configuration: re1: -2.00, re2: 1.00, img1: -1.50, img2: 1.50, max_iter: 2048, img_size: 1024, num_threads: 8
Time taken for this run (serial_mandel): 1190.36614 ms
Time taken for this run (parallel_mandel): 233.26369 ms
Time taken for this run (simple_parallel_mandel): 229.82853 ms
Time taken for this run (rayon_mandel): 161.29813 ms
Time taken for this run (dispatch_serial_mandel): 1209.53670 ms
Time taken for this run (dispatch_async_mandel): 862.91629 ms

This is the implementation of the dispatch version, which has notable limitations, some of them imposed by my knowledge of Rust, others by the crate:


#[cfg(feature = "with_dispatch")]
fn dispatch_serial_mandel(mandel_config: &MandelConfig, image: &mut [u32]){
    use dispatch::{Queue, QueueAttribute};

    let queue = Queue::create("com.rust.mandel", QueueAttribute::Serial);

    for y in 0..mandel_config.img_size {
        for x in 0..mandel_config.img_size {
            queue.sync(|| image[((y * mandel_config.img_size) + x) as usize] =
                mandel_iter(mandel_config.max_iter,
                    Complex64{re: mandel_config.re1 + ((x as f64) * mandel_config.x_step),
                              im: mandel_config.img1 + ((y as f64) * mandel_config.y_step)})
                          );
        }
    }
}

#[cfg(feature = "with_dispatch")]
fn dispatch_async_mandel(mandel_config: &MandelConfig, image: &mut [u32]){
    use dispatch::{Queue, QueueAttribute, Group};
    use std::sync::{Arc, Mutex};

    let queue = Queue::create("com.rust.mandel", QueueAttribute::Concurrent);
    let group = Group::create();

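    // The tasks write into a shared copy of the buffer, which is copied
    // back into `image` once all of them have finished.
    let shared = Arc::new(Mutex::new(image.to_vec()));

    for y in 0..mandel_config.img_size {
        for x in 0..mandel_config.img_size {
            let shared = shared.clone();
            let index = ((y * mandel_config.img_size) + x) as usize;
            let re = mandel_config.re1 + ((x as f64) * mandel_config.x_step);
            let im = mandel_config.img1 + ((y as f64) * mandel_config.y_step);
            let max_iter = mandel_config.max_iter;
            let c = Complex64{re: re, im: im};

            // One task per pixel; each task locks the buffer to store its result.
            group.async(&queue, move || {
                let data = mandel_iter(max_iter, c);
                let mut buffer = shared.lock().unwrap();
                buffer[index] = data;
            });
        }
    }

    // Wait for all tasks in queue group to finish
    group.wait();

    // Copy the results back; without this the caller's buffer stays unchanged.
    let result = shared.lock().unwrap();
    for (dst, src) in image.iter_mut().zip(result.iter()) {
        *dst = *src;
    }
}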
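
The async version above creates one task per pixel and takes the mutex once per pixel. A coarser variant of the same "start tasks, then wait for all of them" pattern is sketched below with the standard library only, not with libdispatch or the dispatch crate: each worker computes whole rows and sends them over a channel, so no shared lock is needed. `channel_mandel` and `workers` are hypothetical names, the view matches the configuration printed in the benchmark output, and whether coarser tasks would change the dispatch timings has not been measured here.

```rust
use std::sync::mpsc;
use std::thread;

/// Escape-time iteration count for the point c = (re, im).
fn mandel_iter(max_iter: u32, re: f64, im: f64) -> u32 {
    let (mut zr, mut zi) = (0.0_f64, 0.0_f64);
    let mut iter = 0;
    while iter < max_iter && zr * zr + zi * zi <= 4.0 {
        let t = zr * zr - zi * zi + re;
        zi = 2.0 * zr * zi + im;
        zr = t;
        iter += 1;
    }
    iter
}

/// Fills `image` (size x size pixels); worker threads send finished rows
/// over a channel and the calling thread copies them into place.
fn channel_mandel(image: &mut [u32], size: usize, max_iter: u32, workers: usize) {
    let workers = workers.max(1);
    let step = 3.0 / size as f64;
    let (tx, rx) = mpsc::channel::<(usize, Vec<u32>)>();
    for w in 0..workers {
        let tx = tx.clone();
        thread::spawn(move || {
            // Worker w takes rows w, w + workers, w + 2 * workers, ...
            for y in (w..size).step_by(workers) {
                let im = -1.5 + y as f64 * step;
                let row: Vec<u32> = (0..size)
                    .map(|x| mandel_iter(max_iter, -2.0 + x as f64 * step, im))
                    .collect();
                tx.send((y, row)).unwrap();
            }
        });
    }
    // Drop the original sender so the receive loop ends once all workers are done.
    drop(tx);
    for (y, row) in rx {
        image[y * size..(y + 1) * size].copy_from_slice(&row);
    }
}
```

The receive loop plays the role of `group.wait()`: it returns only after every worker has finished and dropped its sender.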

In any case, I'm tempted to test this, plus a direct unsafe{} FFI version, on FreeBSD, which natively has kqueue, libdispatch, clang and all the bells and whistles, to see how it fares.

Regards.

Thanks for trying it out!

Yes, that's also my impression, and I'll probably use Rayon in my future projects. I just need a way to limit the number of threads it uses. I'll have a look at the sources; it may not be too difficult to implement.

softfp may be sub-optimal, but still the scaling is nice!

Holy cow, never heard of libdispatch before, thanks a lot for implementing and trying it out!

Another suggestion: could you try the more high-level Rayon API (par_iter)? If it works, it should be even easier than the current version with join and perform the same.

Thanks @birkenfeld for pointing it out, I'll use Rayon's par_iter and other crates in the next release.