New version of mandel-rust: uses Rayon, added benchmark

Thanks @birkenfeld for pointing it out, I'll use Rayon's par_iter and other crates in the next release.

I was thinking of porting the code to use par_iter (though I hadn't yet read into it to see how easy that would be). Not sure if I'll have time. In any case, I want to add some version of this in the rayon repository as the start of a local benchmark suite.

It would have been faster to use a binary distribution from here.

@willi_kappler I repeated the benchmark a few times (working from ramdisk) and it ran out of space - cleaning the ppm files looks like a great idea. Maybe with a --bench switch doing num_cpus detection too?

Here's another armv7 result (NEON makes no difference):

Configuration: re1: -2.00, re2: 1.00, img1: -1.50, img2: 1.50, max_iter: 2048, img_size: 1024, num_threads: 8
Time taken for this run (serial_mandel): 7096.58934 ms
Time taken for this run (parallel_mandel): 1796.89844 ms
Time taken for this run (simple_parallel_mandel): 1803.06047 ms
Time taken for this run (rayon_mandel): 1820.46755 ms

Running on an Odroid C1 with 8 threads.

Where is that further speedup coming from on x86_64?

Hi @nikomatsakis,

I'm currently working on par_iter and will release a new version of mandel-rust today. Feel free to take / use the code for your benchmarks.

@PeteVine: I've added a --no_ppm switch that disables image generation and a --bench switch that does autodetection with num_cpus.
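In case it helps anyone reading along, here's a rough sketch of what that autodetection could look like (not the actual mandel-rust code; the hand-rolled flag scan and the default of 2 threads are assumptions for illustration):

```rust
// Sketch only: pick the thread count automatically when --bench is given.
extern crate num_cpus;

use std::env;

fn main() {
    let args: Vec<String> = env::args().collect();
    let bench = args.iter().any(|a| a == "--bench");
    let no_ppm = args.iter().any(|a| a == "--no_ppm");

    // With --bench, use one thread per logical core instead of a fixed default.
    let num_threads = if bench { num_cpus::get() as u32 } else { 2 };

    println!("num_threads: {}, write ppm: {}", num_threads, !no_ppm);
}
```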


Well, I do have a C1 and U3 in my closet. Probably time to get them out and put them to work :slightly_smiling:

Just let us know when you've pushed the changes. Should there be a small boost from rayon or is it totally platform dependent?

The sadly discontinued U3 is still the best (problem-free) Odroid to this day. You should definitely use it more!

The new version (v0.3) has been published and I've created a new topic.

@PeteVine: num_cpus is platform independent (AFAIK) and Rayon already uses all cores by default (currently there is no option to set the maximum number of threads that Rayon is allowed to use, but there is a pull request on the way).

But for v0.3 I'm using a second way to run things in parallel with Rayon: besides join() there is also par_iter(), which is a bit faster most of the time (for the mandel benchmark).
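For anyone curious what the two variants look like side by side, here's a rough sketch against a recent Rayon release (not the actual mandel-rust v0.3 code; the coordinate constants and helper names are made up for illustration):

```rust
// Sketch: the same row-wise Mandelbrot computation, once with rayon::join
// and once with a parallel iterator.
extern crate rayon;

use rayon::prelude::*;

const MAX_ITER: u32 = 2048;

fn mandel_pixel(re: f64, im: f64) -> u32 {
    let (mut zr, mut zi) = (0.0f64, 0.0f64);
    let mut iter = 0;
    while zr * zr + zi * zi <= 4.0 && iter < MAX_ITER {
        let tmp = zr * zr - zi * zi + re;
        zi = 2.0 * zr * zi + im;
        zr = tmp;
        iter += 1;
    }
    iter
}

// One image row: re in [-2.0, 1.0], im in [-1.5, 1.5], matching the
// configuration shown above.
fn row(y: usize, size: usize) -> Vec<u32> {
    let im = -1.5 + 3.0 * (y as f64) / (size as f64);
    (0..size)
        .map(|x| mandel_pixel(-2.0 + 3.0 * (x as f64) / (size as f64), im))
        .collect()
}

// Variant 1: divide-and-conquer with rayon::join.
fn rayon_join_mandel(rows: &mut [Vec<u32>], y0: usize, size: usize) {
    if rows.len() <= 1 {
        if let Some(r) = rows.first_mut() {
            *r = row(y0, size);
        }
        return;
    }
    let mid = rows.len() / 2;
    let (lo, hi) = rows.split_at_mut(mid);
    rayon::join(
        || rayon_join_mandel(lo, y0, size),
        || rayon_join_mandel(hi, y0 + mid, size),
    );
}

// Variant 2: the same thing with a parallel iterator.
fn rayon_par_iter_mandel(size: usize) -> Vec<Vec<u32>> {
    (0..size).into_par_iter().map(|y| row(y, size)).collect()
}
```

The par_iter() variant leaves the splitting and load balancing entirely to Rayon, which is presumably why it can edge out the hand-rolled join() recursion.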

I was referring to the fact that the x86_64 results were showing a small boost from using Rayon (compared to the other implementations at the same number of threads), whereas on ARM there was no difference.

@PeteVine Ah sorry, I misunderstood. I don't think Rayon is specifically targeted at x86_64, but LLVM does have some special optimizations for x86_64 that are missing on ARM. It could also be that the work-stealing algorithm is friendlier to the x86_64 (cache) architecture (Intel bought Cilk and developed it further into Cilk Plus; Rayon and jobsteal are based on these ideas).

Definitely related to the maturity of the ARM backend.

Did a test run of this on my server machine.

It's an AMD Opteron(TM) 6272 with 4 CPUs and 16 cores each, so 64 cores in total, and 128 GB of RAM. This test was run with --num_threads=64.

Configuration: re1: -2.00, re2: 1.00, img1: -1.50, img2: 1.50, max_iter: 2048, img_size: 1024, num_threads: 64
Time taken for this run (serial): 2196.44017 ms
Time taken for this run (scoped_thread_pool): 142.00686 ms
Time taken for this run (simple_parallel): 133.63295 ms
Time taken for this run (rayon_join): 107.26168 ms
Time taken for this run (rayon_par_iter): 85.79211 ms
Time taken for this run (rust_scoped_pool): 121.52345 ms
Time taken for this run (job_steal): 245.41874 ms
Time taken for this run (job_steal_join): 181.38436 ms
Time taken for this run (kirk_crossbeam): 113.94865 ms

@emoon Cool, thanks for posting your results! That's a nice machine :wink:
Rayon is faster than jobsteal on your machine. Have you tried the stable version (tagged v0.3)?

I've changed the configuration for jobsteal in current git, the author contacted me and opened an issue. If you have time (and motivation :slightly_smiling: ) could you please try the current git version?

If you already ran the git version, it seems like Rayon gets more efficient as the number of cores increases. (Or there may be some other effects I'll have to figure out.)

It's a 64-core machine, and computing a Mandelbrot set image is an embarrassingly parallel task (though the file save is serial), yet with rayon_join it takes 107 ms, while on my oldish i7 laptop with 4 cores (plus hyperthreading) rayon_join takes 184 ms. I expected 10-20 ms on @emoon's machine :slight_smile:

Newer version running on a Cortex-A5 @ 1.7 GHz:

$ target/release/mandel --bench --no_ppm
Configuration: re1: -2.00, re2: 1.00, img1: -1.50, img2: 1.50, max_iter: 2048, img_size: 1024, num_threads: 4
Time taken for this run (serial): 7086.33729 ms
Time taken for this run (scoped_thread_pool): 1800.55646 ms
Time taken for this run (simple_parallel): 1811.27751 ms
Time taken for this run (rayon_join): 1812.63752 ms
Time taken for this run (rayon_par_iter): 1832.46361 ms
Time taken for this run (rust_scoped_pool): 1796.70744 ms
Time taken for this run (job_steal): 1806.25949 ms
Time taken for this run (job_steal_join): 1816.23053 ms
Time taken for this run (kirk_crossbeam): 1799.96446 ms

@emoon What about your big.LITTLE setup? :slight_smile:

Thanks :slight_smile: Yeah it's a nice box.

This was the current git version I tested.

Regarding scaling: if it scaled perfectly it should be around ~30 ms (the serial time divided by 64 cores), so there seems to be some overhead here that keeps it from scaling perfectly.

Also, FPU performance is a bit weak on this CPU compared to Intel CPUs, while integer performance is usually better (so compiling code in parallel on this machine is quite fast).

Yes, that's true, and as @emoon pointed out there is some overhead that causes the slowdown. For the next release I'll change the default settings to a bigger image size and more iterations. The calculation of the Mandelbrot set will take more time, but the effect of the overhead should be much smaller.

A friend of mine is doing this for C++, so it would be nice to see how Rust (+ LLVM) compares to g++ / clang.

Now I have a doubt... is Rayon's join() function blocking? I mean, does it return only when both queued tasks have finished, or does it just enqueue the tasks and let the thread pool work behind the scenes while the code keeps going?

Yes, it blocks, and this is what allows capturing values by reference in the closure arguments of join().


It is blocking, and it even does something called "potential parallelism": if all your cores are busy, it executes both closures serially. See this blog post by @nikomatsakis (the author of Rayon).
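A tiny illustrative sketch (not from the benchmark code) of what the blocking behaviour buys you:

```rust
extern crate rayon;

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();
    let (left, right) = data.split_at(data.len() / 2);

    // Both closures borrow slices of `data` by reference.
    let (sum_left, sum_right) = rayon::join(
        || left.iter().sum::<u64>(),
        || right.iter().sum::<u64>(),
    );

    // Safe to use the results (and `data`) here: join() only returns once
    // both closures have finished, possibly running one of them inline if
    // all worker threads are busy ("potential parallelism").
    println!("total = {}", sum_left + sum_right);
}
```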
