Mandel-rust v0.3: more crates, more options

willi_kappler · January 30, 2016, 12:01am

Hi everyone,

so here is the new version (v0.3) of mandel-rust.

Changes in this release:

Three new methods / crates: Rayon par_iter, Rust scoped pool, Jobsteal
Two new command line options: --bench and --no-ppm

Rayon is still the fastest (for this specific benchmark) but Jobsteal is closer than the others.

(The old topic for the previous version can be found here, I hope it was OK to create a new topic for a new version)

For the next release I'll need to clean up / refactor the code and automate the benchmark part. And of course add new crates to test!

Just post comments, questions, results, etc. here.

LilianMoraru · January 30, 2016, 7:57am

It's strange that Jobsteal runs close to 2 times as fast as the serial version when using 1 thread.
Maybe it's using the main thread + 1 more?

willi_kappler · January 31, 2016, 10:12pm

@LilianMoraru Thanks for pointing it out. At first I thought it was just a copy-and-paste error, but the author of Jobsteal contacted me and confirmed what you suspected. He also sent me a pull request for a recursive join version of the mandelbrot function call, which is now on par with Rayon. I'll re-run the benchmark and update the website soon.
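For readers unfamiliar with the recursive join idea, here is a minimal sketch. It is not the code from the pull request and does not use Jobsteal or Rayon: it uses `std::thread::scope` from the standard library as a stand-in for a work-stealing `join`, and the names `escape_time`, `render_rows`, `render_join`, `render` and the `depth` parameter are all made up for illustration. It assumes `width > 0`. Note that the calling thread computes one half itself, which is also why a pool with "1 thread" can end up with two threads doing work.

```rust
use std::thread;

/// Iterations until c = (re, im) escapes the radius-2 disc, capped at `max_iter`.
fn escape_time(re: f64, im: f64, max_iter: u32) -> u32 {
    let (mut x, mut y) = (0.0_f64, 0.0_f64);
    let mut i = 0;
    while i < max_iter && x * x + y * y <= 4.0 {
        let t = x * x - y * y + re;
        y = 2.0 * x * y + im;
        x = t;
        i += 1;
    }
    i
}

/// Serial kernel (hypothetical helper): fills a slab of whole image rows.
/// The region is re in [-2, 1], im in [-1.5, 1.5], as in the benchmark config.
fn render_rows(rows: &mut [u32], first_row: usize, width: usize, height: usize, max_iter: u32) {
    for (k, px) in rows.iter_mut().enumerate() {
        let (col, row) = (k % width, first_row + k / width);
        let re = -2.0 + 3.0 * col as f64 / width as f64;
        let im = -1.5 + 3.0 * row as f64 / height as f64;
        *px = escape_time(re, im, max_iter);
    }
}

/// Recursive join: split the slab in half, hand one half to another thread,
/// compute the other half on the current thread, then wait for both.
fn render_join(rows: &mut [u32], first_row: usize, width: usize, height: usize, max_iter: u32, depth: u32) {
    let n_rows = rows.len() / width;
    if depth == 0 || n_rows < 2 {
        render_rows(rows, first_row, width, height, max_iter);
        return;
    }
    let mid = n_rows / 2;
    let (top, bottom) = rows.split_at_mut(mid * width);
    thread::scope(|s| {
        s.spawn(move || render_join(top, first_row, width, height, max_iter, depth - 1));
        // The current thread does not idle: it takes the second half.
        render_join(bottom, first_row + mid, width, height, max_iter, depth - 1);
    });
}

/// `depth == 0` gives the serial version; each extra level doubles the tasks.
fn render(width: usize, height: usize, max_iter: u32, depth: u32) -> Vec<u32> {
    let mut image = vec![0u32; width * height];
    render_join(&mut image, 0, width, height, max_iter, depth);
    image
}

fn main() {
    let image = render(256, 256, 256, 3);
    let inside = image.iter().filter(|&&n| n == 256).count();
    println!("{} of {} points did not escape", inside, image.len());
}
```

A real work-stealing pool reuses a fixed set of worker threads instead of spawning a new OS thread per split, so this sketch only shows the shape of the recursion, not its performance.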

leonardo · February 1, 2016, 12:22am

I have tried to run your code and it gives some warnings on a library, using the latest compiler:

...\winapi-0.2.5\src\macros.rs:159:46: 159:47 warning: `$fieldtype:ty` is followed by `[`, which is not allowed for `ty` fragments
...\winapi-0.2.5\src\macros.rs:159     ($base:ident $field:ident: $fieldtype:ty [
                                                                                                                                            ^
...\winapi-0.2.5\src\macros.rs:159:46: 159:47 note: The above warning will be a hard error in the next release.
...\winapi-0.2.5\src\macros.rs:159     ($base:ident $field:ident: $fieldtype:ty [

My timings, on an oldish i7:

Configuration: re1: -2.00, re2: 1.00, img1: -1.50, img2: 1.50, max_iter: 2048, img_size: 1024, num_threads: 8
Time taken for this run (serial): 1436.62056 ms
Time taken for this run (scoped_thread_pool): 196.05883 ms
Time taken for this run (simple_parallel): 214.67468 ms
Time taken for this run (rayon_join): 184.67107 ms
Time taken for this run (rayon_par_iter): 199.02257 ms
Time taken for this run (rust_scoped_pool): 195.65157 ms
Time taken for this run (job_steal): 226.72547 ms
Time taken for this run (job_steal_join): 224.49086 ms

I think Cargo has not compiled the packages using a native target-cpu, so perhaps some performance has been left on the table.

willi_kappler · February 1, 2016, 11:44pm

@leonardo Thanks for trying it out! I'm working mostly on Linux, and don't see the warnings (rust 1.6 and nightly). IIRC you're working on Windows with rust nightly. I can try it on a Windows machine.

I've also updated mandelbrot-rust (not yet 0.4): it uses n-1 threads for jobsteal and now also includes kirk + crossbeam.

The method jobsteal + join is now the fastest, even faster than Rayon for most cases (at least on our machine).

PeteVine · February 9, 2016, 2:50am

More silly LLVM stuff:

https://gist.github.com/petevine/b70b6e5a434f23b40ab5

Probably material for another bug report - or maybe you have an idea what's happening?

willi_kappler · February 9, 2016, 7:35am

@PeteVine Thanks for trying it out on various machines!

Rayon and jobsteal_join are doing fine, the others seem to have a higher overhead (which I also observed).
Some numbers have a high variance, which is why I'll add averaging inside the application for the next release (I've done this manually before).
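As a rough idea of what such averaging could look like (a sketch only, not the code that will go into mandel-rust): time each method several times with `std::time::Instant` and report the mean together with the sample standard deviation. The helper names `time_runs` and `mean_and_std_dev` and the dummy workload are made up for this example.

```rust
use std::time::Instant;

/// Runs `f` `runs` times and returns each wall-clock duration in milliseconds.
fn time_runs<F: FnMut()>(runs: usize, mut f: F) -> Vec<f64> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed().as_secs_f64() * 1000.0
        })
        .collect()
}

/// Mean and sample standard deviation (n - 1 in the denominator).
/// Panics on an empty slice; returns a deviation of 0.0 for a single sample.
fn mean_and_std_dev(samples: &[f64]) -> (f64, f64) {
    assert!(!samples.is_empty(), "need at least one sample");
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    if samples.len() < 2 {
        return (mean, 0.0);
    }
    let variance = samples.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, variance.sqrt())
}

fn main() {
    // Dummy workload standing in for one mandelbrot run.
    let mut checksum = 0u64;
    let samples = time_runs(10, || {
        for i in 0..1_000_000u64 {
            checksum = checksum.wrapping_add(i * i);
        }
    });
    let (mean, std_dev) = mean_and_std_dev(&samples);
    // Printing the checksum keeps the loop from being optimized away entirely.
    println!("mean: {:.3} ms, std dev: {:.3} ms (checksum {})", mean, std_dev, checksum);
}
```

Reporting the deviation next to the mean makes it easier to see whether two methods really differ or are within noise of each other.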

You may be wondering why an old Pentium can compete with a newer CPU: there is still work going on to support SIMD and other vectorisation features in the Rust compiler. Huon Wilson has already done some great work on this.

But you're right, I will add this to my TODO list.

PeteVine · February 9, 2016, 1:05pm

That was just one machine though, and it's a little puzzling that the generated code stops scaling because of supposedly better optimizations.
Optimizing for Core2 should have been the fastest (and it almost scales again there). Maybe autovectorization is trying too hard when SSE2 is available, also on x86_64?

EDIT:
Hey, that was it: -C target-feature=-sse2 restores the expected performance. I wonder if on newer cores you could get a small boost too or there's no difference. Care to try?
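If you want to double-check which features a binary was actually built with (for example after passing `-C target-feature=-sse2` or `-C target-cpu=native` through `RUSTFLAGS`), a small sketch like the following can help. It only uses `cfg!(target_feature = ...)` and `std::is_x86_feature_detected!` from the standard library; the function name `compiled_with_sse2` is made up for this example.

```rust
/// True if the compiler was allowed to emit SSE2 instructions for this build.
fn compiled_with_sse2() -> bool {
    cfg!(target_feature = "sse2")
}

fn main() {
    // Compile-time view: what rustc/LLVM could assume when generating code.
    println!("built with sse2: {}", compiled_with_sse2());
    println!("built with avx2: {}", cfg!(target_feature = "avx2"));

    // Runtime view: what the CPU running this binary supports (x86 only).
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        println!("cpu has sse2: {}", std::is_x86_feature_detected!("sse2"));
        println!("cpu has avx2: {}", std::is_x86_feature_detected!("avx2"));
    }
}
```

The two views can differ: a default build may report `avx2` as unavailable at compile time even on a CPU that supports it, which is the case where a native target-cpu might help.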

willi_kappler · February 10, 2016, 9:35pm

So disabling SSE2 did improve the performance?
That's strange, but I'll give it a try. Thanks for finding it out!

PeteVine · February 10, 2016, 10:50pm

Strange indeed, I opened an issue against LLVM.
Even though Core2 x86_64 is affected too, it's not possible to disable SSE2 on 64-bit:

  Compiling strsim v0.4.0
LLVM ERROR: SSE2 register return with SSE2 disabled
Could not compile `strsim`.

So, if you're going to test a newer core, you'll have to run in 32-bit mode.
