Mandel-rust v0.3: more crates, more options

Hi everyone,

so here is the new version (v0.3) of mandel-rust.

Changes in this release:

  • Three new methods / crates: Rayon par_iter, Rust scoped pool, Jobsteal
  • Two new command line options: --bench and --no-ppm

Rayon is still the fastest (for this specific benchmark) but Jobsteal is closer than the others.

(The old topic for the previous version can be found here. I hope it was OK to create a new topic for a new version.)

For the next release I'll need to clean up / refactor the code and automate the benchmark part. And of course add new crates to test!

Just post comments, questions, results, etc. here.

It's strange that Jobsteal runs close to 2 times as fast as the serial version when using 1 thread.
Maybe it's using the main thread + 1 more?

@LilianMoraru Thanks for pointing it out. At first I thought it was just a copy-and-paste error, but the author of Jobsteal contacted me and confirmed what you suspected. He also sent me a pull request for a recursive join version of the mandelbrot function call, which is now on par with Rayon. I'll re-run the benchmark and update the website soon.
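To illustrate what "recursive join" and "main thread + 1 more" mean here, a minimal sketch using only the standard library (std::thread::scope, Rust 1.63+). This is not Jobsteal's or Rayon's API and not the code from the pull request; the names escape_time and render_rows, the depth parameter, and the fixed complex-plane window are assumptions made for the example. Each split hands one half to a spawned thread while the calling thread computes the other half itself, so a depth of 1 already keeps two threads busy.

```rust
use std::thread;

/// Number of iterations until the point (re, im) escapes, capped at max_iter.
fn escape_time(re: f64, im: f64, max_iter: u32) -> u32 {
    let (mut x, mut y) = (0.0_f64, 0.0_f64);
    let mut iter = 0;
    while iter < max_iter && x * x + y * y <= 4.0 {
        let x_new = x * x - y * y + re;
        y = 2.0 * x * y + im;
        x = x_new;
        iter += 1;
    }
    iter
}

/// Fill `buf` (whole rows of `width` pixels, starting at image row `first_row`)
/// by splitting the rows in half recursively until `depth` reaches 0.
/// The window re in [-2, 1], im in [-1.5, 1.5] is hard-coded for brevity.
fn render_rows(
    buf: &mut [u32],
    first_row: usize,
    width: usize,
    height: usize,
    max_iter: u32,
    depth: u32,
) {
    let rows = buf.len() / width;
    if depth == 0 || rows <= 1 {
        // Base case: compute this block serially on the current thread.
        for (i, px) in buf.iter_mut().enumerate() {
            let (row, col) = (first_row + i / width, i % width);
            let re = -2.0 + 3.0 * col as f64 / width as f64;
            let im = -1.5 + 3.0 * row as f64 / height as f64;
            *px = escape_time(re, im, max_iter);
        }
        return;
    }
    let mid = rows / 2;
    let (top, bottom) = buf.split_at_mut(mid * width);
    thread::scope(|s| {
        // One half goes to a new thread ...
        s.spawn(|| render_rows(top, first_row, width, height, max_iter, depth - 1));
        // ... while the calling thread works on the other half itself.
        render_rows(bottom, first_row + mid, width, height, max_iter, depth - 1);
    });
}

fn main() {
    let (width, height) = (64, 64);
    let mut image = vec![0u32; width * height];
    render_rows(&mut image, 0, width, height, 256, 2);
    let inside = image.iter().filter(|&&it| it == 256).count();
    println!("{} of {} points did not escape", inside, image.len());
}
```

A real work-stealing pool reuses a fixed set of worker threads instead of spawning one per split, so the overhead of this sketch is higher than that of either crate.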

I have tried to run your code and it gives some warnings on a library, using the latest compiler:

...\winapi-0.2.5\src\macros.rs:159:46: 159:47 warning: `$fieldtype:ty` is followed by `[`, which is not allowed for `ty` fragments
...\winapi-0.2.5\src\macros.rs:159     ($base:ident $field:ident: $fieldtype:ty [
                                                                                ^
...\winapi-0.2.5\src\macros.rs:159:46: 159:47 note: The above warning will be a hard error in the next release.
...\winapi-0.2.5\src\macros.rs:159     ($base:ident $field:ident: $fieldtype:ty [

My timings, on an oldish i7:

Configuration: re1: -2.00, re2: 1.00, img1: -1.50, img2: 1.50, max_iter: 2048, img_size: 1024, num_threads: 8
Time taken for this run (serial): 1436.62056 ms
Time taken for this run (scoped_thread_pool): 196.05883 ms
Time taken for this run (simple_parallel): 214.67468 ms
Time taken for this run (rayon_join): 184.67107 ms
Time taken for this run (rayon_par_iter): 199.02257 ms
Time taken for this run (rust_scoped_pool): 195.65157 ms
Time taken for this run (job_steal): 226.72547 ms
Time taken for this run (job_steal_join): 224.49086 ms

I think Cargo has not compiled the packages using a native target-cpu, so perhaps some performance has been left on the table.

@leonardo Thanks for trying it out! I'm working mostly on Linux, and don't see the warnings (rust 1.6 and nightly). IIRC you're working on Windows with rust nightly. I can try it on a Windows machine.

I've also updated mandelbrot-rust (not yet 0.4): it uses n-1 threads for jobsteal and now also includes kirk + crossbeam.

The method jobsteal + join is now the fastest, even faster than Rayon for most cases (at least on our machine).

More silly LLVM stuff:

https://gist.github.com/petevine/b70b6e5a434f23b40ab5

Probably material for another bug report - or maybe you have an idea what's happening?

@PeteVine Thanks for trying it out on various machines!

Rayon and jobsteal_join are doing fine, the others seem to have a higher overhead (which I also observed).
Some numbers have a high variance, which is why I'll add averaging inside the application for the next release. (I've done this manually before.)
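One way such averaging could look, as a rough sketch and not the code planned for mandel-rust: time the same workload several times and report the mean together with the sample standard deviation, so that noisy runs are visible. The helper names time_runs and mean_and_stddev_ms and the dummy workload are hypothetical.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Run `work` `runs` times and return the wall-clock time of each run.
fn time_runs<F: FnMut()>(runs: usize, mut work: F) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect()
}

/// Mean and sample standard deviation in milliseconds.
/// Expects at least one sample; with a single sample the deviation is 0.
fn mean_and_stddev_ms(samples: &[Duration]) -> (f64, f64) {
    let ms: Vec<f64> = samples.iter().map(|d| d.as_secs_f64() * 1000.0).collect();
    let n = ms.len() as f64;
    let mean = ms.iter().sum::<f64>() / n;
    let variance = if ms.len() > 1 {
        ms.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / (n - 1.0)
    } else {
        0.0
    };
    (mean, variance.sqrt())
}

fn main() {
    // Dummy workload standing in for one mandelbrot run.
    let samples = time_runs(5, || {
        let sum: u64 = (0..1_000_000u64).sum();
        black_box(sum);
    });
    let (mean, stddev) = mean_and_stddev_ms(&samples);
    println!("mean: {:.5} ms, stddev: {:.5} ms over {} runs", mean, stddev, samples.len());
}
```

Reporting the minimum as well as the mean is a common alternative, since the fastest run is usually the one least disturbed by other processes.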

You may be wondering why an old Pentium can compete with a newer CPU: there is still work going on to support SIMD and other vectorisation features in the Rust compiler; Huon Wilson has already done some great work.

But you're right, I will add this to my TODO list.

That was just one machine though, and it's a little puzzling that the generated code stops scaling because of supposedly better optimizations.
Optimizing for Core2 should have been the fastest (and it almost scales again there). Maybe autovectorization is trying too hard with SSE2 available, also on x86_64?

EDIT:
Hey, that was it: -C target-feature=-sse2 restores the expected performance. I wonder whether on newer cores you could get a small boost too, or whether there's no difference. Care to try?
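When comparing builds made with different codegen flags (for example -C target-cpu=native or -C target-feature=-sse2), it can help to have the binary report which SIMD features it was actually compiled with. A small sketch, assuming an x86 target; the function name enabled_simd_features is made up for the example, and the list of features checked is deliberately short.

```rust
/// Names of a few x86 SIMD features that were enabled at compile time.
/// `cfg!(target_feature = "...")` is evaluated by the compiler, so the
/// result reflects the flags used for this build, not the CPU it runs on.
fn enabled_simd_features() -> Vec<&'static str> {
    let mut features = Vec::new();
    if cfg!(target_feature = "sse2") {
        features.push("sse2");
    }
    if cfg!(target_feature = "avx") {
        features.push("avx");
    }
    if cfg!(target_feature = "avx2") {
        features.push("avx2");
    }
    features
}

fn main() {
    println!("compile-time SIMD features: {:?}", enabled_simd_features());
}
```

Printing this next to the timings makes it easier to tell afterwards which flags a given result belongs to.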

So disabling SSE2 did improve the performance?
That's strange, but I'll give it a try. Thanks for finding it out!

Strange indeed, I opened an issue against LLVM.
Even though Core2 x86_64 is affected too, it's not possible to disable SSE2 on 64-bit:

  Compiling strsim v0.4.0
LLVM ERROR: SSE2 register return with SSE2 disabled
Could not compile `strsim`.

So, if you're going to test a newer core, you'll have to run in 32-bit mode.