Release code slower than debug when using rayon

To learn Rust I’m doing a few small project, the last was an automatic lawn mover simulation, GitHub - johan162/gridcover

This had a very surprising result: The debug build is almost 8 times faster execution speed compared with release build if I enable parallelism through rayon admittedly with simulation parameters that gives no benefit for a parallel approach (but all the overhead).

The result from the debug build using rayon

Simulation Results:
  Simulation completed in  : 00:24
  ...

and from the release build using rayon

Simulation Results:
  Simulation completed in  : 03:52
  ...

If I disable the use of par_iter the release build completes in 2s and the debug build in 18s.

The parameters in this particular case gives no benefit of a parallell iterator so the overhead makes it unsuitable which I fully get. However, I don't get why (in principle) the release build would be soo much slower?

Are there some principle explanation for this behavior (or will it be down to nitty gritty rayon impl. details)?

Cheers,

P.S. Parameters used was:

$ gridcover -r 4 -w 500 -g 400 -s 0.5 -x 1 -y 1 -v 1 -c 99 -S 99 -C false

What kind of system are you running on? Are you sure you didn't just mix up your builds or results?

Using Rust 1.87.0 on Fedora 42 with AMD 9800X3D, I got these results from /bin/time:

  • debug 88.48user 6.65system 0:07.08elapsed 1342%CPU
  • release 15.80user 10.65system 0:01.98elapsed 1331%CPU

No, I'm not mixing them up :slight_smile: That was the first thing I double and tripple checked.
As you see I get the expected result once I add the option that uses the linear (and not the parallel version) of the simulation code.

My main dev. environment is a 2019 MacPro, Intel Xeon18-cores (36 threads) which might be why the overhead is killing the parallel version (when it really is just overhead for no gain).

HOWEVER If I try it on a Linux server I get the expected result (!!)

First the Linux debug version (notice the debug path)

./target/debug/gridcover -r 4 -w 500 -g 400 -s 0.5 -x 1 -y 1 -v 1 -c 99 -S 99 -C false
Coverage: 99.00% (198000/200000 cells covered) - Bounces: 737 - Sim-Time: 29:01:20
Simulation Results:
  Simulation completed in  : 00:21
  Simulated elapsed time   : 29:01:20
  Coverage                 : 99.0% (198000 out of 200000 cells)
  Distance traveled:       : 104481.0 units
  Number of bounces:       : 737
  Total Simulation Steps   : 232,180 (Using step size = 0.45 units)

then the Linux release version

./target/release/gridcover -r 4 -w 500 -g 400 -s 0.5 -x 1 -y 1 -v 1 -c 99 -S 99 -C false
Coverage: 99.00% (198000/200000 cells covered) - Bounces: 737 - Sim-Time: 29:01:20
Simulation Results:
  Simulation completed in  : 00:05
  Simulated elapsed time   : 29:01:20
  Coverage                 : 99.0% (198000 out of 200000 cells)
  Distance traveled:       : 104481.0 units
  Number of bounces:       : 737
  Total Simulation Steps   : 232,180 (Using step size = 0.45 units)

So this is something specific for a Mac !!

Just for completeness, the exact same output on my Mac Pro,

The Mac Pro debug version

./target/debug/gridcover -r 4 -w 500 -g 400 -s 0.5 -x 1 -y 1 -v 1 -c 99 -S 99 -C false
Coverage: 99.00% (198000/200000 cells covered) - Bounces: 737 - Sim-Time: 29:01:20
Simulation Results:
  Simulation completed in  : 00:28
  Simulated elapsed time   : 29:01:20
  Coverage                 : 99.0% (198000 out of 200000 cells)
  Distance traveled:       : 104481.0 units
  Number of bounces:       : 737
  Total Simulation Steps   : 232,180 (Using step size = 0.45 units)

followed by the Mac Pro release version

./target/release/gridcover -r 4 -w 500 -g 400 -s 0.5 -x 1 -y 1 -v 1 -c 99 -S 99 -C false
Coverage: 99.00% (198000/200000 cells covered) - Bounces: 737 - Sim-Time: 29:01:20
Simulation Results:
  Simulation completed in  : 03:15
  Simulated elapsed time   : 29:01:20
  Coverage                 : 99.0% (198000 out of 200000 cells)
  Distance traveled:       : 104481.0 units
  Number of bounces:       : 737
  Total Simulation Steps   : 232,180 (Using step size = 0.45 units)

So, to summarize:

Enabling parallel processing (par_iter)

  debug release
Linux 00:21 00:05
Mac Pro 00:28 03:15
Mac M1 00:12 00:05

Given that both the Mac Pro and Linux uses Intel Xeon CPU this huge, huge, difference is astonishing.

Disabling parallell processing ( no par_iter)

  debug release
Linux 00:14 00:02
Mac Pro 00:18 00:02
Mac M1 00:12 00:00

Sidenote: The compilation time on Linux is ~40% faster!

I guess things like L1 cache trashing could play a role but it doesn't explain why the Linux server doesn't exhibit this behavior. Fully explaning this would be some serious detective work that would take more time than I have.

Edit: Added Row for Mac M1

That's wild. I don't have any Mac to try on, but maybe Xcode's Instruments app can give you a clue?

You can profile it with

I think you can shared the collected profiles as well.

.

Profiling is the right way to go for this, and it is useless to speculate. That said, this smells like lock contention, cache line contention or false sharing.

One thing worth trying is using one of the many ways in rayon to chunk the parallel iterator into bigger blocks.

But it could also be that you have contention on some lock unrelated to rayon, for example in the memory allocator of the OS. (On Linux you could use jemalloc or mimalloc as alternative global allocators, but I have no idea if that is supported on OS X.)

And even apart from that there are many other locations that could contention on locks or cache lines. Both in your code or in the OS. Or as a result of how the OS allocates memory on behalf of your code. Profiling really is the only way to dig into this without blindly trying things.