Apparent 1.70.0 performance regression

I compiled some older code with Rust 1.70.0 to compare against 1.69.0 and found an apparent performance regression.

primes_ssoz runs slower under 1.70.0, while [twin|cousin]primes_ssoz run slightly faster.

For primes_ssoz I get the best times compiled with just: cargo build --release.

For [twin|cousin]primes_ssoz I get best times compiled with:
RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=native" cargo build --release
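For anyone who wants to compare "best time" numbers across toolchains or flag sets on their own code, here is a minimal best-of-N timing sketch using only the standard library. It is not how the three programs measure themselves; `best_of` and `count_primes_below` are hypothetical names made up for this illustration, and the trial-division workload is just a stand-in.

```rust
use std::time::{Duration, Instant};

/// Run `work` `runs` times and return the shortest wall-clock duration.
/// With `runs == 0` this returns `Duration::MAX`.
fn best_of<F: FnMut()>(runs: u32, mut work: F) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        work();
        best = best.min(start.elapsed());
    }
    best
}

/// Hypothetical stand-in workload (NOT the SSoZ algorithm):
/// counts primes below `n` by naive trial division.
fn count_primes_below(n: u64) -> u64 {
    (2..n)
        .filter(|&k| (2..).take_while(|d| d * d <= k).all(|d| k % d != 0))
        .count() as u64
}

fn main() {
    let best = best_of(5, || {
        // black_box keeps the optimizer from discarding the unused result.
        std::hint::black_box(count_primes_below(200_000));
    });
    println!("best of 5 runs: {:.3?}", best);
}
```

Building the same file once per toolchain and flag combination and comparing the printed minimum gives a like-for-like number; taking the minimum rather than the mean reduces noise from CPU frequency scaling and other processes.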

Here are my Linux laptop specs.

➜  ~ inxi
CPU: 8-core AMD Ryzen 9 5900HX with Radeon Graphics (-MT MCP-)
speed/min/max: 1264/1200/4679 MHz Kernel: 6.1.34-pclos1 x86_64 Up: 1d 23h 8m
Mem: 10002.6/39489.1 MiB (25.3%) Storage: 953.87 GiB (13.4% used) Procs: 426
Shell: Zsh inxi: 3.3.26

Here are time comparisons for various input values

-----------------------|-------------------|-------------------|-------------------|
                       |   primes_ssoz     | twinprimes_ssoz   | cousinprimes_ssoz |
     Inputs            |-------------------|-------------------|-------------------|
                       |  1.69.0 |  1.70.0 |  1.69.0 |  1.70.0 |  1.69.0 |  1.70.0 |
-----------------------|---------|---------|---------|---------|---------|---------|
   1_000_000_000_000   |   19.2  |   24.3  |   10.1  |   10.0  |    10.1 |    10.0 |
-----------------------|---------|---------|---------|---------|---------|---------|
   5_000_000_000_000   |  107.0  |  127.7  |   59.1  |   56.8  |    59.2 |    56.1 |
-----------------------|---------|---------|---------|---------|---------|---------|
  10_000_000_000_000   |  254.6  |  311.1  |  133.1  |  127.4  |   129.8 |   126.4 |
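To put the table in relative terms, here is a small Rust sketch that computes the percent change from the 1.69.0 time to the 1.70.0 time. The helper name `percent_change` is illustrative only; the values are the primes_ssoz column copied from the table above. By this measure primes_ssoz is roughly 19 to 27 percent slower under 1.70.0, while the twin and cousin variants are about 1 to 5 percent faster.

```rust
/// Relative change from `old` to `new`, in percent.
/// Positive means `new` is larger (i.e. slower, for timings).
fn percent_change(old: f64, new: f64) -> f64 {
    (new - old) / old * 100.0
}

fn main() {
    // (input, 1.69.0 time, 1.70.0 time) for primes_ssoz, from the table above.
    let primes_ssoz = [
        (1_000_000_000_000u64, 19.2, 24.3),
        (5_000_000_000_000, 107.0, 127.7),
        (10_000_000_000_000, 254.6, 311.1),
    ];
    for (input, old, new) in primes_ssoz {
        println!("{input:>18}: {:+.1}%", percent_change(old, new));
    }
}
```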

Here are the source files:

primes_ssoz - Primes generator, multi-threaded using rayon, using SSoZ (Segmented Sieve of Zakiya), written in Rust · GitHub

twinprimes_ssoz - Twinprimes generator, multi-threaded using rayon, using SSoZ (Segmented Sieve of Zakiya), written in Rust · GitHub

cousinprimes_ssoz - Cousin Primes generator, multi-threaded, using SSoZ (Segmented Sieve of Zakiya), written in Rust · GitHub

Could be the upgrade to LLVM 16: https://github.com/rust-lang/rust/pull/109474

That's interesting!

Since about 90% of the code is the same in all 3 programs, it must be the way memory is allocated (laid out) at runtime. Even though [twin|cousin]primes_ssoz do more math (of the same type), they do fewer parallel steps. Could that be the difference, i.e. how memory is being used in parallel (as allocated by LLVM)?

Here's a comparison of the stripped binary sizes for each program for 1.69.0 and 1.70.0.

------------|-------------------|-------------------|-------------------|
            |   primes_ssoz     | twinprimes_ssoz   | cousinprimes_ssoz |
  Stripped  |-------------------|-------------------|-------------------|
  Binaries  |  1.69.0 |  1.70.0 |  1.69.0 |  1.70.0 |  1.69.0 |  1.70.0 |
     KB     |---------|---------|---------|---------|---------|---------|
            |  430.4  |  426.4  |  442.4  |  430.4  |  442.4  |  430.4  |
