Unexpected parallel sort performance

Hello Rust community!

I would love to receive some thoughts on why a safe parallel mergesort performs better than an unsafe implementation.

I compare three simple parallel merge sorts (all implementations and the benchmark can be found here):

  1. Slices, which passes two halves of an array for parallel processing using split_at_mut
  2. Slices Unchecked, which does the same thing but uses unchecked operations
  3. Single Array, which operates on a raw *mut i32 and therefore does not need any splitting
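For readers who don't want to open the repo: the Slices variant is presumably along these lines. This is my own minimal sketch, not the linked code, and it uses std::thread::scope rather than whatever thread pool the real implementation uses:

```rust
// Sequential merge of two sorted runs into `out` (out.len() == left.len() + right.len()).
fn merge(left: &[i32], right: &[i32], out: &mut [i32]) {
    let (mut i, mut j) = (0, 0);
    for slot in out.iter_mut() {
        if j >= right.len() || (i < left.len() && left[i] <= right[j]) {
            *slot = left[i];
            i += 1;
        } else {
            *slot = right[j];
            j += 1;
        }
    }
}

fn parallel_merge_sort(data: &mut [i32], threshold: usize) {
    let n = data.len();
    if n <= 1 {
        return;
    }
    let mid = n / 2;
    let (left, right) = data.split_at_mut(mid);
    if n <= threshold {
        // below the cutoff, recurse sequentially
        parallel_merge_sort(left, threshold);
        parallel_merge_sort(right, threshold);
    } else {
        // split_at_mut yields two disjoint &mut slices, so scoped threads
        // can sort the halves concurrently without any unsafe code
        std::thread::scope(|s| {
            s.spawn(|| parallel_merge_sort(left, threshold));
            s.spawn(|| parallel_merge_sort(right, threshold));
        });
    }
    // merge the two sorted halves through a temporary buffer
    let mut tmp = vec![0; n];
    merge(&data[..mid], &data[mid..], &mut tmp);
    data.copy_from_slice(&tmp);
}
```

Passing threshold equal to the array length makes the whole run sequential, which matches how the benchmarks below distinguish the two cases.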

What I expect:
In both the parallel and the sequential case, Slices is slower than both Slices Unchecked and Single Array.
What I observed after benchmarking:

  • In the sequential case, Slices Unchecked was 2% faster and Single Array was 6.6% faster
  • In the parallel case, Slices Unchecked was 4.7% faster, but Single Array was 14.8% SLOWER!

Why does the Single Array method perform worse than Slices when I switch to parallel execution? Single Array doesn't do any index bounds checking, and neither does Slices Unchecked, so I would expect them to perform similarly, as in the sequential case. What could be the reasons?
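For context, here is a hypothetical sketch of what a Single Array style implementation could look like. This is not the author's actual code (that is in the linked repo); the SendPtr wrapper and all names are my own, and I've omitted the threshold cutoff for brevity:

```rust
// A shared raw pointer; the wrapper exists only to make it Send/Sync so
// scoped threads will accept it. Soundness rests on the two recursive
// calls touching disjoint index ranges.
#[derive(Clone, Copy)]
struct SendPtr(*mut i32);
unsafe impl Send for SendPtr {}
unsafe impl Sync for SendPtr {}

// Sort the half-open range [lo, hi) of the array behind `p`.
fn sort_range(p: SendPtr, lo: usize, hi: usize) {
    let len = hi - lo;
    if len <= 1 {
        return;
    }
    let mid = lo + len / 2;
    std::thread::scope(|s| {
        s.spawn(move || sort_range(p, lo, mid));
        s.spawn(move || sort_range(p, mid, hi));
    });
    // merge [lo, mid) and [mid, hi) through a temporary buffer,
    // with no bounds checks on the raw-pointer accesses
    unsafe {
        let mut tmp = Vec::with_capacity(len);
        let (mut i, mut j) = (lo, mid);
        while i < mid || j < hi {
            if j >= hi || (i < mid && *p.0.add(i) <= *p.0.add(j)) {
                tmp.push(*p.0.add(i));
                i += 1;
            } else {
                tmp.push(*p.0.add(j));
                j += 1;
            }
        }
        for (k, v) in tmp.into_iter().enumerate() {
            *p.0.add(lo + k) = v;
        }
    }
}
```

One hedged observation, if the real code is shaped like this: `&mut` slices carry aliasing (noalias) information that raw pointers do not, so the compiler may optimize the slice versions more aggressively; whether that explains the parallel gap would need profiling to confirm.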

More detailed benchmarking results

Sequential case

| Array Size | Slices (baseline) | Single Array | Diff | Slices Unchecked | Diff |
|---|---|---|---|---|---|
| 512 | 16µs | 15µs | -7% | 16µs | 0% |
| 2048 | 77µs | 70µs | -8% | 76µs | -1% |
| 16384 | 761µs | 687µs | -9% | 730µs | -4% |
| 65536 | 3290µs | 3000µs | -8% | 3195µs | -2% |
| 250k | 14ms | 13ms | -4% | 13.9ms | -0.7% |
| 1kk | 62ms | 57ms | -8% | 60ms | -3% |
| 5kk | 332ms | 325ms | -2% | 322ms | -3% |
| 10kk | 737ms | 684ms | -7% | 716ms | -3% |

Parallel case

| Array Size | Slices (baseline) | Single Array | Diff | Slices Unchecked | Diff |
|---|---|---|---|---|---|
| 250k | 15ms | 18ms | +20% | 15.2ms | +1% |
| 500k | 26ms | 31ms | +19% | 25.8ms | -1% |
| 2kk | 60ms | 69ms | +15% | 58.6ms | -2% |
| 8kk | 141ms | 156ms | +10% | 133ms | -6% |
| 16kk | 210ms | 235ms | +11% | 192ms | -9% |
| 64kk | 642ms | 748ms | +16% | 599ms | -7% |
| 100kk | 990ms | 1126ms | +13% | 901ms | -9% |

Some details about how I benchmark

Disclaimer: I am very new to benchmarking so please let me know if I did something wrong :slight_smile:

I run SLURM tasks in a Singularity container with Rust 1.88 installed,
requesting 32G of memory and exclusive mode (#SBATCH --exclusive).

Framework

I use Criterion; see the benchmarking code for full details, but in short:

  • array sizes from 512 to 10kk for sequential, and from 250k to 100kk for parallel
  • use the default Criterion config with sample_size=20 and measurement_time=50s
  • fresh random array for every iteration
  • do not include random input generation time in measurement
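The last two bullets are the easiest part to get wrong, so here is a std-only sketch of the measurement discipline they describe (Criterion's iter_batched handles this for you; the xorshift generator below is just my stand-in to keep the sketch free of external crates):

```rust
use std::time::{Duration, Instant};

// Generate the input outside the timed region, then time only the sort.
fn time_sort(n: usize) -> Duration {
    // cheap deterministic xorshift "random" input; the real benchmark
    // presumably uses a proper RNG
    let mut state: u64 = 0x243F_6A88_85A3_08D3;
    let mut data: Vec<i32> = (0..n)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            state as i32
        })
        .collect();
    let start = Instant::now(); // generation time is excluded from here on
    data.sort();
    let elapsed = start.elapsed();
    assert!(data.windows(2).all(|w| w[0] <= w[1]));
    elapsed
}
```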

Machine

Ubuntu 20.04.6 LTS

> uname -a
Linux xcs-vnode-1 5.4.0-208-generic #228-Ubuntu SMP Fri Feb 7 19:41:33 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
  • CPUs: 60
  • Threads per core: 1
  • CPU MHz: 2095.062
  • NUMA node0 CPUs: 0-59
> systemd-detect-virt 
kvm

Notes

I am not trying to write a super fast parallel merge sort; I was just curious why different versions of the same algorithm perform this way.
The original project, if you are interested, came from trying to write a formally verified parallel merge sort in Verus; you can see it here.

Your code is very hard to reason about.
For example, the distinction between parallel and sequential sorting relies on an obscure threshold parameter, which is interpreted differently in each implementation of the trait Sort.
Are you sure that the Sequential benchmark guarantees sequential execution?
Are you sure that the Parallel benchmark guarantees parallel execution?

Hello Schard!

Thank you for the quick response.
The threshold parameter is interpreted exactly the same way in all three implementations: it controls the sub-array length at which the code stops spawning threads and runs the merge sort sequentially.

I am sure the sequential case is sequential, as the threshold is equal to the length of the array, so the sequential merge sort is called straight away.
I am sure the parallel case is parallel, at least because the parallel benches for the same algorithm run about 5× faster than the sequential ones.