Unable to optimize bounds checks

I was measuring the number of clock cycles the following bubble sort takes on a thumbv7m-none-eabi target:

#[no_mangle]
#[inline(never)]
fn bubble_sort_rust(array: &mut [u32]) {
    let len = array.len();
    for i in 0..len - 1 {
        for j in 0..len - i - 1 {
            if array[j] > array[j + 1] {
                array.swap(j, j + 1);
            }
        }
    }
}

For an array of around 100 elements it takes 112900 cycles to sort the array. The following C implementation takes 66972 cycles:

void bubble_sort_c(uint32_t arr[], uint32_t n) {
  uint32_t i, j;
  for (i = 0; i < n - 1; i++) {
    for (j = 0; j < n - i - 1; j++)
      if (arr[j] > arr[j + 1]) {
        uint32_t temp = arr[j];
        arr[j] = arr[j + 1];
        arr[j + 1] = temp;
      }
  }
}

However, when I modify the Rust code to the following (adding unreachable_unchecked):

#[no_mangle]
#[inline(never)]
fn bubble_sort_rust(array: &mut [u32]) {
    let len = array.len();

    for i in 0..len - 1 {
        for j in 0..len - i - 1 {
            if j + 1 >= len {
                unsafe { core::hint::unreachable_unchecked() };
            }

            if array[j] > array[j + 1] {
                array.swap(j, j + 1);
            }
        }
    }
}

sorting only takes 54805 cycles.

I have no idea why LLVM cannot optimize this without the hint. I'm really interested in whether this bubble sort can be optimized without the unsafe hint, since the first implementation seems like the one most people would write.

(I know that bubble sort is not the best sorting algorithm; it is just for testing purposes.)

One problem with this is that len - 1 overflows if len == 0.

But adding assert!(len > 0) also doesn't remove all bounds checks.
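
Concretely, I mean something like this sketch (the function name is mine; the assert is the only change to the OP's code):

#[no_mangle]
#[inline(never)]
fn bubble_sort_rust_assert(array: &mut [u32]) {
    let len = array.len();
    // Panics on an empty slice, so the `len - 1` below can no longer wrap,
    // but some bounds checks in the inner loop still remain.
    assert!(len > 0);
    for i in 0..len - 1 {
        for j in 0..len - i - 1 {
            if array[j] > array[j + 1] {
                array.swap(j, j + 1);
            }
        }
    }
}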

Indeed, I added the assert!(len > 0) and now it takes 92952 cycles instead of 112900 cycles. Adding the assert to the unsafe unreachable_unchecked version doesn't change anything.

I think it's just too complicated for the compiler to track the relationship between three different variables: to elide the checks it would have to derive from j < len - i - 1 that j + 1 < len - i <= len, i.e. that both indices are in bounds.

Try this:

fn bubble_sort_rust(array: &mut [u32]) {
    let len = array.len();
    for i in 0..len {
        for j in 0..len-1 {
            if j >= len - 1 - i {
                break;
            }
            if array[j] > array[j + 1] {
                array.swap(j, j + 1);
            }
        }
    }
}

This results in 95588 cycles.

I now tried to compile the C version with clang instead of gcc to see if this has an impact, and clang manages to compile it to an even faster implementation, with 49884 cycles.

So currently the C version compiled with clang is fastest (49884 cycles), followed by the Rust implementation with the unsafe unreachable_unchecked (54805 cycles).

I don't think it is too hard to check the relationship between three different variables. I guess compilers can do even more difficult things than this.

Here is a version using iterators. (The Cell trick is needed because windows only yields shared references; Cell makes the elements mutable through them.) Looking at the generated assembly there is still an unnecessary bounds check, but only in the outer loop:

use std::cell::Cell;

fn bubble_sort_rust(array: &mut [u32]) {
    let array = Cell::from_mut(array).as_slice_of_cells();
    for len in (2..=array.len()).rev() {
        for window in array[..len].windows(2) {
            if window[0] > window[1] {
                window[0].swap(&window[1]);
            }
        }
    }
}

Tried it and it needs 62350 cycles. Thanks for all the ideas!

Here's another version with bounds checks elided. The assembly is pretty weird though.

pub fn bubble_sort_rust(mut array: &mut [u32]) {
    while array.len() >= 2 {
        for i in 1..array.len() {
            if array[i-1] > array[i] {
                array.swap(i-1, i);
            }
        }
        let len = array.len();
        array = &mut array[..len-1];
    }
}

This is currently the fastest Rust implementation that avoids unsafe: 53109 cycles! Still slower than C (with clang).

Here is the fastest implementation that I could think of. Note that it uses the unstable array_windows feature, and thus must currently be compiled on nightly.

#![feature(array_windows)]

use std::cell::Cell;

#[no_mangle]
pub fn sort_cell_windows_array(buf: &mut [u32]) {
    let array = Cell::from_mut(buf).as_slice_of_cells();
    let len = array.len();
    for i in (0..=len).rev() {
        for [left, right] in array[..i].array_windows::<2>() {
            if left.get() > right.get() {
                left.swap(right);
            }
        }
    }
}

In my benchmarks it is faster than any other implementation listed in this thread, including the unreachable_unchecked version (roughly 20-40% faster).

The following version achieves roughly the same performance, but with fewer explicit indices:

#![feature(array_windows)]

use std::cell::Cell;

#[no_mangle]
pub fn sort_split_array(buf: &mut [u32]) {
    let mut array = Cell::from_mut(buf).as_slice_of_cells();
    loop {
        for [left, right] in array.array_windows::<2>() {
            if left.get() > right.get() {
                left.swap(right);
            }
        }
        if let Some((_, new)) = array.split_last() {
            array = new;
        } else {
            return;
        }
    }
}

If you don't want to use nightly, then your next best bet is to use explicit indices (i, i+1) in the iteration over windows of size 2, as in the sketch below. This version is roughly 10-15% slower in my benchmarks. Using the <[u32]>::windows method causes a large (almost 50%) drop in performance, because the window length is unknown to the compiler, which leads to extra bounds checks.
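
For reference, something along these lines (my sketch of the explicit-index variant, not necessarily the exact code I benchmarked; the function name is made up):

use std::cell::Cell;

#[no_mangle]
pub fn sort_cell_indices(buf: &mut [u32]) {
    let array = Cell::from_mut(buf).as_slice_of_cells();
    for i in (0..=array.len()).rev() {
        let part = &array[..i];
        // Compare adjacent pairs through explicit indices; starting at 1 and
        // indexing with j - 1 and j avoids any subtraction that could wrap.
        for j in 1..part.len() {
            if part[j - 1].get() > part[j].get() {
                part[j - 1].swap(&part[j]);
            }
        }
    }
}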

Try this (Godbolt):

use std::cell::Cell;

#[inline]
fn array_windows<T, const N: usize>(slice: &[T]) -> impl Iterator<Item = &[T; N]> + ExactSizeIterator + DoubleEndedIterator + Clone {
    slice.windows(N).map(|w| <&[T; N]>::try_from(w).unwrap())
}

pub fn sort_cell_windows_array(buf: &mut [u32]) {
    let array = Cell::from_mut(buf).as_slice_of_cells();
    for i in (0..=array.len()).rev() {
        for [left, right] in array_windows(&array[..i]) {
            if left.get() > right.get() {
                left.swap(right);
            }
        }
    }
}

I have successfully implemented my own array_windows() function before without it containing bounds checks. It might work for you too.

That is not my experience. I have tried using your function in my benchmarks, replacing my array_windows calls with calls to yours. These are the results:

basic/naive             time:   [702.68 us 704.39 us 706.13 us]                        
basic/cell              time:   [664.82 us 666.16 us 667.66 us]                       
basic/unreachable       time:   [426.65 us 427.52 us 428.49 us]                              

windows/slice           time:   [447.24 us 448.15 us 449.08 us]                          
windows/array 🌃      time:   [452.67 us 453.57 us 454.61 us]                                 
windows/array_stable    time:   [499.43 us 500.31 us 501.28 us]                                 
windows/array_unsafe    time:   [447.50 us 449.81 us 452.29 us]                                 

split/indices           time:   [690.71 us 692.11 us 693.75 us]                          
split/array 🌃        time:   [432.66 us 433.25 us 433.92 us]                               
split/array_stable      time:   [519.03 us 520.86 us 522.74 us]                               
split/array_unsafe      time:   [498.97 us 500.13 us 501.38 us]                               
split/slice             time:   [517.97 us 519.11 us 520.35 us]       

Here the benchmark groups correspond to the different algorithms: basic/naive and basic/unreachable are the algorithms from the OP, windows/array is the first algorithm from my previous post, and split/array is the second one. The ones with the slice suffix use the <[u32]>::windows call, the ones with array_stable use your array_windows function, while the array_unsafe ones use a similar function with a pointer cast:

#[inline]
fn array_windows_unsafe<T, const N: usize>(slice: &[T]) -> impl Iterator<Item=&[T; N]> + ExactSizeIterator + DoubleEndedIterator + Clone {
    slice.windows(N).map(|w| unsafe {
        &*(w as *const [T]).cast::<[T; N]>()
    })
}

The array_stable methods are consistently slower than the array ones, while for the array_unsafe and slice methods the performance differs between the explicitly subslicing and the iteratively slice-shortening implementations.

I could not identify the specific reason for the speed difference. At the ASM level, the array methods actually do more branching. Both implementations perform a single panicking check: the subslicing array[..i], which is, surprisingly, not optimized away, even though i comes from (0..=len).rev().

How exactly are you compiling this? Note the flags in the Godbolt snippet: -C opt-level=3 -C target-cpu=native.

I am running cargo criterion. I assume it already does -O. I have tried running with -C target-cpu=native. Surprisingly, the basic tests run 20% slower, while all others run 10-30% faster. Either way, the relative performance of the implementations is the same.

If you check the generated assembly, are there bounds checks related to the length of the window (in my version)? It's not unimaginable that code organization itself, and not bounds checking, affects the relative performance of the code more heavily.

Indeed, I believe that the window size checks are optimized out, and the windows iterator is just harder to optimize, at least in this case.

Thanks for all the proposals! Just a question about assert.
Consider the following:

pub fn bubble_sort(array: &mut [u32]) {
    if array.is_empty() {
        return;
    }

    let len = array.len();
    for i in 0..len - 1 {
        let subslice = &mut array[..len - i];
        assert!(!subslice.is_empty());
        for j in 0..subslice.len() - 1 {
            if subslice[j + 1] < subslice[j] {
                subslice.swap(j, j + 1);
            }
        }
    }
}

Why does the assert help the compiler optimize? The version with the assert is more than 2 times faster than the version without it (on the nRF54820). Shouldn't it be mentioned in the docs or somewhere else that asserts can actually help the compiler with optimizations? Personally, I would not have thought of writing the assert in the first place.

Because it tells the compiler something it can rely on after the assert, and thus lets it move the panic! outside the loop.

A loop with a panic inside is very hard to optimize because it has the extra exit condition of the panic -- it's not allowed to start running the next iteration if the current one is going to panic. If it knows that the range of values is such that the body never panics, all of a sudden it can unroll and vectorize and all that fun stuff.
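
A minimal illustration (hypothetical functions, not code from this thread): in the first loop every indexing operation is an extra exit edge, while in the second the assert makes the loop body provably panic-free:

// Every `a[i]` can panic, so each iteration has an extra exit condition,
// which blocks unrolling and vectorization.
pub fn sum_first_n(a: &[u32], n: usize) -> u32 {
    let mut s = 0;
    for i in 0..n {
        s = s.wrapping_add(a[i]);
    }
    s
}

// The assert concentrates the only possible panic before the loop; after it
// the compiler knows i < n <= a.len(), and the body has no panic paths left.
pub fn sum_first_n_assert(a: &[u32], n: usize) -> u32 {
    assert!(n <= a.len());
    let mut s = 0;
    for i in 0..n {
        s = s.wrapping_add(a[i]);
    }
    s
}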

Also, remember that wrapping cases are particularly confusing for everyone, human or machine -- much of what that assert is doing is saying "the subslice.len() - 1 won't wrap". But you can also do that by smarter choices of indexes that don't need to do the adjust-by-one things. For example, consider experimenting with something like this:

pub fn bubble_sort(array: &mut [u32]) {
    let len = array.len();
    for i in 0..len {
        let subslice = &mut array[..len - i];
        for j in 1..subslice.len() {
            if subslice[j - 1] > subslice[j] {
                subslice.swap(j - 1, j);
            }
        }
    }
}

That never does the len - 1 stuff, so it works fine with empty slices, and thus doesn't need the artificial "return if empty" step at the beginning either, in addition to not having any bounds checks.

Note that strictly speaking, the two pieces of code (with and without the assertion) are not equivalent. When bounds checks are performed, the code will panic at the exact moment the invalid index is encountered. This may mean panicking only on the last iteration, even if the panic could have been foreseen by a human analyzing the code.

This is not only because the compiler isn't necessarily smart enough to hoist the bounds check out of the loop. The reason why this simply can't be done otherwise is that panicking is observable behavior, and a programmer debugging the code would be very upset if the panic didn't occur at the root cause of the invalid operation. It would be really frustrating if the code panicked with Invalid index: 42 even before the first iteration, when printing the induction variable would come back with 0.

In contrast, the assertion makes this "clairvoyant" line of reasoning explicit – here, if you have an empty slice, there is no circumstance in which any indexing into it will be valid, and by writing the assert, you explicitly ask the code to panic upon an empty slice even before the loop would be entered.

Now, an assert is known to panic (diverge) upon violating the precondition, and it inserts an explicit statement of unreachability into the generated code, too (since type checking depends on this property of the ! "never" type). This means that the compiler has the size information in a much more pre-digested manner at its disposal – it is basically trivial for it to optimize based on the assumption that if the subsequent code path is reached, then the length must be strictly positive.
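
For intuition, assert! expands to roughly the following (a simplified sketch of the real macro):

// Simplified sketch of what assert!(len > 0) turns into:
if !(len > 0) {
    panic!("assertion failed: len > 0"); // diverges, has the never type `!`
}
// Everything below this point is only reachable when len > 0,
// and the optimizer is free to rely on that.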

In contrast, when there is no explicit assertion, the compiler would need to infer implicit size information from the loop body, potentially analyzing code across iterations, which is substantially more complicated, so maybe it just gives up.

Thanks for the explanation. I now understand why assert can help!