Performance of array access vs C

Don't forget about overflow checks on y as well (whether they're present depends on the profile; by default they're on in debug and off in release)
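To illustrate (a minimal sketch, not from the thread), wrapping arithmetic sidesteps those checks entirely:

fn bump(y: u64) -> u64 {
    // With overflow-checks enabled (the default in debug builds, and
    // configurable per-profile in Cargo.toml), y + 1 carries a runtime
    // check and can panic. Wrapping arithmetic makes the overflow
    // behavior explicit and removes the check:
    y.wrapping_add(1)
}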

1 Like

Thanks, I do need to look into these more. I started to try out criterion on the original code but got discouraged when I found out it can't be used on binary crates. I may revisit it later if I have more time. That still wouldn't address cross-language comparison, though. Since I essentially have two parallel implementations of the same program (the chess engine), it seems like I should be able to leverage that to track down which parts of the Rust code could be optimized by comparing the pieces against C, together with profiling. That is part of the reason for the ad hoc benchmarking.

I'm not sure exactly what you mean by bounds checking on each iteration. Are you referring to the array access? The iteration shown in the example code is not actually part of the motivating algorithm (which is more or less randomly accessing array elements), but array access is. I just added the iteration loop for timing purposes.

Thanks, good point. The only reason it's in there at all is to try and prevent the compiler from optimizing the whole thing away.

Bounds checking means verifying, every time you access an array, that the index you are using is within the bounds of the array. In your case, your zero array is 64 elements long. If you did zero[67] in C, it would go ahead and return a piece of memory past the end of your array, and your program would carry on as if nothing strange had happened. In Rust, the default is to check on every array access that the index actually points to an element inside the array. So zero[67] in Rust would cause the program to panic at runtime and tell you that you've accessed a value out of the bounds of the array. Rust performs this check every time you access an element of an array, which is naturally slower than just assuming everything is fine.
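As a quick illustration (a minimal sketch, not from the thread):

fn main() {
    let zero = [0u64; 64];
    // The index comes from outside so the compiler can't reject it statically.
    let i: usize = std::env::args().nth(1).and_then(|s| s.parse().ok()).unwrap_or(67);
    // In C this read would silently touch memory past the end of the array;
    // in Rust it panics: "index out of bounds: the len is 64 but the index is 67"
    println!("{}", zero[i]);
}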

The way around this in Rust is to use get_unchecked or get_unchecked_mut. These functions basically tell the compiler: I know what I'm doing, take my word for it that the index points inside the array. These functions are unsafe, so you need to mark the code with an unsafe block. In your code, it would look like this:

if unsafe { zero.get_unchecked(y % 64) & val } == 0 {
    y += 1;
}

In theory, if the compiler can determine the call is always safe, it will eliminate the bounds check, but in this case it didn't, which is why you need to manually tell it to skip the check.

Or you could formulate your code so that rustc can verify before entering a loop that every index you will generate is within bounds, in which case those checks precede the loop and are removed from each iteration. Per-access index and overflow checks can trigger panics at each iteration, which stops rustc from vectorizing your code. Doing the checks before the loop, and formulating the algorithm to avoid potential overflow (for example by using wrapping arithmetic), can in many cases enable rustc to vectorize your code.
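A minimal sketch of that formulation (the function is hypothetical, not from the posts): slice once up front so the single check precedes the loop, and use wrapping arithmetic to rule out overflow panics:

pub fn sum_prefix(data: &[u64], n: usize) -> u64 {
    // The one and only bounds check happens here, when the subslice is made.
    let window = &data[..n];
    let mut total: u64 = 0;
    for &x in window {
        // No per-element checks remain, and wrapping_add cannot panic,
        // so nothing in the loop body blocks vectorization.
        total = total.wrapping_add(x);
    }
    total
}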

1 Like

Nice, I didn't know about these. I tried swapping them out in my example benchmarking code, and I don't see any difference. Maybe the compiler has already optimized out the bounds check here. I will need to check whether this is also the case in my real code.

However, I did come across something interesting: I realized that as usize wasn't needed in this example. Removing it resulted in a factor-of-2 performance increase. That is why I thought there was a difference coming from the bitwise AND (I had as usize in one case but not the other). This is a little worrisome, though, because I use as usize extensively in the real code, since I index on things like u8 and enums, which require the conversion. I had tentatively assumed that such a conversion would have no real performance penalty, but that doesn't seem to be the case.
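For context, the kind of conversion in question looks like this (a hypothetical sketch, not the actual engine code):

// Board squares are stored as u8, but indexing requires usize, so every
// access pays for the widening conversion (normally a free zero-extension,
// but evidently worth verifying in hot loops).
fn piece_at(board: &[u64; 64], square: u8) -> u64 {
    board[square as usize]
}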

What I have at this point is that both of the following versions perform the same for me, and each is about a factor of two different from C. This would seem to rule out bounds checking as the source of any performance difference.

        if (zero[(y % 64)] & val) == 0 {

or

        if unsafe { (zero.get_unchecked(y % 64) & val) == 0 } {

Well, now I'm thinking I probably need to go back to benchmarking the real code instead of this simplified example. I'm seeing some odd things in the performance of the C code depending on what values I initialize the array with. When I initialize with zero[i] = i, the C performance all of a sudden gets very bad (5 times slower than Rust). And with zero values, the C code runs faster for if ((zero[y % 64] & val) == 0) than it does for if ((i & val) == 0) (the opposite of what I would expect). There may be some unique factors in how the compilers handle this example code that have nothing to do with what's happening in the real code.

In contrived examples, compilers can sometimes optimize away significant portions of a calculation, which makes benchmarking completely pointless. When benchmarking, you should use representative data that isn't just statically defined in the program.

You should avoid using unsafe unless it is absolutely necessary. There are several ways to prove to rustc that it doesn't need to add bounds checks without letting your guard down.

pub fn slow_mode(iterations: usize) -> usize {
    let zero: [u64; 64] = [0; 64];
    let val: u64 = 123456;

    let mut y = 0;

    for _ in 0..iterations {
        if (zero[y % 64] & val) == 0 {
            y += 1;
        }
    }

    y
}

pub fn fast_mode(iterations: usize) -> usize {
    let zero: [u64; 64] = [0; 64];
    let val: u64 = 123456;

    let mut y = 0;

    for _ in 0..iterations {
        if (zero[y] & val) == 0 {
            y += 1;
            y %= 64;
        }
    }

    y
}

slow_mode is identical to the code you originally offered. fast_mode is the same algorithm without bounds checks. You can see how I removed the bounds checks; the compiler can statically guarantee that the index will be within the array bounds because its value is clamped at the same time it is incremented.

Here's a benchmark:

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use bench_bool::{fast_mode, slow_mode};

pub fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("fast mode", |b| b.iter(|| fast_mode(black_box(1000))));
    c.bench_function("slow mode", |b| b.iter(|| slow_mode(black_box(1000))));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

And the results on my laptop (i7-7920HQ CPU @ 3.10GHz):

fast mode               time:   [669.13 ns 671.93 ns 674.93 ns]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

slow mode               time:   [2.5728 us 2.5898 us 2.6081 us]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

Not a bad improvement, just by moving a single operation, huh!

Glad to hear you were interested in criterion! I've played with it enough to be somewhat comfortable with it. Definitely my tool of choice for writing and running microbenchmarks. I have another tip for you! I started following a pattern that is pretty common in the Rust ecosystem where the bulk of application logic lives in a library crate, and the binary only does minimal work; mostly just argument parsing and some top-level error handling. Then it calls into the library. This also lends itself well to testing (and benchmarking) because your integration tests and benchmarks can simply pull in the parts of your library that you are interested in testing.
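As a sketch of that layout (the crate and function names below are hypothetical):

// src/main.rs — the binary stays thin: parse arguments, handle top-level
// errors, and delegate the real work to the library crate.
fn main() {
    let depth: u32 = std::env::args()
        .nth(1)
        .and_then(|arg| arg.parse().ok())
        .unwrap_or(6);

    // chess_engine::search is a made-up library entry point; benchmarks
    // and integration tests can call into the same library directly.
    match chess_engine::search(depth) {
        Ok(best_move) => println!("best move: {}", best_move),
        Err(err) => eprintln!("error: {}", err),
    }
}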

Well, maybe it won't directly address cross-language testing, but Rust already has adequate support for FFI! You shouldn't have much difficulty linking your library crate with a C static lib. It's literally just an import, so your benchmarks would be calling the C function directly without any overhead.
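A minimal sketch of that import (the library name and C signature are hypothetical; you'd point Cargo at the static lib via a build script or the link attribute):

// Declare the C function so Rust code, including benchmarks, can call it
// directly with no wrapper overhead.
#[link(name = "engine_c", kind = "static")] // hypothetical library name
extern "C" {
    fn evaluate_position(depth: i32) -> i32;
}

pub fn evaluate_from_c(depth: i32) -> i32 {
    // unsafe because the compiler must trust the C side to uphold its contract.
    unsafe { evaluate_position(depth) }
}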

4 Likes

@parasyte Is that a known trick to get the compiler to optimize out the bounds check in this case? Or is that just something to try to see if it works?

Along those lines, are the various ways to hint to the compiler to eliminate bounds checks without unsafe written down anywhere?

1 Like

I would say it's pretty well-known. This is exactly what @TomP suggested earlier. Some other examples would be inserting assert!() in strategic locations. I am not aware of any documentation in this regard, just information scattered in various blogs and LLVM white papers and so on.
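For example, a minimal sketch of the assert!() approach (hypothetical function, not from the thread):

pub fn sum_first(data: &[u64], n: usize) -> u64 {
    // The assert establishes n <= data.len() once, up front; rustc can then
    // prove i < data.len() inside the loop and elide the per-access checks.
    assert!(n <= data.len());
    let mut total: u64 = 0;
    for i in 0..n {
        total = total.wrapping_add(data[i]);
    }
    total
}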

It may be worth reiterating that this particular algorithm cannot be easily vectorized; the index is conditional upon previously visited elements, so it is strictly linear. One might choose to vectorize it by hand, computing both the branching and non-branching cases at the same time for multiple iterations, but that's another exercise, and who knows whether it would actually work all that well!

In all honesty, I actually verified there was a difference in the generated code before I ran the benchmarks. When I observed what looked like a promising modification, I validated it. Just the good old scientific method, really: hypothesize, observe, experiment, record findings.

On closer inspection, I don't see any bounds checks in either function. fast_mode does allow the loop to unroll better, and eliminates a lot of unnecessary 'and' instructions, though (y % 64 is transformed into and eax, 63).

2 Likes

The bounds check is optimized out, so the differences come from other parts.

2 Likes

Wow, parasyte, thank you. Your little trick there inspired me to revisit some Rust conversion of C code I was toying with a while ago that does a lot of array indexing.

My problem was that anything I tried either ran quite a bit slower than the original C or ended up using "unsafe" or caused clippy to complain about my style. I like to keep clippy happy. That is despite a long discussion about it and many suggestions in the thread: Clippy driving me to insanity, insisting on iterators

The problem was to translate this C code:

void do_it (int32_t a[MAX][MAX], int32_t b[][MAX]) {
    for (size_t i = 0; i < MAX; i += BLOCK_SIZE) {
        for (size_t j = 0; j < MAX; j += BLOCK_SIZE) {
            for (size_t ii = i; ii < i + BLOCK_SIZE; ii++) {
                for (size_t jj = j; jj < j + BLOCK_SIZE; jj++) {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }
}

Following your inspiration I now have this translation:

fn do_it (a: &mut Array, b: &Array) {
    for (_, i) in (0..(MAX / BLOCK_SIZE)).enumerate() {
        for (_, j) in (0..(MAX / BLOCK_SIZE)).enumerate() {
            for (_, ii) in ((i * BLOCK_SIZE)..((i * BLOCK_SIZE) + BLOCK_SIZE)).enumerate() {
                for (_, jj) in ((j * BLOCK_SIZE)..((j * BLOCK_SIZE) + BLOCK_SIZE)).enumerate() {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }
}

Finally, it runs at exactly the speed of the C, with no "unsafe" and no clippy complaints.

It has been a long road. All the suggested solutions can be seen here: https://github.com/ZiCog/loop_blocking

Given that nobody came up with a safe, clippy-happy, and performant solution in the original lengthy discussion, I would say that such tricks to get optimal code out of Rust are pretty obscure.

3 Likes

@parasyte Thanks for all the tips and insights! I am looking into refactoring my original program into separate bin and library crates and making use of criterion.

I did some testing with your fast and slow mode examples: I can reproduce your results using criterion. However, using manual benchmarking, both versions perform the same for me. My timing is about 0.6 ns/iter (in line with criterion's fast mode timing of 672 ns / 1000 loop iterations).

One thing that is not intuitive to me is why the compiler would be able to optimize out bounds checking (if that really is the source of difference in the criterion timings) when you do y %= 64; but not when the modulo appears directly in the array index, zero[y % 64].

What do you mean by "manual benchmarking?"

It isn't optimizing out bounds checking. It is not capable of optimizing out the modulo operator, which executes on every iteration unnecessarily. You can see this in the generated assembly, along with some loop unrolling.

1 Like

The compiler does indeed optimize out the bounds checking when the modulo appears directly in the array index. Check the assembly I posted above; there's no panic call.

Not the best terminology, but I meant just calling from main the handwritten timing function that I originally posted. To be clear, my latest version of this is below, with a few alternatives commented out. On my computer, the "fast mode" runs at the same speed as the "slow mode" in this timing function. Also, I originally had a type conversion using as usize; removing it had a significant impact on performance.

The comments from @Hyeonu about the bounds checking being optimized out are consistent with my experience using the handwritten timing function, in particular getting the same performance when explicitly using get_unchecked. I'm not sure why the difference shows up in criterion but not in my timing function.

use std::time::Instant;

pub fn benchmark_bitmask(iterations: u64) {
    let zero: [u64; 64] = [0; 64];
    let val: u64 = 123456;
    
    let mut y = 0;
    let start_time = Instant::now();
    
    for _ in 0..iterations {
        if (zero[(y % 64)] & val) == 0 {
            y += 1;
        }

        // "Fast" version, but performance is same
        // if (zero[y] & val) == 0 {
        //     y += 1;
        //     y %= 64;
        // }

        // Explicitly use get_unchecked.  Same performance.
        // if unsafe { (zero.get_unchecked(y % 64) & val) == 0 } {
        //     y += 1;
        // }

        // Original version with type conversion.  Almost twice as slow.
        // if (zero[(y % 64) as usize] & val) == 0 {
        //     y += 1;
        // }

    }

    let duration = start_time.elapsed();

    println!("y: {}", y);
    println!("benchmark: {} iterations in {}ms ({:.7}ns per iter)", iterations, duration.as_millis(), (duration.as_secs_f32() / (iterations as f32)) * 1e9);
}

I guess it depends on the number you are passing for iterations. Criterion is kind of smart about how it measures time. For example, you can remove the loop entirely and it will still provide a reasonable result (it runs the function under test with multiple iterations of its own).
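For instance (a hypothetical variant using the same Criterion API as above), the hand-written loop can be dropped entirely and Criterion still supplies the iterations:

use criterion::{black_box, Criterion};

pub fn single_access_benchmark(c: &mut Criterion) {
    let zero: [u64; 64] = [0; 64];
    let val: u64 = 123456;
    c.bench_function("single access", |b| {
        // b.iter calls this closure many times and times the whole batch,
        // so the function under test can do one unit of work per call.
        b.iter(|| (zero[black_box(7) % 64] & black_box(val)) == 0)
    });
}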

I just want to confirm: are you seeing a meaningful difference between this and the more straightforward:

for i in 0..(MAX / BLOCK_SIZE) {

? If so, color me utterly surprised!

2 Likes

Ah, but there is the magical thing, marcianx. There is no difference. In fact, the compiler produces exactly the same instructions for both. And brilliantly, the compiler removes the duplicate code from my binary and calls the same thing for both. How about that?!

The difference is that the straightforward version causes clippy to complain like so:

warning: the loop variable `ii` is used to index `a`
  --> src/main.rs:47:23
   |
47 |             for ii in (i * BLOCK_SIZE)..((i * BLOCK_SIZE) + BLOCK_SIZE) {
   |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: `#[warn(clippy::needless_range_loop)]` on by default
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_range_loop
help: consider using an iterator

and requires "#[allow(clippy::needless_range_loop)]" to shut it up. As I was searching for "idiomatic Rust" I wanted clippy to be happy.

So the actual variants of the function are:

#[allow(clippy::needless_range_loop)]
fn transpose_2(a: &mut Array, b: &Array) {
    for i in 0..(MAX / BLOCK_SIZE) {
        for j in 0..(MAX / BLOCK_SIZE) {
            for ii in (i * BLOCK_SIZE)..((i * BLOCK_SIZE) + BLOCK_SIZE) {
                for jj in (j * BLOCK_SIZE)..((j * BLOCK_SIZE) + BLOCK_SIZE) {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }
}

fn transpose_3(a: &mut Array, b: &Array) {
    for (_, i) in (0..(MAX / BLOCK_SIZE)).enumerate() {
        for (_, j) in (0..(MAX / BLOCK_SIZE)).enumerate() {
            for (_, ii) in ((i * BLOCK_SIZE)..((i * BLOCK_SIZE) + BLOCK_SIZE)).enumerate() {
                for (_, jj) in ((j * BLOCK_SIZE)..((j * BLOCK_SIZE) + BLOCK_SIZE)).enumerate() {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }
}

Looks like the complication of using .enumerate() on the two outer loops is not actually required to keep clippy quiet, but it's there for consistency's sake.