Performance of array access vs C

Well, now I'm thinking I probably need to go back to benchmarking the real code instead of this simplified example. I'm seeing some odd things in the performance of the C code depending on what values I initialize the array with. When I initialize with zero[i] = i, the C performance suddenly gets very bad (5 times slower than Rust). And with zero values, the C code runs faster for if ((zero[y % 64] & val) == 0) than it does for if ((i & val) == 0) (the opposite of what I would expect). There may be some unique factors in how the compilers handle this example code that have nothing to do with what's happening in the real code.

In contrived examples, compilers can sometimes optimize away significant portions of a calculation, which makes benchmarking completely pointless. When benchmarking, you should use representative data that isn't just statically defined in the program.
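
One way to keep the optimizer honest is criterion's black_box, which also appears in the benchmark further down. Here is a minimal sketch of the idea, with a made-up expensive() function standing in for real work:

use criterion::black_box;

fn expensive(x: u64) -> u64 {
    // Stand-in for real work; this function is purely illustrative.
    (0..64).fold(x, |acc, i| acc ^ acc.wrapping_mul(i + 1))
}

fn main() {
    // black_box hides the value from the optimizer, so the whole call cannot
    // be constant-folded away at compile time.
    let result = expensive(black_box(123456));
    // Also keep the result from being discarded as dead code.
    black_box(result);
}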

You should avoid using unsafe unless it is absolutely necessary. There are several ways to prove to rustc that it doesn't need to add bounds checks without letting your guard down.

pub fn slow_mode(iterations: usize) -> usize {
    let zero: [u64; 64] = [0; 64];
    let val: u64 = 123456;

    let mut y = 0;

    for _ in 0..iterations {
        if (zero[y % 64] & val) == 0 {
            y += 1;
        }
    }

    y
}

pub fn fast_mode(iterations: usize) -> usize {
    let zero: [u64; 64] = [0; 64];
    let val: u64 = 123456;

    let mut y = 0;

    for _ in 0..iterations {
        if (zero[y] & val) == 0 {
            y += 1;
            y %= 64;
        }
    }

    y
}

slow_mode is identical to the code you originally offered. fast_mode is the same algorithm without bounds checks. You can see how I removed the bounds checks; the compiler can statically guarantee that the index will be within the array bounds because its value is clamped at the same time it is incremented.

Here's a benchmark:

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use bench_bool::{fast_mode, slow_mode};

pub fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("fast mode", |b| b.iter(|| fast_mode(black_box(1000))));
    c.bench_function("slow mode", |b| b.iter(|| slow_mode(black_box(1000))));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

And the results on my laptop (i7-7920HQ CPU @ 3.10GHz):

fast mode               time:   [669.13 ns 671.93 ns 674.93 ns]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

slow mode               time:   [2.5728 us 2.5898 us 2.6081 us]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

Not a bad improvement, just by moving a single operation, huh!

Glad to hear you were interested in criterion! I've played with it enough to be somewhat comfortable with it. Definitely my tool of choice for writing and running microbenchmarks. I have another tip for you! I started following a pattern that is pretty common in the Rust ecosystem where the bulk of application logic lives in a library crate, and the binary only does minimal work; mostly just argument parsing and some top-level error handling. Then it calls into the library. This also lends itself well to testing (and benchmarking) because your integration tests and benchmarks can simply pull in the parts of your library that you are interested in testing.
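
A minimal sketch of that layout, reusing the bench_bool crate name from the benchmark above (the argument handling is just an illustration, not a recommendation):

// src/main.rs: thin binary; all of the logic lives in the bench_bool library
// crate (src/lib.rs), which tests and benchmarks can also pull in directly.
use bench_bool::fast_mode;

fn main() {
    // Minimal argument parsing: take an iteration count, defaulting to 1000.
    let iterations = std::env::args()
        .nth(1)
        .and_then(|arg| arg.parse().ok())
        .unwrap_or(1000);
    println!("y = {}", fast_mode(iterations));
}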

Well, maybe it won't directly address cross-language testing, but Rust already has adequate support for FFI! You shouldn't have much difficulty linking your library crate with a C static lib. It's literally just an import, so your benchmarks would be calling the C function directly without any overhead.
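
As a sketch of what that might look like (the do_it_c symbol and library name are hypothetical, and the static lib would be built and linked via a build script or similar):

// Hypothetical declaration of a C function compiled into a static library.
#[link(name = "do_it_c", kind = "static")]
extern "C" {
    fn do_it_c(iterations: usize) -> usize;
}

pub fn c_mode(iterations: usize) -> usize {
    // Calling across the FFI boundary requires unsafe, but costs no more than
    // an ordinary function call, so a benchmark measures the C code itself.
    unsafe { do_it_c(iterations) }
}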

4 Likes

@parasyte Is that a known trick to get the compiler to optimize out the bounds check in this case? Or is that just something to try to see if it works?

Along those lines are the various ways to hint to the compiler to eliminate bounds checks without unsafe written down anywhere?

1 Like

I would say it's pretty well-known. This is exactly what @TomP suggested earlier. Another example would be inserting assert!() in strategic locations. I am not aware of any documentation in this regard, just information scattered in various blogs and LLVM white papers and so on.
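
For instance, here is a sketch of the assert!() approach (the function is made up for illustration): asserting the invariant once, up front, can let the optimizer remove the per-element checks inside the loop.

// One up-front assert! proves n <= data.len(), so the indexing below cannot
// go out of bounds and the per-element bounds checks can be optimized away.
pub fn sum_first_n(data: &[u64], n: usize) -> u64 {
    assert!(n <= data.len());
    let mut total = 0;
    for i in 0..n {
        total += data[i];
    }
    total
}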

It may be worth reiterating that this particular algorithm cannot be easily vectorized; the index is conditional upon previously visited elements, so it is strictly linear. One might choose to vectorize it by hand, computing both the branching and non-branching cases at the same time for multiple iterations, but that's another exercise, and who knows if it will actually work all that well!
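
As an illustration only, here is the scalar branchless form of the loop body that such an approach would build on (a sketch; it removes the conditional jump but not the serial index dependency, so whether it helps is exactly the open question):

pub fn branchless_mode(iterations: usize) -> usize {
    let zero: [u64; 64] = [0; 64];
    let val: u64 = 123456;
    let mut y = 0;
    for _ in 0..iterations {
        // The comparison result (0 or 1) is added directly; no branch, but
        // each index still depends on the previous iteration.
        y += ((zero[y % 64] & val) == 0) as usize;
    }
    y
}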

In all honesty, I actually verified there was a difference in the generated code before I ran the benchmarks. When I observed what looked like a promising modification, I validated it. Just the good old scientific method, really; hypothesize, observe, experiment, record findings.

On closer inspection, I don't see any bounds checks in either function. fast_mode does allow the loop to unroll better and eliminates a lot of unnecessary and instructions, though (y % 64 is transformed into and eax, 63).
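
That transformation is just the power-of-two identity: for unsigned integers, x % 64 == x & 63, so the two toy functions below compile to the same single and instruction.

// For any power of two, x % 2^k equals x & (2^k - 1) when x is unsigned,
// so the compiler lowers the modulo to a bitwise and (and eax, 63 for % 64).
pub fn modulo(y: usize) -> usize { y % 64 }
pub fn masked(y: usize) -> usize { y & 63 }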

2 Likes

The bounds check is optimized out, so the differences come from other parts.

2 Likes

Wow, parasyte, thank you. Your little trick there inspired me to revisit some Rust conversion of C code I was toying with a while ago that does a lot of array indexing.

My problem was that anything I tried either ran quite a bit slower than the original C or ended up using "unsafe" or caused clippy to complain about my style. I like to keep clippy happy. That is despite a long discussion about it and many suggestions in the thread: Clippy driving me to insanity, insisting on iterators

The problem was to translate this C code:

#include <stddef.h>
#include <stdint.h>

/* MAX and BLOCK_SIZE are compile-time constants defined elsewhere. */
void do_it (int32_t a[MAX][MAX], int32_t b[][MAX]) {
    for (size_t i = 0; i < MAX; i += BLOCK_SIZE) {
        for (size_t j = 0; j < MAX; j += BLOCK_SIZE) {
            for (size_t ii = i; ii < i + BLOCK_SIZE; ii++) {
                for (size_t jj = j; jj < j + BLOCK_SIZE; jj++) {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }
}

Following your inspiration I now have this translation:

fn do_it (a: &mut Array, b: &Array) {
    for (_, i) in (0..(MAX / BLOCK_SIZE)).enumerate() {
        for (_, j) in (0..(MAX / BLOCK_SIZE)).enumerate() {
            for (_, ii) in ((i * BLOCK_SIZE)..((i * BLOCK_SIZE) + BLOCK_SIZE)).enumerate() {
                for (_, jj) in ((j * BLOCK_SIZE)..((j * BLOCK_SIZE) + BLOCK_SIZE)).enumerate() {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }
}

Finally, it runs at exactly the speed of the C, with no "unsafe" and no clippy complaints.

It has been a long road. All the suggested solutions can be seen here: https://github.com/ZiCog/loop_blocking

Given that nobody came up with a safe, clippy-happy, and performant solution in the original lengthy discussion, I would say that such tricks to get optimal code out of Rust are pretty obscure.

3 Likes

@parasyte Thanks for all the tips and insights! I am looking into refactoring my original program into separate bin and library crates and making use of criterion.

I did some testing with your fast and slow mode examples: I can reproduce your results using criterion. However, using manual benchmarking, both versions perform the same for me. My timing is about 0.6 ns/iter (in line with criterion fast mode timing of 672 ns / 1000 iteration loops).

One thing that is not intuitive to me is why the compiler would be able to optimize out bounds checking (if that really is the source of difference in the criterion timings) when you do y %= 64; but not when the modulo appears directly in the array index, zero[y % 64].

What do you mean by "manual benchmarking?"

It isn't optimizing out bounds checking. It is not capable of optimizing out the modulo operator, which executes on every iteration unnecessarily. You can see this in the generated assembly, along with some loop unrolling.

1 Like

The compiler does indeed optimize out the bounds check when the modulo appears directly in the array index. Check the assembly I posted above; there's no panic call.

Not the best terminology, but I meant just calling from main the handwritten timing function that I originally posted. To be clear, my latest version of this is below, with a few alternatives commented out. On my computer, the "fast mode" runs at the same speed as the "slow mode" in this timing function. Also, I originally had a type conversion using as usize; removing it had a significant impact on performance.

The comments from @Hyeonu about bounds checking being optimized out are consistent with my experience using the handwritten timing function, in particular getting the same performance when explicitly using get_unchecked. I'm not sure why the difference shows up in criterion but not in my timing function.

use std::time::Instant;

pub fn benchmark_bitmask(iterations: u64) {
    let zero: [u64; 64] = [0; 64];
    let val: u64 = 123456;
    
    let mut y = 0;
    let start_time = Instant::now();
    
    for _ in 0..iterations {
        if (zero[(y % 64)] & val) == 0 {
            y += 1;
        }

        // "Fast" version, but performance is same
        // if (zero[y] & val) == 0 {
        //     y += 1;
        //     y %= 64;
        // }

        // Explicitly use get_unchecked.  Same performance.
        // if (unsafe { *zero.get_unchecked(y % 64) } & val) == 0 {
        //     y += 1;
        // }

        // Original version with type conversion.  Almost twice as slow.
        // if (zero[(y % 64) as usize] & val) == 0 {
        //     y += 1;
        // }

    }

    let duration = start_time.elapsed();

    println!("y: {}", y);
    println!("benchmark: {} iterations in {}ms ({:.7}ns per iter)", iterations, duration.as_millis(), (duration.as_secs_f32() / (iterations as f32)) * 1e9);
}

I guess it depends on the number you are passing for iterations. Criterion is kind of smart about how it measures time. For example, you can remove the loop entirely and it will still provide a reasonable result (it runs the function under test with multiple iterations of its own).
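
For instance, with the same imports as the criterion benchmark above, a bench function whose body contains no loop of its own still produces a sensible measurement (the single_step_benchmark name is just for illustration):

pub fn single_step_benchmark(c: &mut Criterion) {
    // No inner loop: Criterion calls the closure as many times as it needs
    // to gather statistically meaningful timings.
    c.bench_function("single step", |b| b.iter(|| fast_mode(black_box(1))));
}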

I just want to confirm: are you seeing a meaningful difference between this and the more straightforward:

for i in 0..(MAX / BLOCK_SIZE) {

? If so, color me utterly surprised!

2 Likes

Ah, but there is the magical thing, marcianx. There is no difference. In fact the compiler produces exactly the same instructions for both. And brilliantly, the compiler removes the duplicate code from my binary and calls the same thing for both. How about that?!

The difference is that the straightforward version causes clippy to complain like so:

warning: the loop variable `ii` is used to index `a`
  --> src/main.rs:47:23
   |
47 |             for ii in (i * BLOCK_SIZE)..((i * BLOCK_SIZE) + BLOCK_SIZE) {
   |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: `#[warn(clippy::needless_range_loop)]` on by default
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_range_loop
help: consider using an iterator

and requires "#[allow(clippy::needless_range_loop)]" to shut it up. As I was searching for "idiomatic Rust" I wanted clippy to be happy.

So the actual variants of the function are:

#[allow(clippy::needless_range_loop)]
fn transpose_2(a: &mut Array, b: &Array) {
    for i in 0..(MAX / BLOCK_SIZE) {
        for j in 0..(MAX / BLOCK_SIZE) {
            for ii in (i * BLOCK_SIZE)..((i * BLOCK_SIZE) + BLOCK_SIZE) {
                for jj in (j * BLOCK_SIZE)..((j * BLOCK_SIZE) + BLOCK_SIZE) {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }
}

fn transpose_3(a: &mut Array, b: &Array) {
    for (_, i) in (0..(MAX / BLOCK_SIZE)).enumerate() {
        for (_, j) in (0..(MAX / BLOCK_SIZE)).enumerate() {
            for (_, ii) in ((i * BLOCK_SIZE)..((i * BLOCK_SIZE) + BLOCK_SIZE)).enumerate() {
                for (_, jj) in ((j * BLOCK_SIZE)..((j * BLOCK_SIZE) + BLOCK_SIZE)).enumerate() {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }
}

Looks like the complication of using .enumerate() on the two outer loops is not actually required to keep clippy quiet, but it's there for consistency's sake.

These two bits of code are not the same. The latter seems to have a bug: ii is going from 0 to BLOCK_SIZE due to the enumerate() and not from (i * BLOCK_SIZE) to ((i * BLOCK_SIZE) + BLOCK_SIZE) as intended (so yeah, it'll get a lot of cache hits there). You might as well have written just for ii in 0..BLOCKSIZE {..} for that matter. (Oops, index comes first; so I was wrong. See below.)

Honestly, if your library/binary is doing a lot of non-contiguous matrix indexing such as this, you could just do

#![allow(clippy::needless_range_loop)]

once in lib.rs/main.rs and call it a day. After all, clippy is supposed to help improve readability, not hinder it, and perhaps it just has too many false positives in certain classes of code to be helpful there.

I know you've mentioned readability (in particular, your ability to read the code) many times before. There's nothing particularly wrong with optimizing for yourself and recognizing clippy for what it is: a not-always-helpful but well-meaning backseat driver. Chances are good that there are plenty of others who would find your preferred variation far more readable than the enumerate() workaround (if it worked), which obscures the meaning a little bit.

4 Likes

Oh yes they are.

In a PM parasyte mentioned they were the same and showed me how to do this:

$ cargo asm loop_blocking::transpose_2
loop_blocking::transpose_2:
 push    rbp
 push    r15
 push    r14
 push    r12
 ...
 A bunch of instructions here...
 ...
 pop     rbx
 pop     r12
 pop     r14
 pop     r15
 pop     rbp
 ret

If I then try that for transpose_3:

$ cargo asm loop_blocking::transpose_3
[ERROR]: could not find function at path "loop_blocking::transpose_3" in the generated assembly.
Is it one of the following functions?

  loop_blocking::transpose_0
  loop_blocking::transpose_1
  ...

If I comment each one out in turn, I can save the asm of both and diff them. They are identical except for the name at the top!

I guess parasyte was inspired to dump that asm as he had noticed the run times were identical.

I think you are right about the clippy directive.

Remember that as a Rust neophyte I was (and still am) searching for hints and tips as to the right way to do things, the "idiomatic way" as they say, all over: the docs, this forum, around the net, and of course clippy. Also I was having a terrible time with the rest of that code, like how to create those arrays in the first place, which is non-trivial in Rust.

I'm surprised that nobody popped up to propose a solution like the one I now have while my thread discussing it was going on. The suggestions were all either slow, unsafe, or such ugly code that I would never dream of having it in any code base of mine.

As it stands it's a toss-up between the two versions: use the directive to keep clippy quiet, or add that little bit of verbosity. I could live happily with either of them.

Ah sorry, I was mistaken. You're absolutely right. I've been doing javascript for so many years that I expected the index to come second, not first. So, yes these two are equivalent as you asserted.

That's because it didn't occur to folks that clippy wouldn't be smart enough to throw another false positive here. :wink: All the enumerate() workaround does is fly under clippy's radar; it doesn't actually address clippy's original (misguided) concern. This workaround is only as good as clippy's inability to evolve to see through it in the future.

IMO, you might be able to file a clippy bug asking them not to throw such a warning in cases like this, since the presence of b[jj][ii] clearly makes clippy's suggestion ergonomically impossible. And your problem might go away in a release or few.

In the meantime, I personally find transpose_2 much more readable than transpose_3. My 2c is to mute the warning.

2 Likes

If you think it is an issue worthy of bothering people with a bug report, where is the place to file it?

I'm all for the transpose_2 version as well.

Presumably here.

Ah yes, thanks. Now I just have to think of a sensible way to phrase the issue...