Performance of array access vs C

Hi,

I'm somewhat new to Rust and have a question about timing and performance vs C. I'm trying to benchmark some very simple code that does a lookup in an array and then applies the bitwise AND operator. Code is below. I've re-implemented the same benchmark function in C, and I seem to get about a factor of 4 difference in performance (elapsed time using a large iteration count such as 1E9).

I have also tried the comparison using just an array lookup, without the bitwise AND (see the commented-out line), which results in much more similar performance (although C is still slightly faster). This seems to indicate that the bitwise AND performs significantly differently in Rust than in C. I'm scratching my head about this, since it's surprising that the Rust and C versions wouldn't compile to nearly identical code. However, I could certainly be missing something. Any insights would be appreciated.

Details:

  • OS: Linux
  • Rust 1.43.1 (running with --release)
  • gcc 10.1.0 (compiled with -O2)

Context: As a way to learn Rust, I wrote a chess engine following a guide for C. Overall, the program is around 30% slower than the C version, and I'm now going back through and trying to find parts that can be optimized. Based on profiling, a significant portion of the run time is spent in the "evaluation" function, which calculates a score for a given board position (during search, this function is called many times). A big part of the evaluation function involves using bitmasks to check for things like isolated pawns. So the example code below is representative of some of what actually goes on in the chess engine, and I suspect that any Rust-vs-C performance differences in this simple example are also manifesting in the performance of the overall chess engine.

Thanks,
John

use std::time::Instant;

pub fn benchmark_bitmask(iterations: u64) {
    let zero: [u64; 64] = [0; 64];
    let val: u64 = 123456;
    
    let mut y = 0;
    let start_time = Instant::now();
    
    for i in 0..iterations {
        if (zero[(y % 64) as usize] & val) == 0 {
        // if zero[(y % 64)] == 0 {
            y += 1;
        }
    }

    let duration = start_time.elapsed();
    println!("y: {}", y);
    println!("benchmark: {} iterations in {}ms ({:.7}us per iter)", iterations, duration.as_millis(), (duration.as_secs_f32() / (iterations as f32)) * 1e6);
}

And the C version:

#include <stdio.h>
#include <sys/time.h>

typedef unsigned long long U64;
void benchmark_bitmask(U64 iterations) {
  U64 zero[64];
  for (int i=0; i<64; i++)
    zero[i] = 0;
  U64 val = 123456; 
  
  struct timeval tval_before, tval_after, tval_result;
  gettimeofday(&tval_before, NULL);

  long y = 0;

  for (U64 i=0; i<iterations; i++) {
    if ((zero[y % 64] & val) == 0) {
    /* if (zero[y % 64] == 0) { */
      y++;
    }
  }
  printf("y: %d\n", y);

  gettimeofday(&tval_after, NULL);
  timersub(&tval_after, &tval_before, &tval_result);  

  double elapsed = tval_result.tv_sec + tval_result.tv_usec / 1e6;

  printf("benchmark: %ld iterations in %lf s (%lf us per iter)\n", iterations, elapsed, elapsed/( (double) iterations) * 1e6);
}

I do not think you are measuring only the bitwise operation here. You fetch a number from the array by index, so if Rust does index checking, you pay a penalty compared with the C code, which does no range checks.

That is what I was thinking initially, and it is why I had also tried:

        if zero[(y % 64)] == 0 {

That was giving me a Rust-vs-C difference of closer to a factor of 2 (instead of 4), so it had me thinking that the bitwise AND must be accounting for the difference to some degree. However, if I isolate the bitwise AND only:

        if i & val == 0 {

Then the Rust-vs-C timing is very close (around a 20% difference), so that would not seem to be the culprit. It's odd to me that I'm getting a factor-of-4 difference with the array+AND code but only a factor of 2 with the array-only code. If the performance difference is mostly array overhead, I'm not sure why those two cases should differ at all.

I'm really surprised how often ad hoc benchmarking and bespoke profiling get used in these kinds of questions. To start with, you should use a benchmarking tool that is designed with statistical analysis in mind; criterion and bencher are good infrastructural pieces to start with. (I have also heard good things about llvm-mca, but I have not used it myself!)

Second tip is to know your code generator. Use cargo-asm or the Godbolt compiler explorer to inspect the machine code generated from your source.

The primary reason your Rust code is not performing as well as you expect is because it has bounds checks on each iteration of the loop. You could potentially remove bounds checks by iterating over the array instead of iterating over a range. This is where inspecting the generated machine code comes in handy.
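For example, something like this (just a sketch, not a drop-in replacement, since your real index depends on y):

fn count_matches(zero: &[u64], val: u64) -> u64 {
    // the iterator can never yield an out-of-range element, so there is
    // no per-iteration bounds check for the compiler to leave in
    zero.iter().filter(|&&x| x & val == 0).count() as u64
}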

The last thing I could mention is that this particular algorithm does not vectorize well. But with a transposed data structure, maybe it could?


Don't forget about overflow checks on y as well (they depend on the profile: off by default in release, but they can be enabled).
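If they show up in your profile, wrapping arithmetic is an explicit way to opt out (a sketch):

        y = y.wrapping_add(1); // no overflow-check panic path, even with debug assertions enabled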


Thanks, I do need to look into these more. I started to try out criterion on the original code but got discouraged when I found that it can't be used on binary crates. I may revisit it later if I have more time. That still wouldn't address the cross-language comparison, though. Since I essentially have two parallel implementations of the same program (the chess engine), it seems like I should be able to leverage that to track down which parts of the Rust code could be optimized, by comparing the pieces against C together with profiling. That is part of the reason for the ad hoc benchmarking.

I'm not sure exactly what you mean by bounds checking on each iteration. Are you referring to the array access? The iteration shown in the example code is not actually part of the motivating algorithm (which is more or less randomly accessing array elements), but array access is. I just added the iteration loop for timing purposes.

Thanks, good point. The only reason it's in there at all is to try and prevent the compiler from optimizing the whole thing away.

Bounds checking means verifying, every time you access data in an array, that the index you are using is within the bounds of the array. In your case the zero array is 64 elements long. If you did zero[67] in C, it would happily read a piece of memory past the end of your array, and your program would carry on as if nothing strange had happened. In Rust, the default is to check on every array access that the index actually points to an element inside the array. So zero[67] in Rust would cause the program to panic at runtime and tell you that you've accessed a value out of the bounds of the array. Rust performs this check on every array access, which is naturally slower than just assuming everything is fine.
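To make that concrete (a small sketch):

let zero = [0u64; 64];
let i: usize = 67;
// zero[i] would panic: "index out of bounds: the len is 64 but the index is 67"
// zero.get(i) is the checked alternative; it returns None instead of panicking
assert!(zero.get(i).is_none());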

The way around this in Rust is to use get_unchecked or get_unchecked_mut. These functions basically tell the compiler: I know what I'm doing; take my word for it that the index points inside the array. They are unsafe, so you need to wrap the call in an unsafe block. In your code, it would look like this:

if (unsafe { zero.get_unchecked(y % 64) } & val) == 0 {
    y += 1;
}

In theory, if the compiler can determine that an index is always in bounds, it will eliminate the bounds check on its own, but in this case it didn't, which is why you need to tell it to skip the check manually.

Or you could formulate your code so that rustc can check, before entering a loop, that each index you will generate is within bounds, in which case those checks precede the loop and are removed from each iteration. Per-access index and overflow checks can trigger panics at each iteration, which stops rustc from vectorizing your code. Doing the checks before the loop, and formulating the algorithm to avoid potential overflow (for example by using wrapping arithmetic), can in many cases enable rustc to vectorize your code.
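As a sketch of that shape (names assumed, not your exact benchmark): slicing once up front performs the range check before the loop, and wrapping arithmetic removes the overflow panic path, leaving a loop body that cannot panic:

fn sum_window(data: &[u64], start: usize, len: usize) -> u64 {
    let window = &data[start..start + len]; // single range check, before the loop
    let mut total: u64 = 0;
    for &x in window {
        total = total.wrapping_add(x); // no per-iteration panic path
    }
    total
}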


Nice, I didn't know about these. I tried swapping them into my example benchmarking code, and I don't get any difference. Maybe the compiler has already optimized out the bounds check here. I will need to take a look at whether that is also the case in my real code.

However, I did come across something interesting: I realized that as usize wasn't needed in this example. Removing it resulted in a factor-of-2 performance increase. That is why I thought there was a difference coming from the bitwise AND (I had as usize in one case but not the other). This is a little worrisome, though, because I make extensive use of as usize in the real code: I am indexing with things like u8 and enums, which require the conversion. I had tentatively assumed that such a conversion would not have any real performance penalty, but this doesn't seem to be the case.

What I have at this point is that both of the following versions perform the same for me, and each is about a factor of two away from C. This would seem to rule out bounds checking as the source of any performance difference.

        if (zero[(y % 64)] & val) == 0 {

or

        if unsafe { (zero.get_unchecked(y % 64) & val) == 0 } {

Well, now I'm thinking I probably need to go back to benchmarking the real code instead of this simplified example. I'm seeing some odd things in the performance of the C code depending on what values I initialize the array with. When I initialize with zero[i] = i, the C performance suddenly gets very bad (5 times slower than Rust). And with zero values, the C code runs faster for if ((zero[y % 64] & val) == 0) than it does for if ((i & val) == 0) (the opposite of what I would expect). There may be some unique factors in how the compilers are handling this example code, which may not have anything to do with what's happening in the real code.

In contrived examples, compilers can sometimes optimize away significant portions of a calculation which makes benchmarking completely pointless. When benchmarking, you should use representative data that isn't just statically defined in the program.
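For example, something like this (a sketch; eval_position stands in for whatever you are really measuring):

c.bench_function("eval", |b| {
    // data built at runtime and hidden behind black_box, so the optimizer
    // cannot constant-fold the whole computation away
    let board: Vec<u64> = (1..=64).collect();
    b.iter(|| eval_position(black_box(&board)))
});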

You should avoid using unsafe unless it is absolutely necessary. There are several ways to prove to rustc that it doesn't need to add bounds checks without letting your guard down.

pub fn slow_mode(iterations: usize) -> usize {
    let zero: [u64; 64] = [0; 64];
    let val: u64 = 123456;

    let mut y = 0;

    for _ in 0..iterations {
        if (zero[y % 64] & val) == 0 {
            y += 1;
        }
    }

    y
}

pub fn fast_mode(iterations: usize) -> usize {
    let zero: [u64; 64] = [0; 64];
    let val: u64 = 123456;

    let mut y = 0;

    for _ in 0..iterations {
        if (zero[y] & val) == 0 {
            y += 1;
            y %= 64;
        }
    }

    y
}

slow_mode is identical to the code you originally offered. fast_mode is the same algorithm without bounds checks. You can see how I removed the bounds checks; the compiler can statically guarantee that the index will be within the array bounds because its value is clamped at the same time it is incremented.

Here's a benchmark:

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use bench_bool::{fast_mode, slow_mode};

pub fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("fast mode", |b| b.iter(|| fast_mode(black_box(1000))));
    c.bench_function("slow mode", |b| b.iter(|| slow_mode(black_box(1000))));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

And the results on my laptop (i7-7920HQ CPU @ 3.10GHz):

fast mode               time:   [669.13 ns 671.93 ns 674.93 ns]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

slow mode               time:   [2.5728 us 2.5898 us 2.6081 us]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

Not a bad improvement, just by moving a single operation, huh!

Glad to hear you were interested in criterion! I've played with it enough to be somewhat comfortable with it. Definitely my tool of choice for writing and running microbenchmarks. I have another tip for you! I started following a pattern that is pretty common in the Rust ecosystem where the bulk of application logic lives in a library crate, and the binary only does minimal work; mostly just argument parsing and some top-level error handling. Then it calls into the library. This also lends itself well to testing (and benchmarking) because your integration tests and benchmarks can simply pull in the parts of your library that you are interested in testing.
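In practice the layout looks something like this (a sketch; names assumed):

chess/
├── Cargo.toml
├── src/
│   ├── lib.rs      evaluation, search, bitboard logic
│   └── main.rs     thin binary: parse args, call into the library
└── benches/
    └── eval.rs     criterion benchmarks, pulling in the library with use chess::...;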

Well, maybe it won't directly address cross-language testing, but Rust already has adequate support for FFI! You shouldn't have much difficulty linking your library crate with a C static lib. It's literally just an import, so your benchmarks would be calling the C function directly without any overhead.
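A sketch of what that can look like (assuming the C benchmark above is compiled into a static library and linked in, e.g. from a build script):

extern "C" {
    // matches the C declaration: void benchmark_bitmask(U64 iterations)
    fn benchmark_bitmask(iterations: u64);
}

fn run_c_benchmark(iterations: u64) {
    // unsafe because the compiler must take our word that the signature matches
    unsafe { benchmark_bitmask(iterations) }
}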


@parasyte Is that a known trick to get the compiler to optimize out the bounds check in this case? Or is that just something to try to see if it works?

Along those lines, are the various ways to hint to the compiler to eliminate bounds checks without unsafe written down anywhere?


I would say it's pretty well-known. This is exactly what @TomP suggested earlier. Some other examples would be inserting assert!() in strategic locations. I am not aware of any documentation in this regard, just information scattered in various blogs and LLVM white papers and so on.
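For example (a sketch): a single assert before a loop can be enough for rustc to drop the per-iteration checks:

fn sum_prefix(data: &[u64], n: usize) -> u64 {
    assert!(n <= data.len()); // one check up front...
    let mut total: u64 = 0;
    for i in 0..n {
        // ...lets the bounds check on data[i] be elided inside the loop
        total = total.wrapping_add(data[i]);
    }
    total
}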

It may be worth reiterating that this particular algorithm cannot be easily vectorized; the index is conditional upon previously visited elements, so it is strictly sequential. One might choose to vectorize it by hand, computing both the branching and non-branching cases at the same time for multiple iterations, but that's another exercise, and who knows if it will actually work all that well!
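For instance, the branch in the original benchmark can be turned into arithmetic (a sketch; whether it actually wins is something to measure, since y is still a loop-carried dependency):

fn branchless(zero: &[u64; 64], val: u64, iterations: u64) -> usize {
    let mut y: usize = 0;
    for _ in 0..iterations {
        // add the comparison result directly instead of branching on it
        y += ((zero[y % 64] & val) == 0) as usize;
    }
    y
}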

In all honesty, I actually verified there was a difference in the generated code before I ran the benchmarks. When I observed what looked like a promising modification, I validated it. Just the good old scientific method, really; hypothesize, observe, experiment, record findings.

On closer inspection, I don't see any bounds checks in either function. fast_mode does allow the loop to unroll better, though, and eliminates a lot of unnecessary and instructions (the y % 64 is transformed into and eax, 63).


The bounds check is optimized out, so the differences come from other parts.


Wow, parasyte, thank you. Your little trick there inspired me to revisit some Rust conversion of C code I was toying with a while ago that does a lot of array indexing.

My problem was that everything I tried either ran quite a bit slower than the original C, ended up using "unsafe", or caused clippy to complain about my style. I like to keep clippy happy. That is despite a long discussion about it and many suggestions in the thread: Clippy driving me to insanity, insisting on iterators.

The problem was to translate this C code:

void do_it (int32_t a[MAX][MAX], int32_t b[][MAX]) {
    for (size_t i = 0; i < MAX; i += BLOCK_SIZE) {
        for (size_t j = 0; j < MAX; j += BLOCK_SIZE) {
            for (size_t ii = i; ii < i + BLOCK_SIZE; ii++) {
                for (size_t jj = j; jj < j + BLOCK_SIZE; jj++) {
                    a[ii][jj] += b[jj][ii];
                }
             }
        }
    }
}

Following your inspiration I now have this translation:

fn do_it (a: &mut Array, b: &Array) {
    for (_, i) in (0..(MAX / BLOCK_SIZE)).enumerate() {
        for (_, j) in (0..(MAX / BLOCK_SIZE)).enumerate() {
            for (_, ii) in ((i * BLOCK_SIZE)..((i * BLOCK_SIZE) + BLOCK_SIZE)).enumerate() {
                for (_, jj) in ((j * BLOCK_SIZE)..((j * BLOCK_SIZE) + BLOCK_SIZE)).enumerate() {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }
}

Finally, it runs at exactly the speed of the C, with no "unsafe" and no clippy complaints.

It has been a long road. All the suggested solutions can be seen here: https://github.com/ZiCog/loop_blocking

Given that nobody came up with a safe, clippy-happy, and performant solution in the original lengthy discussion, I would say that such tricks for getting optimal code out of Rust are pretty obscure.


@parasyte Thanks for all the tips and insights! I am looking into refactoring my original program into separate bin and library crates and making use of criterion.

I did some testing with your fast and slow mode examples: I can reproduce your results using criterion. However, with manual benchmarking, both versions perform the same for me. My timing is about 0.6 ns/iter (in line with criterion's fast-mode timing of 672 ns per 1000-iteration loop).

One thing that is not intuitive to me is why the compiler would be able to optimize out bounds checking (if that really is the source of the difference in the criterion timings) when you do y %= 64; but not when the modulo appears directly in the array index, zero[y % 64].

What do you mean by "manual benchmarking?"

It isn't about optimizing out bounds checking. The compiler is not capable of optimizing out the modulo operation, which executes unnecessarily on every iteration. You can see this in the generated assembly, along with some loop unrolling.


The compiler does indeed optimize out the bounds check when the modulo appears directly in the array index. Check the assembly I posted above; there's no panic call.