Performance of array access vs C

Not the best terminology, but I meant just calling from main the handwritten timing function that I originally posted. To be clear, my latest version of this is below, with a few alternatives commented out. On my computer, the "fast mode" runs the same speed as the "slow mode" in this timing function. Also, I originally had a type conversion using as usize, which had a significant impact on performance once I removed it.

The comments from @Hyeonu about bound checking being optimized out are consistent with my experience using the handwritten timing function, in particular getting the same performance when explicitly using get_unchecked. I'm not sure why the difference shows up in criterion but not in my timing function.

pub fn benchmark_bitmask(iterations: u64) {
    let zero: [u64; 64] = [0; 64];
    let val: u64 = 123456;
    
    let mut y = 0;
    let start_time = Instant::now();
    
    for i in 0..iterations {
        if (zero[(y % 64)] & val) == 0 {
            y += 1;
        }

        // "Fast" version, but performance is same
        // if (zero[y] & val) == 0 {
        //     y += 1;
        //     y %= 64;
        // }

        // Explicitly use get_unchecked.  Same performance.
        // if unsafe{ *zero.get_unchecked(y % 64) == 0 } {
        //     y += 1;
        // }

        // Original version with type conversion.  Almost twice as slow.
        // if (zero[(y % 64) as usize] & val) == 0 {
        //     y += 1;
        // }

    }

    let duration = start_time.elapsed();

    println!("y: {}", y);
    println!("benchmark: {} iterations in {}ms ({:.7}ns per iter)", iterations, duration.as_millis(), (duration.as_secs_f32() / (iterations as f32)) * 1e9);
}

I guess it depends on the number you are passing for iterations. Criterion is kind of smart about how it measures time. For example, you can remove the loop entirely and it will still provide a reasonable result (it runs the function under test with multiple iterations of its own).

I just want to confirm: are you seeing a meaningful difference between this and the more straightforward:

for i in 0..(MAX / BLOCK_SIZE) {

? If so, color me utterly surprised!

2 Likes

Ah but there is the magical thing marcianx. There is no difference. In fact the compiler produces exactly the same instructions for both. And brilliantly the compiler removes the duplicate code from my binary and calls the same thing for both. How about that?!

The difference is that the straightforward version causes clippy to complain like so:

warning: the loop variable `ii` is used to index `a`
  --> src/main.rs:47:23
   |
47 |             for ii in (i * BLOCK_SIZE)..((i * BLOCK_SIZE) + BLOCK_SIZE) {
   |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: `#[warn(clippy::needless_range_loop)]` on by default
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_range_loop
help: consider using an iterator

and requires "#[allow(clippy::needless_range_loop)]" to shut it up. As I was searching for "idiomatic Rust" I wanted clippy to be happy.

So the actual variants of the function are:

#[allow(clippy::needless_range_loop)]
fn transpose_2(a: &mut Array, b: &Array) {
    for i in 0..(MAX / BLOCK_SIZE) {
        for j in 0..(MAX / BLOCK_SIZE) {
            for ii in (i * BLOCK_SIZE)..((i * BLOCK_SIZE) + BLOCK_SIZE) {
                for jj in (j * BLOCK_SIZE)..((j * BLOCK_SIZE) + BLOCK_SIZE) {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }
}

fn transpose_3(a: &mut Array, b: &Array) {
    for (_, i) in (0..(MAX / BLOCK_SIZE)).enumerate() {
        for (_, j) in (0..(MAX / BLOCK_SIZE)).enumerate() {
            for (_, ii) in ((i * BLOCK_SIZE)..((i * BLOCK_SIZE) + BLOCK_SIZE)).enumerate() {
                for (_, jj) in ((j * BLOCK_SIZE)..((j * BLOCK_SIZE) + BLOCK_SIZE)).enumerate() {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }
}

Looks like the complication of using .enumerate() on the two outer loops is not actually required to keep clippy quite, but it's there for consistencies sake.

These two bits of code are not the same. The latter seems to have a bug: ii is going from 0 to BLOCK_SIZE due to the enumerate() and not from (i * BLOCK_SIZE) to ((i * BLOCK_SIZE) + BLOCK_SIZE) as intended (so yeah, it'll get a lot of cache hits there). You might as well have written just for ii in 0..BLOCKSIZE {..} for that matter. (Oops, index comes first; so I was wrong. See below.)

Honestly, if your library/binary is doing a lot of non-contiguous matrix indexing such as this, you could just do

#![allow(clippy::needless_range_loop)]

once in lib.rs/main.rs and call it a day. After, all clippy is supposed to help improve readability, not hinder it and perhaps it just has too many false positives in certain classes of code to be helpful there.

I know you've mentioned readability (in particular, your ability to read the code) many times before. There's nothing particularly wrong with optimizing for yourself and recognize clippy for what it is—a not-always-helpful but well-meaning backseat driver. Chances are good that there are plenty others who would find your preferred variation far more readable than the enumerate() workaround (if it worked), which obscures the meaning a little bit.

4 Likes

Oh yes they are.

In a PM parasyte mentioned they were the same and showed me how do do this:

$ cargo asm loop_blocking::transpose_2
loop_blocking::transpose_2:
 push    rbp
 push    r15
 push    r14
 push    r12
 ...
 A bunch of instructions here...
 ...
 pop     rbx
 pop     r12
 pop     r14
 pop     r15
 pop     rbp
 ret

If I then try that for transpose_3:

$ cargo asm loop_blocking::transpose_3
[ERROR]: could not find function at path "loop_blocking::transpose_3" in the generated assembly.
Is it one of the following functions?

  loop_blocking::transpose_0
  loop_blocking::transpose_1
  ...

If I comment each one out in turn and do that I can save the asm of both and diff it. The diffs are the same except for the name at the top!

I guess parasyte was inspired to dump that asm as he had noticed the run times were identical.

I think you are right about the clippy directive.

Remember that a Rust neophyte I was (still am) searching for hints and tips as to the right way to do things, the "idiomatic way" as they say, all over, the docs, this forum, around the net and of course clippy. Also I was having a terrible time with the rest of that code, like how to create those arrays in the first place, which is non-trivial in Rust.

I'm surprised that nobody here popped up to propose a solution like I now have when my thread discussing it was going on. They were all either slow, unsafe, or such ugly code I would never dream of having it in any code base of mine.

As it stands it's a toss up between the two versions, use the directive to keep clippy quite or add that little bit of verbosity. I could live happily with either of them.

Ah sorry, I was mistaken. You're absolutely right. I've been doing javascript for so many years that I expected the index to come second, not first. So, yes these two are equivalent as you asserted.

That's because it didn't occur to folks that clippy wasn't smart enough to give you another false positive here. :wink: All the enumerate() workaround does is fly under clippy's radar—it doesn't actually address clippy's original (misguided) concern. This workaround is only as good as clippy's inability to evolve to see through it in the future.

IMO, you might be able to file a clippy bug and tell them to not throw such a warning in such cases since the presence of b[jj][ii] clearly doesn't make clippy's suggestion ergonomically possible. And your problem might go away in a release or few.

In the meantime, I personally find transmute_2 much more readable than transmute_3. My 2c is to mute the warning.

2 Likes

If you think it is an issue worthy of bothering people with a bug report, where is the place to file it?

I'm all for the transpose_2 version as well.

Presumably here.

Ah yes, thanks. Now I just have to think of a sensible way to phrase the issue...

Without derailing the thread too much from its original question, I will also second the "just disable the lint" suggestion. It's ok to do this, particularly if you know better than clippy, or if the lint has some limitations (or bugs).

I ran into a similar false positive warning in a project I was exploring, where clippy complained about a boolean wrapped in a Mutex, even though it was used in coordination with a Condvar, identically to what is documented in the standard library. A closer look at the clippy docs warns that this lint has a known problem:

This lint cannot detect if the mutex is actually used for waiting before a critical section.

A reasonable conclusion is that the lint needs to be disabled in this instance.

You don't get the same benefit from the clippy docs in your specific case, but this is also a good opportunity to improve the documentation with new insight. This can now be addressed as a known issue, and that can be backed by what looks like a definitive example that is hard to improve upon.

1 Like

I'm not married to clippy. Just a Rust noob exploring the best ways to do things. Which of course includes listening to clippy's advice and accepting what cargo fmt does.

I'd rather everyone adopted common code patterns/idioms and formatting style even if they don't like some aspects of it rather than everyone going off and doing their own things for no useful purpose. For the sake of ease of readability comprehension by the global community at large.

There will be glaring exceptions, hopefully only occasionally, where one has to say "That's nuts, I won't do it".

1 Like