Clippy driving me to insanity, insisting on iterators

Sure, I don't mean you should be rude. I just wouldn't spend much time worrying about what C programmers (who presumably don't know Rust) think of my Rust code. :man_shrugging:

3 Likes

I guess I have mixed up a few problems here. 'Problems' is perhaps too strong a word, rather 'puzzles' for the curious mind. None of this is an actual problem for anything I'm currently doing with Rust.

It all starts with wanting to do in Rust what is done in C in this Intel article: Intel Developer Zone.

The puzzles that arise are:

  1. Yes, the 49241 microseconds of Rust is significantly slower than the 44361 microseconds of C.

  2. On ARM it's a lot worse: 795426 us for C, 2860273 us for Rust! What is that about?

  3. The fastest Rust solution is not "Clippy approved": Clippy complains about "needless_range_loop". The question then is: can this be done in a Clippy friendly manner? (One possible shape is sketched just after this list.)

  4. The Rust solution I came up with that Clippy likes is a nightmare: horribly verbose, obfuscated and impossible to understand. It's also much slower. I certainly don't like that style of Rust code. Likely that is my fault though; is there a better way?

  5. What if we forget the C example code and just produce the equivalent functionality in Rust, no matter how it is done? How would a Rust guru do that?
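
For puzzle 3, one possible shape is a blocked version written with iterator adaptors, so that needless_range_loop has nothing to fire on. A minimal sketch (assuming MAX x MAX arrays of f32; whether it is actually fast is exactly the question):

const MAX: usize = 4096;
const BLOCK_SIZE: usize = 64;

// Sketch only: blocked add-of-transpose using iterator adaptors instead of
// bare index ranges. enumerate() keeps the row/column indices that are
// still needed for the transposed access into b.
fn do_it_blocked_iter(a: &mut [[f32; MAX]; MAX], b: &[[f32; MAX]; MAX]) {
    for i in (0..MAX).step_by(BLOCK_SIZE) {
        for j in (0..MAX).step_by(BLOCK_SIZE) {
            for (ii, row) in a.iter_mut().enumerate().skip(i).take(BLOCK_SIZE) {
                for (jj, elem) in row.iter_mut().enumerate().skip(j).take(BLOCK_SIZE) {
                    *elem += b[jj][ii];
                }
            }
        }
    }
}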

2 Likes

That is a wicked thought. So wicked in fact that I just could not resist trying it.

I ran my loop_blocking.c through c2rust, changed the name of the main it generated, and called it at the end of my Rust code.

The result is stunning:

On x86_64:

$ RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=native" cargo run  --release
...
MAX:        4096,
BLOCK_SIZE: 64,
do_it_0:    180838us
do_it_1:    183917us
do_it_2:    50366us
do_it_3:    67663us
do_it_4:    51039us
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    185558us
do_it_1:    37898us

$ clang  -Wall -O3 -o loop_blocking loop_blocking.c
$ ./loop_blocking
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    186156us
do_it_1:    37656us

$ gcc  -Wall -O3 -o loop_blocking loop_blocking.c
$ ./loop_blocking
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    183282us
do_it_1:    42386us

Yay! Rust is on a par with Clang and handily beats GCC.

On ARM aarch64:

$ RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=native" cargo run  --release
...
MAX:        4096,
BLOCK_SIZE: 64,
do_it_0:    3071584us
do_it_1:    3071831us
do_it_2:    2865702us
do_it_3:    3206544us
do_it_4:    2866663us
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    2615057us
do_it_1:    800239us

$ clang  -Wall -O3 -o loop_blocking loop_blocking.c
$ ./loop_blocking
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    2674021us
do_it_1:    833169us

$ gcc  -Wall -O3 -o loop_blocking loop_blocking.c
$ ./loop_blocking
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    2778817us
do_it_1:    826443us

Yay! Rust now beats both Clang and GCC!

What on Earth is going on here?

The c2rust generated code even looks quite nice:

pub unsafe extern "C" fn do_it_1(mut a: *mut [libc::c_float; 4096],
                                 mut b: *mut [libc::c_float; 4096]) {
    let mut i: libc::c_int = 0 as libc::c_int;
    while i < 4096 as libc::c_int {
        let mut j: libc::c_int = 0 as libc::c_int;
        while j < 4096 as libc::c_int {
            let mut ii: libc::c_int = i;
            while ii < i + 64 as libc::c_int {
                let mut jj: libc::c_int = j;
                while jj < j + 64 as libc::c_int {
                    (*a.offset(ii as isize))[jj as usize] +=
                        (*b.offset(jj as isize))[ii as usize];
                    jj += 1
                }
                ii += 1
            }
            j += 64 as libc::c_int
        }
        i += 64 as libc::c_int
    };
}
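
For completeness, the generated functions can also be called directly from ordinary Rust, roughly like this sketch (an assumption, not the actual glue code: it heap-allocates the two 64 MB matrices so they don't blow the stack, and relies on libc::c_float being f32):

// Sketch: calling the c2rust output on heap-allocated 4096 x 4096 arrays.
// Vec<[f32; 4096]> keeps the rows contiguous, and as_mut_ptr() yields the
// *mut [f32; 4096] that the generated signature expects.
fn call_generated() {
    let mut a = vec![[1.0f32; 4096]; 4096];
    let mut b = vec![[2.0f32; 4096]; 4096];
    unsafe { do_it_1(a.as_mut_ptr(), b.as_mut_ptr()) };
}
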
8 Likes

I might be writing a lot of Rust code in C at this rate :slight_smile:

2 Likes

I don't have an ARM chip around. I would be interested in seeing, as the code goes from @ZiCog's Rust to the c2rust Rust, which change made the difference on ARM.

2 Likes

Looking at the disassembly on the Rust Playground, it appears there are no bounds checks in this version -- there is no length information even associated with the two input pointers to bounds check against, and the indexing into the arrays they point to is not bounds checked because of their statically-known lengths.

My bet would be on the missing bounds checks speeding the ARM version up significantly, as one of the arguments I've heard for Rust's heavy bounds checking is "on a modern speculative processor, it shouldn't matter", and I'm guessing that ARM's processors are less powerful at speculation than Intel's are.
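
To illustrate the difference being talked about (just an illustration, not code from this thread): a slice index has a length to compare against and can panic, whereas the raw pointer offset in the c2rust output has no outer length to check at all.

fn via_slice(rows: &[[f32; 4096]], i: usize, j: usize) -> f32 {
    rows[i][j] // rows[i] is compared against rows.len() and may panic
}

unsafe fn via_pointer(rows: *const [f32; 4096], i: usize, j: usize) -> f32 {
    (*rows.add(i))[j] // no outer length to check; the caller must guarantee i is in range
}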

5 Likes

I thought I would try and investigate that. Many posts back @kornel suggested using 'unchecked get'. What I found was the 'unchecked_index' crate. So now I have yet another function, inspired by the while loops of the c2rust generated code and using 'unchecked_index'. Like so:

use unchecked_index::unchecked_index;

// T is the MAX x MAX matrix type defined elsewhere in the program.
fn do_it_5(a: &mut T, b: &T) {
    unsafe {
        let mut a = unchecked_index(a);
        let mut b = unchecked_index(b);
        let mut i: usize = 0;
        while i < MAX {
            let mut j: usize = 0;
            while j < MAX {
                let mut ii = i;
                while ii < i + BLOCK_SIZE {
                    let mut jj = j;
                    while jj < j + BLOCK_SIZE {
                        a[ii][jj] += b[jj][ii];
                        jj += 1
                    }
                    ii += 1
                }
                j += BLOCK_SIZE
            }
            i += BLOCK_SIZE
        }
    }
}

On the ARM I now have timings like so:

$ RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=native" cargo run  --release
...
MAX:        4096,
BLOCK_SIZE: 64,
do_it_0:    3077746us
do_it_1:    3077237us
do_it_2:    2861694us
do_it_3:    3209412us
do_it_4:    2863419us
do_it_5:    2833007us
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    2599840us
do_it_1:    794296us

Well, if that 'unchecked_index' has removed the checks in 'do_it_5', it is still slow as hell compared to the c2rust generated code (the second 'do_it_1' listed above).

Something else is up with Rust on ARM.

Perhaps we could start with the fast c2rust code, try to rewrite it into more "normal rust style", and see what change causes the code to massively slow down.
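
As a first step in that direction, one could keep the generated structure exactly and only swap the C types for Rust ones, something like the following sketch (just one possible starting point for the bisection):

// Sketch of a first bisection step: the same while loops and pointer
// offsets as the c2rust output, but taking references and usize counters
// instead of *mut pointers and c_int.
fn do_it_1_step(a: &mut [[f32; 4096]; 4096], b: &[[f32; 4096]; 4096]) {
    let a = a.as_mut_ptr(); // *mut [f32; 4096], as in the generated code
    let b = b.as_ptr();     // *const [f32; 4096]
    unsafe {
        let mut i = 0;
        while i < 4096 {
            let mut j = 0;
            while j < 4096 {
                let mut ii = i;
                while ii < i + 64 {
                    let mut jj = j;
                    while jj < j + 64 {
                        (*a.add(ii))[jj] += (*b.add(jj))[ii];
                        jj += 1;
                    }
                    ii += 1;
                }
                j += 64;
            }
            i += 64;
        }
    }
}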

9 Likes

How are those timings substantially different?

I think they mean this: slice - Rust

2 Likes

They are not. That is the point. What you see there on ARM is that all the hand written Rust versions are approximately the same speed and very slow. That is to say it makes little difference if you use the 'loop blocking' optimization or not. Little difference if you use old fashioned loops or iterators.

But the version generated by c2rust is over three times faster! That is the last 'do_it_1' there.

The situation is very different on x86_64:

$ RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=native" cargo run  --release
...
MAX:        4096,
BLOCK_SIZE: 64,
do_it_0:    187756us
do_it_1:    187849us
do_it_2:    50370us
do_it_3:    67714us
do_it_4:    51239us
do_it_5:    43677us
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    187679us
do_it_1:    38153us

Here we see that the hand written Rust using 'loop blocking' speeds things up by a factor of between 2.7 and 4.3. That is 'do_it_2/3/4/5'.

Meanwhile the c2rust generated code is 4.9 times faster.

Now, there was speculation that on the Raspberry Pi ARM there is less cache, so this cache optimization may not help much. But the c2rust versions, with and without loop blocking, show that not to be the case: loop blocking has a clear advantage.

I'm at a loss to know what is going on.

2 Likes

Ah, I ended up using the unchecked-index crate, as that is what Google first threw up when I searched for 'rust unchecked'.

I now have yet another version of our array addition function using slice::get_unchecked(), 'do_it_6()'. It makes no noticeable difference, as the timings further below show.
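
Roughly, it looks like the following sketch (the real function may differ in detail, but the array type and the MAX/BLOCK_SIZE constants are the same ones used throughout this thread):

// Sketch: the same blocked while loops as do_it_5, but using the standard
// library's get_unchecked/get_unchecked_mut instead of the unchecked_index crate.
fn do_it_6(a: &mut [[f32; MAX]; MAX], b: &[[f32; MAX]; MAX]) {
    let mut i = 0;
    while i < MAX {
        let mut j = 0;
        while j < MAX {
            let mut ii = i;
            while ii < i + BLOCK_SIZE {
                let mut jj = j;
                while jj < j + BLOCK_SIZE {
                    unsafe {
                        *a.get_unchecked_mut(ii).get_unchecked_mut(jj) +=
                            *b.get_unchecked(jj).get_unchecked(ii);
                    }
                    jj += 1;
                }
                ii += 1;
            }
            j += BLOCK_SIZE;
        }
        i += BLOCK_SIZE;
    }
}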

slice::get_unchecked() on x86_64:

MAX:        4096,
BLOCK_SIZE: 64,
do_it_0:    182807us
do_it_1:    184528us
do_it_2:    49985us
do_it_3:    68141us
do_it_4:    51147us
do_it_5:    44481us
do_it_6:    43788us
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    184747us
do_it_1:    38933us

slice::get_unchecked() on ARM:

MAX:        4096,
BLOCK_SIZE: 64,
do_it_0:    3076866us
do_it_1:    3075138us
do_it_2:    2859130us
do_it_3:    3192932us
do_it_4:    2856175us
do_it_5:    2827841us
do_it_6:    2828779us
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    2584062us
do_it_1:    792847us

So now what? If we can rule out cache differences as the reason for the poor performance on ARM, and we can now rule out the bounds checking, why does all that hand crafted Rust perform so poorly, while the c2rust conversion outperforms the original C?

1 Like

Are you changing BLOCKSIZE for the ARM? Also, you are accessing one of the arrays with a stride of BLOCKSIZE, which can be a problem, especially if the cache is set associative. Loop orderings can have a big impact on performance.

1 Like

No. I run the same code and sizes everywhere.

The thing is, the same work, with the same array and block sizes and the same strides, is much faster when done in C. The Rust code generated by c2rust from that C runs at about the same speed as the C. All of the hand written Rust versions run very much slower, even when using unsafe/unchecked.

$ gcc  -Wall -O3 -o loop_blocking loop_blocking.c
pi@debian-buster-64:~/loop_blocking $ ./loop_blocking
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    2760229us
do_it_1:    793988us
$ RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=native" cargo run  --release
...
MAX:        4096,
BLOCK_SIZE: 64,
do_it_0:    3071662us
do_it_1:    3072337us
do_it_2:    2858955us
do_it_3:    3194253us
do_it_4:    2857345us
do_it_5:    2831210us
do_it_6:    2828688us
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    2584869us
do_it_1:    792373us                              <----------- c2rust translation of the above C code.

As such I don't think cache is the problem here. C can do it fast, Rust can do it fast, just not any normal Rust code a human would write.

There is something very odd about Rust on ARM in this case.

The former in the comparison was c2rust code, no bounds checking; the latter looks about the same, also without bounds checking. They perform the same, but I'm not sure what that's supposed to prove, since they compile to almost identical assembly. What I would do is add back bounds checking to the c2rust version and see if that destroys performance. If it does, then bounds checking is a problem.
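
Concretely, that could be something like this sketch (keeping the generated loop structure and only swapping the raw pointer offsets for ordinary, checked indexing):

// Sketch: the generated do_it_1 with slices and plain indexing, so every
// access goes through the usual bounds checks again.
fn do_it_1_checked(a: &mut [[f32; 4096]], b: &[[f32; 4096]]) {
    let mut i = 0;
    while i < 4096 {
        let mut j = 0;
        while j < 4096 {
            let mut ii = i;
            while ii < i + 64 {
                let mut jj = j;
                while jj < j + 64 {
                    a[ii][jj] += b[jj][ii]; // indexed, hence checked
                    jj += 1;
                }
                ii += 1;
            }
            j += 64;
        }
        i += 64;
    }
}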

They do not perform the same. Far from it.

The c2rust generated code, without bounds checking, is the last 'do_it_1' in the list in my last post.

The hand written code closest to that c2rust style, without bounds checking, is do_it_5 and do_it_6. The former uses the 'unchecked_index' crate, the latter uses 'slice::get_unchecked'.

As you see in the results above, the hand written, unchecked code is about 3.6 times slower on ARM than the c2rust generated code.

Which assembly are you looking at? On x86_64 the hand written, unchecked Rust is much faster, almost keeping up with the c2rust generated code.

The generated code specifies upfront the size of the slice and also does a direct, unsafe pointer offset calculation. This is equivalent to calling get_unchecked, no?

Your handwritten code does include size info, so surely it’s that unsafe offset calculation that’s providing the boost. You’ll have to work from the unsafe, generated code toward an idiomatic version and see where you encounter a slowdown.

2 Likes

I thought that is what I was doing in going from the c2rust generated code to my hand made 'do_it_5' and 'do_it_6'.

I think I need some Rust guru help here.

You could make it easier to help you by keeping your code in a repo somewhere for people to follow along as you try new things.

1 Like

Done.

I'll try and find time to add some commentary about what is in there. Otherwise it's all described in this thread.

Pull requests welcome :slight_smile:

6 Likes