Clippy driving me to insanity, insisting on iterators

The compiler isn't eliding bounds checks; we just don't emit any unnecessary bounds checks in the first place. You can't optimize better than no code at all :slight_smile:

6 Likes

Specifically, the implementation of Iterator::next for std::slice::Iter uses unsafe internally to avoid bounds checks (it still checks whether the iterator is_empty, but does not have to double-check whether the slice index is out of bounds, which would be the case with slicing).
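To make that concrete, here is a toy comparison (illustrative only, and not the actual std implementation):

// Indexed: each `xs[i]` is conceptually a bounds check plus a load; the
// optimizer can often, but not always, prove the check away.
fn sum_indexed(xs: &[f32]) -> f32 {
    let mut total = 0.0;
    for i in 0..xs.len() {
        total += xs[i];
    }
    total
}

// Iterated: the slice iterator only checks "am I at the end?" on each step;
// there is no per-element index to validate, so there is nothing to elide.
fn sum_iterated(xs: &[f32]) -> f32 {
    let mut total = 0.0;
    for &x in xs {
        total += x;
    }
    total
}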

3 Likes

Hmm... The aim of the game here is to do what the code on this page does: https://software.intel.com/en-us/articles/loop-optimizations-where-blocks-are-required

In C, all the memory of that 2D array is contiguous. The whole point, therefore, of what I have in Rust is to have the 2D array in contiguous memory as well. If this:

let mut a = vec![[9f32; MAX]; MAX];
let mut b = vec![[10f32; MAX]; MAX]; 

does not do it then that is the first problem to be fixed.
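One quick way to check (a sketch that just compares the start addresses of two consecutive rows of the Vec's single allocation):

const MAX: usize = 4096;

fn main() {
    let a = vec![[9f32; MAX]; MAX];
    // A Vec<[f32; MAX]> stores its MAX rows back to back in one heap
    // allocation, so row 1 should start exactly MAX floats after row 0.
    let row0 = a[0].as_ptr() as usize;
    let row1 = a[1].as_ptr() as usize;
    assert_eq!(row1 - row0, MAX * std::mem::size_of::<f32>());
    println!("rows are contiguous");
}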

And so, the traversal order in Rust should match the nice diagram on that page.

I think my naive Rust does that; I'm not so sure the iterator versions do...

If anyone has suggestions that would be great.

I don't really want to have to use "unsafe". Then I would have all my C buddies saying "See, told you so: when you have to do real work, all that checking in Rust is pointless".

1 Like

The arrays are contiguous in memory; the blocks being processed are not.

This is (in theory) good for minimizing cache misses (what the link you provided is talking about), but bad for vectorization. However, since the code probably can't be vectorized anyway, making the blocks contiguous is most likely an exercise in futility. I was just speculating.

Another thing to keep in mind is that this kind of tweak is highly dependent on processor details -- most obviously the cache size. If BLOCK_SIZE is chosen for an Intel processor, you shouldn't expect it to be equally good for an ARM processor, or even another Intel processor with a different cache size; you'll want to adjust BLOCK_SIZE accordingly. If you choose the wrong BLOCK_SIZE, you might well end up making it slower than the naive code.
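For example, a rough sweep harness (a sketch only; it times whole passes, assumes MAX is a multiple of every candidate block size, and the function name is made up):

use std::time::Instant;

const MAX: usize = 4096;

// Blocked transpose-add with the block size as a runtime parameter,
// so different sizes can be compared on the same machine.
fn blocked_add(a: &mut [[f32; MAX]], b: &[[f32; MAX]], block: usize) {
    for i in (0..MAX).step_by(block) {
        for j in (0..MAX).step_by(block) {
            for ii in i..i + block {
                for jj in j..j + block {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }
}

fn main() {
    let mut a = vec![[9f32; MAX]; MAX];
    let b = vec![[10f32; MAX]; MAX];
    for &block in &[16, 32, 64, 128, 256] {
        let start = Instant::now();
        blocked_add(&mut a, &b, block);
        println!("block {:4}: {:?}", block, start.elapsed());
    }
}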

The opinions of tragically uninformed people should not impact your decision to use the best tool for the job.

6 Likes

OK. Good. That is the whole idea.

Yes indeed.

Indeed it is. I was already tweaking the block size; 64 seems about optimal on my x86_64 PC. No doubt it is different elsewhere.

I don't really want to tell my C programmer friends they are "tragically uninformed". After all, the whole alias-checking thing that Rust does is new technology that nobody has ever seen in a language before. As far as I can tell, tragically uninformed as I am.

Thing is, so far every C program I have recoded in Rust has ended up as fast as the C original or even a bit faster, without resorting to any 'unsafe'. At least on x86; ARM is a bit of a disaster.

Then I come to this case which falls miserably far behind.

What to do?

Does anyone have a 'no holds barred, unsafe or not' solution to the problem in Rust that can match C?

1 Like

Pipe the C version through c2rust :slight_smile:

4 Likes

I'm sorry, I'm missing something fundamental here. Is the problem (1) that 49241 > 44361, or (2) that you don't like the style of the Rust code required to achieve this performance?

2 Likes

Sure, I don't mean you should be rude. I just wouldn't spend much time worrying about what C programmers (who presumably don't know Rust) think of my Rust code. :man_shrugging:

3 Likes

I guess I have mixed up a few problems here. 'Problems' is perhaps too strong a word; rather, 'puzzles' for the curious mind. None of this is an actual problem for anything I'm currently doing with Rust.

It all starts with wanting to do in Rust what is done in C in this Intel article: https://software.intel.com/en-us/articles/loop-optimizations-where-blocks-are-required.

The puzzles that arise are:

  1. Yes, 49241 microseconds for Rust is significantly slower than the 44361 microseconds for C.

  2. On ARM it's a lot worse: 795426 us for C, 2860273 us for Rust! What is that about?

  3. The fastest Rust solution is not "Clippy approved": Clippy complains about "needless_range_loop". The question then is: can this be done in a Clippy-friendly manner? (One candidate shape is sketched just after this list.)

  4. The Rust solution I came up with that Clippy likes is a nightmare: horribly verbose, obfuscated, and impossible to understand. It's also much slower. I certainly don't like that style of Rust code. Likely that is my fault, though; is there a better way?

  5. What if we forget the C example code and just produce the equivalent functionality in Rust, no matter how it is done? How would a Rust guru do that?
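Here is the candidate shape mentioned in point 3 (a sketch only; it keeps Clippy quiet as far as I can tell, but I make no claims about its speed):

const MAX: usize = 4096;
const BLOCK_SIZE: usize = 64;

// The block offsets still come from ranges with step_by (which Clippy accepts),
// while the element indices come from enumerate() rather than 0..MAX, so
// needless_range_loop does not fire.
fn do_it_blocked(a: &mut [[f32; MAX]], b: &[[f32; MAX]]) {
    for i in (0..MAX).step_by(BLOCK_SIZE) {
        for j in (0..MAX).step_by(BLOCK_SIZE) {
            for (ii, a_row) in a.iter_mut().enumerate().skip(i).take(BLOCK_SIZE) {
                for (jj, a_elem) in a_row.iter_mut().enumerate().skip(j).take(BLOCK_SIZE) {
                    *a_elem += b[jj][ii];
                }
            }
        }
    }
}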

2 Likes

That is a wicked thought. So wicked in fact that I just could not resist trying it.

I ran my loop_blocking.c through c2rust, changed the name of the main it generated, and called it at the end of my Rust code.

The result is stunning:

On x86_64:

$ RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=native" cargo run  --release
...
MAX:        4096,
BLOCK_SIZE: 64,
do_it_0:    180838us
do_it_1:    183917us
do_it_2:    50366us
do_it_3:    67663us
do_it_4:    51039us
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    185558us
do_it_1:    37898us

$ clang  -Wall -O3 -o loop_blocking loop_blocking.c
$ ./loop_blocking
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    186156us
do_it_1:    37656us

$ gcc  -Wall -O3 -o loop_blocking loop_blocking.c
$ ./loop_blocking
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    183282us
do_it_1:    42386us

Yay! Rust is on a par with Clang and handily beats GCC.

On ARM aarch64:

$ RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=native" cargo run  --release
...
MAX:        4096,
BLOCK_SIZE: 64,
do_it_0:    3071584us
do_it_1:    3071831us
do_it_2:    2865702us
do_it_3:    3206544us
do_it_4:    2866663us
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    2615057us
do_it_1:    800239us

$ clang  -Wall -O3 -o loop_blocking loop_blocking.c
$ ./loop_blocking
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    2674021us
do_it_1:    833169us

$ gcc  -Wall -O3 -o loop_blocking loop_blocking.c
$ ./loop_blocking
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    2778817us
do_it_1:    826443us

Yay! Rust now beats both Clang and GCC!

What on Earth is going on here?

The c2rust-generated code even looks quite nice:

pub unsafe extern "C" fn do_it_1(mut a: *mut [libc::c_float; 4096],
                                 mut b: *mut [libc::c_float; 4096]) {
    let mut i: libc::c_int = 0 as libc::c_int;
    while i < 4096 as libc::c_int {
        let mut j: libc::c_int = 0 as libc::c_int;
        while j < 4096 as libc::c_int {
            let mut ii: libc::c_int = i;
            while ii < i + 64 as libc::c_int {
                let mut jj: libc::c_int = j;
                while jj < j + 64 as libc::c_int {
                    (*a.offset(ii as isize))[jj as usize] +=
                        (*b.offset(jj as isize))[ii as usize];
                    jj += 1
                }
                ii += 1
            }
            j += 64 as libc::c_int
        }
        i += 64 as libc::c_int
    };
}
8 Likes

I might be writing a lot of Rust code in C at this rate :slight_smile:

2 Likes

I don't have an ARM chip around. I would be interested in seeing, as the code goes from @ZiCog's Rust to c2rust's Rust, which change made the difference on ARM.

2 Likes

Looking at the disassembly on the Rust Playground, it appears there are no bounds checks in this version -- there is no length information even associated with the two input pointers to check against, and the indexing into the arrays they point to is not bounds checked because of their statically known lengths.

My bet would be on the missing bounds checks speeding the ARM version up significantly, as one of the arguments I've heard for Rust's heavy bounds checking is "on a modern speculative processor, it shouldn't matter", and I'm guessing that ARM's processors are less powerful at speculation than Intel's are.

4 Likes

I thought I would try and investigate that. Many posts back @kornel suggested using 'unchecked get'. What I found was the 'unchecked_index' crate. So now I have yet another function, inspired by the while loops of the c2rust-generated code and using 'unchecked_index', like so:

use unchecked_index::unchecked_index;

fn do_it_5(a: &mut T, b: &T) { // T: the 2D array type used earlier in the thread
    unsafe {
        let mut a = unchecked_index(a);
        let mut b = unchecked_index(b);
        let mut i: usize = 0;
        while i < MAX {
            let mut j: usize = 0;
            while j < MAX {
                let mut ii = i;
                while ii < i + BLOCK_SIZE {
                    let mut jj = j;
                    while jj < j + BLOCK_SIZE {
                        a[ii][jj] += b[jj][ii];
                        jj += 1
                    }
                    ii += 1
                }
                j += BLOCK_SIZE
            }
            i += BLOCK_SIZE
        }
    }
}

On the ARM I now have timing like so:

$ RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=native" cargo run  --release
...
MAX:        4096,
BLOCK_SIZE: 64,
do_it_0:    3077746us
do_it_1:    3077237us
do_it_2:    2861694us
do_it_3:    3209412us
do_it_4:    2863419us
do_it_5:    2833007us
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    2599840us
do_it_1:    794296us

Well, if that 'unchecked_index' has removed the checks in 'do_it_5', it is still slow as hell compared to the c2rust-generated code, the second 'do_it_1' listed above.

Something else is up with Rust on ARM.

Perhaps we could start with the fast c2rust code, try to rewrite it into more "normal rust style", and see what change causes the code to massively slow down.
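A possible first step in that direction (a sketch: it keeps the raw-pointer accesses exactly as c2rust emitted them but swaps the while loops for ranges; the function name is made up):

const MAX: usize = 4096;
const BLOCK_SIZE: usize = 64;

// If this is still as fast as the c2rust output, the loop style is not the
// culprit and the next thing to bisect is the pointer/array access itself.
unsafe fn do_it_1_step1(a: *mut [f32; MAX], b: *mut [f32; MAX]) {
    for i in (0..MAX).step_by(BLOCK_SIZE) {
        for j in (0..MAX).step_by(BLOCK_SIZE) {
            for ii in i..i + BLOCK_SIZE {
                for jj in j..j + BLOCK_SIZE {
                    (*a.add(ii))[jj] += (*b.add(jj))[ii];
                }
            }
        }
    }
}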

7 Likes

How are those timings substantially different?

I think they mean this https://doc.rust-lang.org/std/primitive.slice.html#method.get_unchecked
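For the record, the same blocked loop written with those std methods would look something like this (a sketch; the function name is made up, and the safety argument rests on MAX being a multiple of BLOCK_SIZE so the indices never reach MAX):

const MAX: usize = 4096;
const BLOCK_SIZE: usize = 64;

fn do_it_unchecked(a: &mut [[f32; MAX]], b: &[[f32; MAX]]) {
    assert!(a.len() == MAX && b.len() == MAX);
    for i in (0..MAX).step_by(BLOCK_SIZE) {
        for j in (0..MAX).step_by(BLOCK_SIZE) {
            for ii in i..i + BLOCK_SIZE {
                for jj in j..j + BLOCK_SIZE {
                    // SAFETY: ii and jj stay below MAX, the length of both
                    // outer slices and of every inner array.
                    unsafe {
                        *a.get_unchecked_mut(ii).get_unchecked_mut(jj) +=
                            *b.get_unchecked(jj).get_unchecked(ii);
                    }
                }
            }
        }
    }
}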

2 Likes

They are not. That is the point. What you see there on ARM is that all the hand-written Rust versions are approximately the same speed and very slow. That is to say, it makes little difference whether you use the 'loop blocking' optimization or not, and little difference whether you use old-fashioned loops or iterators.

But the version generated by c2rust is over three times faster! That is the last 'do_it_1' there.

The situation is very different on x86_64:

$ RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=native" cargo run  --release
...
MAX:        4096,
BLOCK_SIZE: 64,
do_it_0:    187756us
do_it_1:    187849us
do_it_2:    50370us
do_it_3:    67714us
do_it_4:    51239us
do_it_5:    43677us
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    187679us
do_it_1:    38153us

Here we see that the hand-written Rust using 'loop blocking' speeds things up by a factor of between 2.7 and 4.3. That is 'do_it_2/3/4/5'.

Meanwhile, the c2rust-generated code is 4.9 times faster.

Now, there was speculation that on the Raspberry Pi ARM there is less cache, so this cache optimization may not help much. But the c2rust versions, with and without loop blocking, show that not to be the case: loop blocking has a clear advantage.

I'm at a loss to know what is going on.

2 Likes

Ah, I ended up using the unchecked-index crate, as that is what Google first threw up when I searched for 'rust unchecked'.

I now have yet another version of our array addition function using slice::get_unchecked(), 'do_it_6()'. It makes no noticeable difference:

slice::get_unchecked() on x86_64:

MAX:        4096,
BLOCK_SIZE: 64,
do_it_0:    182807us
do_it_1:    184528us
do_it_2:    49985us
do_it_3:    68141us
do_it_4:    51147us
do_it_5:    44481us
do_it_6:    43788us
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    184747us
do_it_1:    38933us

slice::get_unchecked() on ARM:

MAX:        4096,
BLOCK_SIZE: 64,
do_it_0:    3076866us
do_it_1:    3075138us
do_it_2:    2859130us
do_it_3:    3192932us
do_it_4:    2856175us
do_it_5:    2827841us
do_it_6:    2828779us
MAX:        4096
BLOCK_SIZE: 64
do_it_0:    2584062us
do_it_1:    792847us

So now what? If we can rule out cache differences as the reason for the poor performance on ARM, and now we can rule out the bounds checking as well, why does all that hand-crafted Rust perform so poorly, while the c2rust conversion outperforms the original C?

1 Like

Are you changing BLOCK_SIZE for the ARM? Also, you are accessing one of the arrays with a large stride (a whole row between consecutive accesses), which can be a problem, especially if the cache is set-associative. Loop orderings can have a big impact on performance.
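To illustrate the loop-ordering point, here is the same blocked add with the two innermost loops swapped (a sketch only; no claim that it is faster, just that the strided access moves from b to a):

const MAX: usize = 4096;
const BLOCK_SIZE: usize = 64;

// The inner loop now walks b[jj] contiguously, and it is `a` that is accessed
// with a stride of a whole row. Which ordering wins depends on the cache.
fn do_it_swapped(a: &mut [[f32; MAX]], b: &[[f32; MAX]]) {
    for i in (0..MAX).step_by(BLOCK_SIZE) {
        for j in (0..MAX).step_by(BLOCK_SIZE) {
            for jj in j..j + BLOCK_SIZE {
                for ii in i..i + BLOCK_SIZE {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }
}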

1 Like