Clippy driving me to insanity, insisting on iterators

I thought, by way of a quick challenge, I would convert some nested C loops to Rust. Here is the C:

for (int i = 0; i < MAX; i += BLOCK_SIZE) {
    for (int j = 0; j < MAX; j += BLOCK_SIZE) {
        for (int ii = i; ii < i + BLOCK_SIZE; ii++) {
            for (int jj = j; jj < j + BLOCK_SIZE; jj++) {
                a[ii][jj] += b[jj][ii];
            }
        }
    }
}

After a long conversation with clippy I end up with this in Rust:

// T is the 2D array type from the playground: Vec<[f32; MAX]>
fn do_it_3(a: &mut T, b: &T) {
    for s in 0..BLOCKS {
        for t in 0..BLOCKS {
            a.iter_mut()
                .skip(s * BLOCK_SIZE)
                .take(BLOCK_SIZE)
                .enumerate()
                .for_each(|(i, row)| {
                    row.iter_mut()
                        .skip(t * BLOCK_SIZE)
                        .take(BLOCK_SIZE)
                        .enumerate()
                        .for_each(|(j, y)| {
                            *y += b[j + t * BLOCK_SIZE][i + s * BLOCK_SIZE];
                        });
                });
        }
    }
}

I'm not happy. That is very long-winded and unintelligible. It's also slow.

What actually is this?

Recently someone linked to the Intel article on writing cache-friendly code: Intel Developer Zone. No big news for anyone nowadays, I guess. It includes the above simple C example that adds the elements of two large arrays.

But I started to wonder how one would do that in Rust.

The closest I have come to writing something reasonable looking in Rust that almost runs as fast as the C is this:

for i in 0..(MAX / BLOCK_SIZE) {
    for j in 0..(MAX / BLOCK_SIZE) {
        for ii in (i * BLOCK_SIZE)..((i * BLOCK_SIZE) + BLOCK_SIZE) {
            for jj in (j * BLOCK_SIZE)..((j * BLOCK_SIZE) + BLOCK_SIZE) {
                a[ii][jj] += b[jj][ii];
            }
        }
    }
}

Of course clippy does not like this at all. But it performs almost like C:

C:    44361 microseconds
Rust: 49241 microseconds

Perhaps I'm missing a trick here. Any suggestions?

Code is in the playground: Rust Playground

4 Likes

This seems like a place where I'd slap a #[allow(clippy::whatever)] in with maybe a comment and move on.
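
For example, a sketch using the lint that actually fires here (needless_range_loop, per the clippy warning shown later in this thread), with T as the 2D array type from the playground:

// The indexed loops are deliberate: they mirror the cache-blocking C code
// from the Intel article, and the iterator version was slower and harder
// to read.
#[allow(clippy::needless_range_loop)]
fn do_it_2(a: &mut T, b: &T) {
    for i in 0..(MAX / BLOCK_SIZE) {
        for j in 0..(MAX / BLOCK_SIZE) {
            for ii in (i * BLOCK_SIZE)..((i + 1) * BLOCK_SIZE) {
                for jj in (j * BLOCK_SIZE)..((j + 1) * BLOCK_SIZE) {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }
}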

13 Likes

Exactly. Linters are customizable for a reason. One of these reasons is: there are different kinds and categories of lints. Not every lint is about a potential correctness or performance problem; sometimes it's about style, but the developers of any particular linter still have to pick a default.

Clippy tends to encourage functional style and iterators. It doesn't, however, have the broader context of the code, nor the intelligence of a human programmer. If in your professional opinion it is wrong or overly pedantic, feel free to tell it to just shut up and call it a day. I wouldn't overthink it.

10 Likes

I'm quite happy to tell clippy to take a hike.

I am likely to overthink it. Being a newbie to Rust still, I like the idea of adhering to the 'house style' as told to me by clippy. And then there is the nagging thought that perhaps there is a nice way to do it with iterators that I am missing, and that perhaps clippy has led me astray somehow.

I was thinking there might be a better way using .chunks() instead of all that .skip().take() I have. I have not managed to make that idea fully work yet, but I would expect it to land somewhere near the sketch below.
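
A sketch of the .chunks() shape (assuming the playground's T = Vec<[f32; MAX]> and that MAX is a multiple of BLOCK_SIZE; not benchmarked): chunk a's rows into bands of BLOCK_SIZE, slice each row for the column block, and reconstruct b's global indices from the chunk numbers.

fn do_it_chunks(a: &mut T, b: &T) {
    // s is the block-row number; a_rows holds BLOCK_SIZE consecutive rows.
    for (s, a_rows) in a.chunks_mut(BLOCK_SIZE).enumerate() {
        // t is the block-column number.
        for t in 0..MAX / BLOCK_SIZE {
            for (i, row) in a_rows.iter_mut().enumerate() {
                for (j, y) in row[t * BLOCK_SIZE..][..BLOCK_SIZE].iter_mut().enumerate() {
                    // a[s*B + i][t*B + j] += b[t*B + j][s*B + i]
                    *y += b[t * BLOCK_SIZE + j][s * BLOCK_SIZE + i];
                }
            }
        }
    }
}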

But then there is the second question in my post: How to achieve the same speed as C? We are nearly there but not quite.

Once again I find the performance on ARM to be terrible:

C:     794444 microseconds
Rust: 2855917 microseconds

2 Likes

Just figured I'd chip in here with step_by:

#[allow(clippy::all)]
fn do_it_4(a: &mut T, b: &T) {
    for i in (0..MAX).step_by(BLOCK_SIZE) {
        for j in (0..MAX).step_by(BLOCK_SIZE) {
            for ii in i..i + BLOCK_SIZE {
                for jj in j..j + BLOCK_SIZE {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }
}

It looks like the optimizer isn't able to elide the bounds checks, which is probably why it's slower than C. As for why it's much slower on ARM, that could be more missed optimizations, or just the cost of worse/no branch prediction?

9 Likes

Wow, thanks. I really like that. It looks much nicer than my hand-made stepping thing. Performance seems to be pretty much the same.

In the playground here: Rust Playground

I'm used to my Rust translations being a bit slower than the C originals on ARM, whereas they sometimes come out ahead on x86-64. But this example is extreme. It's four times slower on ARM.

2 Likes

Are you using equivalent ARM targets in C and Rust? e.g. soft/hard float in particular would make a huge difference.

2 Likes

No idea.

The Rust I build like so:

$ RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=cortex-a53" cargo run  --release

The C I build like so:

$ gcc -Wall -O3 -mtune=native -o loop_blocking loop_blocking.c

I have tried all kinds of 'cortex-aXX' options; none make much difference, except those that trip over 'illegal instruction'.

I'm running all this on Raspbian with a 64-bit kernel and the dh64-shell, so they are aarch64 binaries.

If there are any other Raspberry Pi specific options for Rust, I cannot find what they are or how to set them.

1 Like

You can use -C target-cpu=native to let it be detected for Rust too. Maybe you could also try your C build with clang, to see if this is really a backend issue with LLVM optimization, vs. being a real language difference.
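
Concretely, adapting the build lines from above (an illustration, not a tested recipe; note that clang spells the CPU flag -mcpu=native on AArch64):

$ RUSTFLAGS="-C opt-level=3 -C target-cpu=native" cargo run --release
$ clang -Wall -O3 -mcpu=native -o loop_blocking loop_blocking.c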

3 Likes

Bounds checks have a hidden cost: they prevent auto-vectorization. Try using iterators at least for the innermost loops. Failing that, try unsafe get_unchecked.
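
A sketch of the get_unchecked route (my example, assuming the playground's T = Vec<[f32; MAX]> and that MAX is a multiple of BLOCK_SIZE; not benchmarked):

fn do_it_unchecked(a: &mut T, b: &T) {
    for i in (0..MAX).step_by(BLOCK_SIZE) {
        for j in (0..MAX).step_by(BLOCK_SIZE) {
            for ii in i..i + BLOCK_SIZE {
                for jj in j..j + BLOCK_SIZE {
                    // SAFETY: ii < MAX and jj < MAX because MAX is a
                    // multiple of BLOCK_SIZE, so every access is in bounds.
                    unsafe {
                        *a.get_unchecked_mut(ii).get_unchecked_mut(jj) +=
                            *b.get_unchecked(jj).get_unchecked(ii);
                    }
                }
            }
        }
    }
}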

6 Likes

If I take trentj's code above clippy says:

warning: the loop variable `ii` is used to index `a`
  --> src/main.rs:75:23
   |
75 |             for ii in i..i + BLOCK_SIZE {
   |                       ^^^^^^^^^^^^^^^^^
   |
   = note: `#[warn(clippy::needless_range_loop)]` on by default
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_range_loop
help: consider using an iterator
   |
75 |             for (ii, <item>) in a.iter_mut().enumerate().skip(i).take(BLOCK_SIZE) {
   |                 ^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

And similarly for the inner jj loop.

OK, if I take Clippy's advice I end up with this:

    for i in (0..MAX).step_by(BLOCK_SIZE) {
        for j in (0..MAX).step_by(BLOCK_SIZE) {
            for (ii, item_ii) in a.iter_mut().enumerate().skip(i).take(BLOCK_SIZE) {
                for (jj, item_jj) in b.iter().enumerate().skip(j).take(BLOCK_SIZE) {
                    a[ii][jj] += b[jj][ii];
                }
            }
        }
    }

Well, now that assignment in the middle is of course all wrong, with all kinds of borrow errors. At which point I'm at a loss as to how to continue.

As you can see, I did pull it off with the code I presented in my first post here. That took me half a day of fiddling to get working. Looking at it now, I have no idea how I got there or how it works.

Any suggestions would be very welcome.
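
For reference, the clippy-suggested shape can be made to compile if only a flows through the mutable iterator chain and b stays indexed (a sketch, not benchmarked):

for i in (0..MAX).step_by(BLOCK_SIZE) {
    for j in (0..MAX).step_by(BLOCK_SIZE) {
        // enumerate() before skip()/take() keeps the global indices,
        // and b is only read, so there is no conflicting borrow.
        for (ii, row) in a.iter_mut().enumerate().skip(i).take(BLOCK_SIZE) {
            for (jj, y) in row.iter_mut().enumerate().skip(j).take(BLOCK_SIZE) {
                *y += b[jj][ii];
            }
        }
    }
}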

Can this really be vectorized, given that one array is traversed in row-major order and the other in column-major order?

2 Likes

Good point. I don't think it can. Even if it could, I think you're iterating it in the wrong order because the blocks are not contiguous in memory. IOW you would want to be iterating like this:

1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1|2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2|3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4
...

But you're iterating like this instead:

1 1 1 1|2 2 2 2|3 3 3 3
1 1 1 1|2 2 2 2|3 3 3 3
1 1 1 1|2 2 2 2|3 3 3 3
1 1 1 1|2 2 2 2|3 3 3 3
4 4 4 4|5 5 5 5|6 6 6 6
...

Even without vectorization, modifying the code so one of the arrays is traversed in contiguous blocks might help. Or it might not because you have to do extra math to figure out the indices.

I suspect the difference between ARM and Intel has more to do with the cost of bounds checks being higher on ARM, which means if you eliminate the bounds checks (either with unsafe, or judicious use of iterators) you might get much better performance from that change alone.

2 Likes

Why is it easier for the compiler to elide bounds checking for iterators than for loops? I'd be interested in some reading on this.

2 Likes

The compiler isn't eliding bounds checks; with iterators we simply never write any unnecessary bounds checks in the first place. You can't optimize better than no code at all :slight_smile:

6 Likes

Specifically, the implementation of Iterator::next for std::slice::Iter uses unsafe internally to avoid bounds checks (it still checks whether the iterator is_empty, but does not have to double-check whether the slice index is out of bounds, which would be the case with slicing).
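
A minimal illustration of the difference (my sketch, not from the thread):

// Indexed: nothing ties `i < a.len()` to b's length, so `b[i]` gets a
// bounds check (and potential panic) on every iteration.
fn add_indexed(a: &mut [f32], b: &[f32]) {
    for i in 0..a.len() {
        a[i] += b[i];
    }
}

// Iterators: `zip` stops at the shorter slice, so `next()` only checks
// "are we at the end?"; there is no per-element index to validate.
fn add_zipped(a: &mut [f32], b: &[f32]) {
    for (x, y) in a.iter_mut().zip(b) {
        *x += *y;
    }
}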

4 Likes

Hmm... The aim of the game here is to do what the code on that Intel Developer Zone page does.

In C all the memory of that 2D array is contiguous. The whole point, therefore, of what I have in Rust is to have the 2D array in contiguous memory. If this:

let mut a = vec![[9f32; MAX]; MAX];
let mut b = vec![[10f32; MAX]; MAX]; 

does not do it then that is the first problem to be fixed.
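
For what it's worth, it does: Vec<[f32; MAX]> is a single heap allocation of MAX * MAX floats, with each [f32; MAX] row stored inline. A quick illustrative check (tiny MAX just for the demonstration):

const MAX: usize = 4;

fn main() {
    let a = vec![[9f32; MAX]; MAX];
    // Row i + 1 must start exactly MAX floats after row i.
    let row0 = a[0].as_ptr() as usize;
    let row1 = a[1].as_ptr() as usize;
    assert_eq!(row1 - row0, MAX * std::mem::size_of::<f32>());
}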

And so the traversal order in Rust should match the nice diagram on that page.

I think my naive Rust does that; I'm not so sure the iterator versions do...

If anyone has suggestions that would be great.

I don't really want to have to use "unsafe". Then I would have all my C buddies saying "See, told you so: when you have to do real work, all that checking in Rust is pointless".

1 Like

The arrays are contiguous in memory; the blocks being processed are not.

This is (in theory) good for minimizing cache misses (what the link you provided is talking about), but bad for vectorization. However, since the code probably can't be vectorized anyway, making the blocks contiguous is most likely an exercise in futility. I was just speculating.

Another thing to keep in mind is that this kind of tweak is highly dependent on processor details -- most obviously the cache size. If BLOCK_SIZE is chosen for an Intel processor, you shouldn't expect it to be equally good for an ARM processor, or even another Intel processor with a different cache size; you'll want to adjust BLOCK_SIZE accordingly. If you choose the wrong BLOCK_SIZE, you might well end up making it slower than the naive code.
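
One back-of-the-envelope way to size it (my heuristic, with an assumed 32 KiB L1 data cache; check your own CPU): two BLOCK_SIZE x BLOCK_SIZE f32 tiles, one per array, should fit in L1 at the same time.

const L1D_BYTES: usize = 32 * 1024; // assumed L1d size
// 2 * BLOCK_SIZE^2 * size_of::<f32>() must fit: 2 * 64 * 64 * 4 = 32 KiB.
const BLOCK_SIZE: usize = 64;
const _: () = assert!(2 * BLOCK_SIZE * BLOCK_SIZE * 4 <= L1D_BYTES);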

The opinions of tragically uninformed people should not impact your decision to use the best tool for the job.

6 Likes

OK. Good. That is the whole idea.

Yes indeed.

Indeed it is. I was already tweaking around with the block size; 64 seems about optimal on my x86_64 PC. No doubt it is different elsewhere.

I don't really want to tell my C programmer friends they are "tragically uninformed". After all, the whole alias-checking thing that Rust does is new technology that nobody has ever seen in a language before. As far as I can tell, tragically uninformed as I am.

Thing is, so far every C program I have recoded in Rust has ended up as fast as the C original or even a bit faster, without resorting to any 'unsafe'. At least on x86; ARM is a bit of a disaster.

Then I come to this case which falls miserably far behind.

What to do?

Does anyone have a 'no holds barred, unsafe or not' solution to the problem in Rust that can match C?

1 Like

Pipe the C version through c2rust :slight_smile:

5 Likes

I'm sorry, I'm missing something fundamental here. Is the problem (1) that 49241 > 44361, or (2) that you don't like the style of the Rust code required to achieve this performance?

2 Likes