Why are Cartesian product iterators slower than nested for loops?

Hi, I'm making some minimal HPC code examples. I'm facing a problem implementing the Floyd-Warshall algorithm using a simple flat vector (the data structure is not relevant to this question). I have two equivalent blocks of code:

Using nested for loops:

fn floyd(graph: &mut Vec<u32>, n: usize) {
    for k in 0..n {
        for i in 0..n {
            for j in 0..n {
                let sum = graph[i * n + k] + graph[k * n + j];
                if sum < graph[i * n + j] {
                    graph[i * n + j] = sum;
                }
            }
        }
    }
}

Using iterators with iproduct:

use itertools::iproduct;

fn floyd(graph: &mut Vec<u32>, n: usize) {
    for (k, i, j) in iproduct!(0..n, 0..n, 0..n) {
        let i0 = i * n + j;
        let sum = graph[i * n + k] + graph[k * n + j];
        if sum < graph[i0] {
            graph[i0] = sum;
        }
    }
}

Both do the same thing and work as expected, but the iproduct version is 3x slower than the three nested for loops. What's happening here? It should at least take the same time to do the iterations, shouldn't it?

Is the loss of performance due to laziness?

Any kind of help would be appreciated, thanks in advance.

I suspect this is a problem with external iteration -- every call to next() has to check the current state of each of the product's nesting levels to see how to advance the iterator. The Product type does implement an override for fold though, so this might perform better with internal iteration:

iproduct!(0..n, 0..n, 0..n).for_each(|(k, i, j)| {
    // ...
});

(Note the closure takes a single tuple argument, not three separate arguments.)

Thanks for your answer! It is indeed faster; unfortunately, iterators are still slower than nested for loops. On a 2048 × 2048 matrix there's a difference of 3 seconds between the two solutions.

Sadly, the optimizer fails to remove bounds checks:

That iterator implements fold and for_each in a way so that you can regain some of that performance.

The basic difference between push/pull iteration is explored in this blog post https://medium.com/@veedrac/rust-is-slow-and-i-am-the-cure-32facc0fdcb ; take the title with a grain of salt.

The real title

Rust’s iterators [were] inefficient, and here’s what we can do about it.

is a lot better than the original (joke?) title that the URL still has.

In my experience:

  1. Iterators are slower than nested for loops.

2. If you create an actual 2D array and index it with something like "b[j][i]", that is significantly faster than synthesizing one out of a one-dimensional vector and indexing it with "graph[i * n + k]".

That leaves the issue of creating such a 2D array if it is large. One suggestion that came up here was:

type Array = [[i32; MAX]; MAX];
...
let mut a = unsafe {
    let layout = std::alloc::Layout::new::<Array>();
    let ptr = std::alloc::alloc_zeroed(layout) as *mut Array;
    // Note: alloc_zeroed can return null on allocation failure;
    // a null check before Box::from_raw would be advisable.
    Box::from_raw(ptr)
};

My code demonstrating this is in a little repo here:


Yes, it's a big win when the "row stride" or equivalent is a compile-time constant, which it is with a known-size array.

Since nalgebra has compile-time and type-level size information implemented for its matrices, it should be able to support the same thing.

As @bluss already said, there are crates to handle matrices and graphs, and for valid reasons: handling them in a way that is both fast and safe is not trivial. Unfortunately, the only crate I can find that implements your algorithm does not seem to be actively maintained, and it might even be slower than your implementation.

That said, if you really don't want to use an external crate and want the best performance, you need to go unsafe -- which is definitely not recommended if you've just started with Rust.

In any case, starting from what @kornel linked (there is a typo -- the matrix size is n * n, not n * n * n), you can use unsafe to access data without bounds checking: godbolt link.
In this case it is trivial to verify that the operations are sound, because you are already operating on a slice of known size. Moreover, you can use the wrapping_ operations, because you know the indices won't ever wrap around. I would also suggest documenting why each unsafe operation is sound; your future self will thank you.

Just to emphasize: generally I strongly suggest avoiding unsafe for these purposes, because there is a high probability that someone has already made a better optimization in a specialized crate. In this case the situation looks a bit different, and I would suggest putting all your high-performance unsafe routines in a separate crate to ease maintenance. In any case, I think you should at least use ndarray or petgraph instead of a raw Vec to make your life easier. :blush:


Unfortunately avoiding bounds checking like that does not improve the performance one bit over Genarito's nested for loop example in the first post.

What it does is add a tremendous amount of obfuscation. We have this:

let sum = unsafe {
    graph.get_unchecked(i.wrapping_mul(n).wrapping_add(k))
        + graph.get_unchecked(k.wrapping_mul(n).wrapping_add(j))
};
let cell = unsafe { graph.get_unchecked_mut(i.wrapping_mul(n).wrapping_add(j)) };
if sum < *cell {
    *cell = sum;
}

instead of this:

let sum = graph[i * n + k] + graph[k * n + j];
if sum < graph[i * n + j] {
    graph[i * n + j] = sum;
}

I conclude that neither "unsafe" nor obfuscation is required here. I feel it would be very sad if it were.

For the record, on my Win 10/Debian WSL PC, Genarito's nested for loop solution runs in 4.9 seconds, his iproduct solution in 36 seconds, and your unsafe version in 4.9 seconds.
Code is here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=26b394c10b56b4d2fee8504f7738a007

I keep finding that, with the current state of the art, functional-style Rust with iterators and the rest is slower than just writing Rust as if it were C, as well as generally obfuscating what is going on.

Which is fine by me.


ndarray has some good implementations for multidimensional array traversal, but it doesn't do very well when using index notation to access elements.

With it we have to think just like when using Python's numpy: how do we express this as a higher-level operation rather than with indices?

I linked godbolt because the difference can be easily seen: without bounds checking the compiler is able to produce vectorized code that it otherwise cannot.

Have you tried compiling with target-cpu=native? It is obvious that using unsafe and wrapping operations is more verbose; it has to be...

I have now. Going native slows it down by about 10%. I have never seen doing this result in any benefit, not on x86_64 or aarch64.

Like so:

$ RUSTFLAGS="-C target-cpu=native" cargo build --release
...
$ time ./target/release/floyd-warshall

real    0m5.399s
user    0m5.297s
sys     0m0.047s

$ cargo build --release
...
$ time ./target/release/floyd-warshall

real    0m4.983s
user    0m4.953s
sys     0m0.031s

That algorithm hops around the array so wildly I can't imagine it can be vectorized.

This is present in the code without bounds checks:

       vpbroadcastd    ymm1, dword ptr [r10]
        mov     rcx, r8
        mov     rdx, qword ptr [rsp]
        xor     ebp, ebp
        mov     rdi, qword ptr [rsp + 16]
.LBB0_26:
        vpaddd  ymm2, ymm1, ymmword ptr [r12 + rdx - 96]
        vpmaxud ymm3, ymm2, ymmword ptr [r12 + rcx]
        vpcmpeqd        ymm3, ymm2, ymm3
        vpxor   ymm3, ymm3, ymm0
        vpmaskmovd      ymmword ptr [r12 + rcx], ymm3, ymm2
        vpaddd  ymm2, ymm1, ymmword ptr [r12 + rdx - 64]
        vpmaxud ymm3, ymm2, ymmword ptr [r12 + rcx + 32]
        vpcmpeqd        ymm3, ymm2, ymm3
        vpxor   ymm3, ymm3, ymm0
        vpmaskmovd      ymmword ptr [r12 + rcx + 32], ymm3, ymm2
        vpaddd  ymm2, ymm1, ymmword ptr [r12 + rdx - 32]
        vpmaxud ymm3, ymm2, ymmword ptr [r12 + rcx + 64]
        vpcmpeqd        ymm3, ymm2, ymm3
        vpxor   ymm3, ymm3, ymm0
        vpmaskmovd      ymmword ptr [r12 + rcx + 64], ymm3, ymm2
        vpaddd  ymm2, ymm1, ymmword ptr [r12 + rdx]
        vpmaxud ymm3, ymm2, ymmword ptr [r12 + rcx + 96]
        vpcmpeqd        ymm3, ymm2, ymm3
        vpxor   ymm3, ymm3, ymm0
        vpmaskmovd      ymmword ptr [r12 + rcx + 96], ymm3, ymm2
        add     rbp, 32
        sub     rdx, -128
        sub     rcx, -128
        add     rdi, -4
        jne     .LBB0_26
        cmp     qword ptr [rsp + 24], 0
        je      .LBB0_30
.LBB0_28:
        vpbroadcastd    ymm1, dword ptr [r10]
        lea     rcx, [rbx + 4*rbp]
        lea     rdx, [rax + 4*rbp]
        xor     edi, edi
.LBB0_29:
        vpaddd  ymm2, ymm1, ymmword ptr [rdx + rdi]
        vpmaxud ymm3, ymm2, ymmword ptr [rcx + rdi]
        vpcmpeqd        ymm3, ymm2, ymm3
        vpxor   ymm3, ymm3, ymm0
        vpmaskmovd      ymmword ptr [rcx + rdi], ymm3, ymm2
        add     rdi, 32
        cmp     r15, rdi
        jne     .LBB0_29

which is not available in the code with bounds checks. If you look at the generated assembly, you can also see that the checked code is much longer, and given that the algorithm written this way is not cache friendly, it is understandable why it can be slower.

But, as I already said, working with matrices and graphs is hard; if LAPACK is still used nowadays, there are reasons (and remember that it is written in Fortran, not C, for good reasons as well). If you want to write a good implementation of this (and not a trivial one like we are discussing), it is probably necessary to consider a better memory layout, using Z-order curves for instance. Which brings us back to the fact that it is generally a good idea to use crates for handling matrices and graphs.

It is important to always benchmark the code, but doing that without reasoning about what is happening behind the scenes (and without using something like hyperfine to obtain meaningful results) can only help you choose between two solutions. Rationally analyzing the generated code is necessary to find an optimal solution.

I don't disagree with anything you have said there.

Clearly there are often gains to be had by analyzing the problem at hand, understanding the available algorithms, tweaking the algorithms and/or implementation for performance, using and/or inventing a new algorithm to solve the problem. It helps to understand your target hardware so as to create cache friendly, vectorization friendly code. And so on.

Except for some minor details:

All of the above demands a great deal of experience and expertise. Most programmers do not have expertise in most things, and usually not the time to acquire it. It's great when someone has put their time into creating a high-performance library to do what we want. Otherwise we are on our own.

The solution you presented does not perform better, while at the same time introducing a lot of obfuscation and employing "unsafe".

Given that the performance difference between the iterator solution and a normal nested loop is an order of magnitude, I see no reason to get into sophisticated benchmarking.

In general my gripe is that, so far, every time the discussion of "idiomatic Rust" comes up, I find the suggestions involve convoluted use of iterators and other obfuscations, and have worse performance -- or perhaps break even on a good day.


Thank you so much to everyone! All your answers are helpful.

My code is only for educational purposes; I don't intend to develop a high-performance crate, but I ran into this problem and it caught my attention, since the Rust book recommends using iterators over loops.

Moreover, it's surprising that using a matrix type is faster than using a flat array, thanks to data locality (+1 for zero-cost abstractions); however, it's a little disappointing that I must use unsafe blocks. But it's true that in production one should use a robust crate.

I hope that in the future the compiler can optimize away the bounds checking. I'm new to Rust and I don't understand why allocating a Vec<Vec> with with_capacity doesn't behave the same, when iterating over the structure, as allocating it zeroed with unsafe. But that's another question.

Thank you again!

I am now totally bewildered by all this talk of bounds checking here.

If we write the algorithm in C, as per Genarito's first example using for loops, it runs at the same speed:

$ cat floyd-warshall.c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX 2048

void floyd(uint32_t * graph, size_t n) {
    for (size_t k = 0; k < n; k++) {
        for (size_t i = 0; i < n; i++) {
            for (size_t j = 0; j < n; j++) {
                uint32_t sum = graph[i * n + k] + graph[k * n + j];
                if (sum < graph[i * n + j]) {
                    graph[i * n + j] = sum;
                }
            }
        }
    }
}

int main(int argc, char* argv[]) {

    uint32_t* a = calloc(MAX * MAX, sizeof(uint32_t)); /* zeroed; malloc would leave the graph uninitialized */

    floyd(a, MAX);

    return 0;
}
$ clang  -Wall -O3 -o floyd-warshall floyd-warshall.c
$ time ./floyd-warshall

real    0m4.957s
user    0m4.938s
sys     0m0.016s

$ cargo build --release
...
$ time ./target/release/floyd-warshall

real    0m5.007s
user    0m4.875s
sys     0m0.047s

C does not do any bounds checking. Looks like Rust is not doing any either. Or at least doing it very efficiently.

By the way, the compiler cannot remove bounds checks from code like this:

fn foo(s: &[u32], n: usize) -> u32 {
    let s = &s[..n * n];
    let mut sum = 0;
    for i in 0..n {
        sum += s[i * n];
    }
    sum
}

Overflow for usize is defined, so if n is greater than sqrt(usize::MAX), then n*n wraps around and can be less than n/2*n.

But if n is a constant, constant propagation can do the trick.

Bummer. LLVM doesn't even get it if there's explicit assert!(n < 100), or the end is n_squared/n, etc.
