Vectors, scope, and performance

Hello all,

I'm a newcomer here -- I've had a few pokes and prods at Rust in the past, but have only started a really serious dive in the last few weeks. As a result I've come up with some unexpected discoveries about performance in Rust, which I'm hoping someone can help me understand better.

(TL;DR this is not a "Help my program doesn't work!" request; it's a "Hmm, my program does something better than I expected and I'd like to understand why..." inquiry.)

One of the things I tend to do with a new language is to see if I can port some existing small project of mine. In this case I decided to port a little simulation back from my days in academia, the results of which are here:
https://github.com/WebDrake/dregrs

The design is pretty simple -- there are a bunch of vectors that get instantiated up front, and then various functions operate on these as mutable buffers in an inner loop (for anyone interested, the simulation is of co-evolving reputation measures for objects and users based on the ratings given; the original was written to calculate results for this paper: [1001.3745] The effect of discrete vs. continuous-valued ratings on reputation and ranking systems). In terms of functionality and performance, all is good.

However, while some of those vectors are genuinely outputs of the calculation, others are really just buffers used to store the results of under-the-hood calculations. So for fun, I decided to see what would happen if I encapsulated those vector instances inside the scope in which they are really needed, and generated them afresh inside that scope for each use. The results are in this branch:
https://github.com/WebDrake/dregrs/tree/encapsulate-vectors
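
To make the contrast concrete, here is a minimal sketch of the two styles; the function names and arithmetic are hypothetical stand-ins, not the actual code from the repo:

// Original style: the caller owns the buffer and reuses it across iterations.
fn divergence_into(buf: &mut [f64], ratings: &[f64]) {
    for (d, r) in buf.iter_mut().zip(ratings) {
        *d = r * r;
    }
}

// Encapsulated style: a fresh Vec is created on every call.
fn divergence(ratings: &[f64]) -> Vec<f64> {
    let mut buf = vec![0.0; ratings.len()];
    for (d, r) in buf.iter_mut().zip(ratings) {
        *d = r * r;
    }
    buf
}

fn main() {
    let ratings = vec![0.5_f64; 8];
    let mut buf = vec![0.0_f64; ratings.len()];
    for _ in 0..3 {
        divergence_into(&mut buf, &ratings); // no per-iteration allocation
        let fresh = divergence(&ratings);    // allocates (in principle) each iteration
        assert_eq!(buf, fresh);
    }
}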

What I was expecting was that this would result in the vectors being reallocated multiple times inside the inner loop, and therefore kill performance. What I found instead was that performance was not affected at all; the program was just as fast.

My first thought was that since the lengths of the various vectors are constants known at compile time, maybe the compiler is smart enough to use that to recognize that a single fixed-size buffer can be instantiated under the hood and reused rather than reallocated. So I added an extra patch to make the vector sizes runtime parameters passed via command-line arguments:
https://github.com/WebDrake/dregrs/tree/runtime-vec-lengths

... but again there was no performance impact. So at this point I'm stumped; the compiler is obviously doing something smart that avoids unneeded reallocation of vector instances used inside the inner loops, but I don't understand why or how it is able to do this. (One thought is that since the vectors are relatively small in size -- 1000 elements -- maybe the compiler is able to recognize that they can be allocated on the stack, but I'm not sure how to confirm that.)

So, if anyone has any thoughts or ideas on why Rust is able to achieve these performance outcomes, I'd be happy to hear them. (It's nice to be able to submit a help post asking why things are unexpectedly good, rather than about something not working...)

Thanks & best wishes,

 -- Joe

No. Jemalloc is just that good.

Simple example: the same memory gets reused by the allocator, even in debug builds.
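
A minimal sketch of that kind of check (my own illustration, not the original example): record the address of a Vec's buffer, drop it, allocate another of the same size, and compare:

fn main() {
    // Allocate, record the address, then drop the Vec at the end of the block.
    let first = {
        let v = vec![0u8; 1000];
        v.as_ptr() as usize
    };

    // Allocate a second Vec of the same size.
    let v = vec![0u8; 1000];
    let second = v.as_ptr() as usize;

    // Most allocators hand the just-freed block straight back, so the two
    // addresses typically match, even without optimizations.
    println!("first: {:#x}, second: {:#x}, reused: {}",
             first, second, first == second);
}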

FWIW, with the system allocator, it also passes that test.

It may still be the case that jemalloc is beneficial to the OP's workload, but don't count glibc out either. :slight_smile:

Hi folks -- thanks for the replies and insights. I hadn't even thought of the malloc implementation as a possible influence, but on checking it out, it seems the behaviour (including the speed of the program) is the same whether I use the default jemalloc or switch to the system allocator (which I'm presuming is not jemalloc on Ubuntu Linux).

I guess the real question I'm considering here is whether this sort of instantiate-vectors-in-the-scope-where-they're-used approach is a reasonable design pattern for Rust, or whether something closer to my first implementation is preferable. But I expect that's only really answerable on a case-by-case basis ...

I feel like LLVM can hoist the allocation out of an inner scope even if the length of the allocation is determined at runtime, so long as it knows that the length will be the same in each scope.

I guess to answer the question of whether this is an allocator optimization or a compiler optimization, we need to know where the allocation is happening. Maybe make a flamegraph? If you see libc / malloc being called for each calculate_* function, then it's not the compiler. If you see it being called once before all of the calculate_* function calls, then it is the compiler.

I've not worked with flamegraphs before, and the results of trying weren't very informative, but that may be down to my inexperience with the technique. I basically followed the approach described in Linux perf Examples, and the following was the result:

... which suggests to me that the compiler has done some serious under-the-hood reorganization of things.

So, I tried running the program in gdb and putting a breakpoint on this function from the inner loop:

fn calculate_user_divergence(users: usize,
    object_reputation: &Vec<f64>, ratings: &Vec<Rating>) -> Vec<f64> {
    let mut user_divergence: Vec<f64> = vec![0.0; users];
    for r in ratings.iter() {
        let aux = r.weight - object_reputation[r.object];
        user_divergence[r.user] += aux * aux;
    }

    user_divergence
}

... which, when gdb landed on it, placed me directly on the line after the vector initialization:

Breakpoint 1, dregrs::calculate_user_divergence::hd09975feb6539130 (users=1000, 
    object_reputation=<optimised out>, ratings=<optimised out>) at src/main.rs:75
75	        let aux = r.weight - object_reputation[r.object];

... suggesting that the allocation of the vector has, at the very least, been moved out of its original location. Jumping up to a higher level with a breakpoint here:

fn calculate_user_reputation(user_reputation: &mut Vec<f64>,
    user_links: &Vec<usize>, object_reputation: &Vec<f64>,
    ratings: &Vec<Rating>, exponent: f64, min_divergence: f64) {
    let user_divergence: Vec<f64> =
        calculate_user_divergence(user_reputation.len(),
            object_reputation, ratings);

    for (u, rep) in user_reputation.iter_mut().enumerate() {
        if user_links[u] > 0 {
            let base = (user_divergence[u] / user_links[u] as f64) + min_divergence;
            *rep = base.powf(-exponent);
        } else {
            *rep = 0.0;
        }
    }
}

... I see something similar, with the breakpoint landing inside the for loop:

Breakpoint 1, dregrs::calculate_user_reputation::hd4e0a211e1ed0a9e (
    user_reputation=<optimised out>, exponent=0.80000000000000004, 
    min_divergence=9.9999999999999994e-37, user_links=<optimised out>, 
    object_reputation=<optimised out>, ratings=<optimised out>) at src/main.rs:90
90	        if user_links[u] > 0 {

If I try to put a breakpoint on the loop inside the YZLM::calculate_reputation method:

        loop {
            calculate_user_reputation(user_reputation,
                &user_links, object_reputation, ratings,
                self.exponent, self.min_divergence);

            reputation_buf.clone_from(object_reputation);

            calculate_object_reputation(object_reputation,
                user_reputation, ratings);

            *diff = 0.0;

            for (o, rep) in object_reputation.iter().enumerate() {
                let aux = rep - reputation_buf[o];
                *diff += aux * aux;
            }

... then when the breakpoint triggers I actually land right on the call to calculate_object_reputation several lines in. All of which suggests that the compiler has been doing some pretty hefty rewriting here.

Finally, I gave valgrind a go. First of all, callgrind indeed suggests that there's been quite a lot of rewriting under the hood:

--------------------------------------------------------------------------------
             Ir 
--------------------------------------------------------------------------------
157,354,143,374  PROGRAM TOTALS

--------------------------------------------------------------------------------
            Ir  file:function
--------------------------------------------------------------------------------
43,858,989,700  src/main.rs:dregrs::calculate_object_reputation::h8ca6352c62ab9961 [target/release/dregrs]
43,858,965,790  /checkout/src/libcore/slice/mod.rs:dregrs::calculate_object_reputation::h8ca6352c62ab9961
35,905,930,006  /checkout/src/libcore/slice/mod.rs:dregrs::main::h545a5626b48e9fa2
19,679,914,619  src/main.rs:dregrs::main::h545a5626b48e9fa2 [target/release/dregrs]
 3,989,012,895  /checkout/src/liballoc/vec.rs:dregrs::calculate_object_reputation::h8ca6352c62ab9961
 3,985,035,865  /checkout/src/liballoc/raw_vec.rs:dregrs::calculate_object_reputation::h8ca6352c62ab9961
 1,508,393,436  /checkout/src/liballoc/vec.rs:dregrs::main::h545a5626b48e9fa2
 1,102,219,525  /checkout/src/libcore/num/mod.rs:dregrs::main::h545a5626b48e9fa2
   801,600,000  /checkout/src/libcore/num/wrapping.rs:dregrs::main::h545a5626b48e9fa2
   652,467,541  /build/glibc-Cl5G7W/glibc-2.23/math/../sysdeps/ieee754/dbl-64/e_pow.c:__ieee754_pow_sse2 [/lib/x86_64-linux-gnu/libm-2.23.so]
   500,154,625  /checkout/src/libcore/ptr.rs:dregrs::main::h545a5626b48e9fa2

... and memcheck suggests that there is no heap allocation happening at all:

==5137== 
==5137== HEAP SUMMARY:
==5137==     in use at exit: 0 bytes in 0 blocks
==5137==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==5137== 
==5137== All heap blocks were freed -- no leaks are possible
==5137== 
==5137== For counts of detected and suppressed errors, rerun with: -v
==5137== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

... which makes me recall Steve Klabnik's remarks on escape analysis and heap vs. stack allocation in his recent article:

Is this likely to be what's happening here?

Did you use memcheck with the system allocator? I don't think it can see jemalloc allocations correctly. There shouldn't be any difference in how often each allocator is called, regardless of their relative performance.
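
For reference, here's a minimal sketch of opting into the system allocator, assuming a toolchain where the #[global_allocator] attribute is available (older setups used the unstable alloc_system crate instead):

use std::alloc::System;

// Route every allocation in the binary through the platform allocator
// (glibc malloc on Linux), so tools like memcheck can intercept it.
#[global_allocator]
static GLOBAL: System = System;

fn main() {
    let v = vec![0.0_f64; 1000]; // this allocation now goes through malloc/free
    println!("len = {}", v.len());
}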

Yeah, after thinking about it I started wondering the same. It seems the most likely explanation, as I don't think this program could work without allocating heap memory.