How to speed up this rust code? I'm measuring a 30% slowdown versus the C++ version

A while ago I did a comparison between C and Rust to start learning the language and also to check out the relative performance between the two, since the main times I drop down to C are for performance reasons. I recently brought the Rust code up to compliance with 1.0 and re-ran the tests, and in my results, Rust is only able to reach 70% of the speed that I get with the C++ implementation.

Since both compile to native and Rust aims for zero-cost abstractions, ideally the Rust version should not be slower than the C++ version. I still consider myself a newbie to Rust so it's possible I'm doing a few things (or more than a few things!) that I shouldn't be. Could this be related to bounds check or something else?

Here's the code: https://gist.github.com/anonymous/fd213a7b8ffdcfd219dc

Does it perform better if you use iterators?

for (input_ptr, output_ptr) in input.iter().zip(output) {
    push_sample(filter_state, *input_ptr);
    apply_lowpass_single(filter_state);
    *output_ptr = get_output_sample(filter_state);
}

I know the .zip() iterator is sadly not always compiling to perfect code, but give it a shot.

I don't have any performance insights, but I can answer a question you posed in the source code.

In dsp.rs:55 you said, "Note: I didn't understand how to reference y."

The way that you've got your functions set up: you can't take a mutable reference to y at this point. I suspect you got a compiler error on line 60, reading something like "cannot borrow *filter_state as immutable because filter_state.input is also borrowed as mutable."

By taking your y reference, you've borrowed part of filter_state as mutable. Rust's borrowing rules are such that, if anyone has borrowed some or all of an object mutably, no immutable references can be created.

Interestingly, your x immutable reference works fine: Rust is smart enough to see that x and y reference different parts of filter_state.

So where's the immutable reference that offends the borrow checker? In the signature of get_offset:

fn get_offset(filter_state : &FilterState, relative_offset : i32) -> usize

Here, get_offset is saying "I want an immutable reference to an entire FilterState." But you can't do that with y alive.

However, if you read the body of get_offset, it actually only uses filter_state.current. If you switch it to accepting that u32 as a parameter in place of the entire FilterState, you should be good to go.

This seems to be one of those weird stochastic deoptimisations.

I had a look at the asm, and the only major difference between the C++ and the Rust was that apply_lowpass wasn't inlined in C++ but it was in Rust (the actual instructions looked similar, but I didn't dig in deep), so I changed changed the #[inline] on apply_lowpass to #[inline(never)]:

$ ./before 
Beginning Rust tests...


Dummy:0
Iterations:442
Rust Results: 46275501 shorts per second.


$ ./after 
Beginning Rust tests...


Dummy:0
Iterations:645
Rust Results: 67533361 shorts per second.


$ ./dsp 
Beginning C tests...

Dummy:0
Iterations:628
C results: 65848017 shorts per second.
12 Likes

That was it! Now the rustc version is about 10% faster than clang++ on my machine. Thanks so much! :slight_smile:

I tried compiling with the C++ version with -flto and I noticed the same slowdown there, so indeed it must be something related to the inlining. Interestingly enough, if I remove all of the inline tags I still get the slowdown, so the #[inline(never)] has to be there on apply_lowpass() for the perf to match the C++ version, which I suppose will not inline across files unless -flto is enabled.

This could also be a good test of backend perf. The last time I tried g++ I got much better perf than either rustc or clang++ on this test case, but then that's a comparison of backends rather than languages.

2 Likes

I was curious about the ASM since I'm wondering how Rust would actually manage to get more performance out of this even though the same backends are being used. There actually seem to be more instructions in the Rust ASM so I'm not sure what's happening. Could it be related to aliasing guarantees?

  1. The Rust version seems to have some sort of a "stack protector"
    where it does a comparison and calls __morestack if needed.
  2. The C++ version has this but I'm not sure why: testl %ecx, %ecx
  3. There's one bounds check in the Rust version, I believe at the line output[i as usize] = get_output_sample(filter_state);. I guess it's not able to statically see that this can't exceed bounds?
  4. There are some differences in the inner loop in places. Not sure why if the algorithm is the same.
  5. Using @bluss's suggestion of the iterator removed all of the bounds checks but also slowed down the code overall by nearly 10%. Again, I don't know why. I can see the differences by comparing the .s files and the iterator version does seem to result in a bunch more instructions at the end of the loop, which seem to be more costly than the bounds check.

Anyway I find this all interesting, and I'm very happy that Rust is doing so well here even with the bounds check.

1 Like

Hi cbiffle,

Thanks for addressing this part; I had actually forgotten completely about that because that was an old comment from last year when I first looked at Rust, but it's very relevant because I didn't understand borrowing at all then, and probably still don't understand it very well now. :wink: I'll try your ideas out and see if I can get it working like that, and if not, maybe I'll learn more about borrowing along the way.

As for the stack overflow checking in Rust, you can get a closer apples-to-apples comparison by adding -fstack-check to your C++ build (assuming you're using GCC).

This is unlikely to make C++ faster, though. :wink:

I actually went the other way and turned off the stack protection with rust, but I didn't see any difference outside of regular variance. :wink: So for this test case at least, the overhead of the protection is low enough that it's not noticeable in the tests. It would be fun to turn off bounds check (and perhaps unwinding support) as well but I don't think there's an option for that?

Just for reference, there's theoretically an alternative to stack protection prologs that maintains safety and has almost no overhead, but requires either aborting the whole process on overflow or (on platforms other than Windows?) having 'runtime' support (signal handler, etc.) to unwind properly. As described in this year old issue:

I have no idea if anyone is planning to implement it, though.

edit: You can disable unwinding with -Z no-landing-pads. AFAIK this may slightly improve optimization.

1 Like

I think a lot of it doesn't matter in microbenchmarks because icache isn't a constraint there, but the amount of bloat that the average Rust program's assembly output has is really unfortunate (among other things, it makes it harder to actually see the parts you care about). We should definitely switch to guard pages, I basically always compile using them instead when I'm optimizing just for my own sanity. :stuck_out_tongue: Getting the options to abort instantly on panic would also tremendously reduce code bloat.

That's the kind of thing I'm worried about, too. I didn't notice any changes to perf with this test, but in a larger code base, I'd be worried about "death by a thousand papercuts". Being able to turn off unwinding and only having panics for critical errors would be really nice. From what I've seen the Rust team is on board with this, so it's just a matter of time.

I wonder how the Rust team feels about having an option to turn off all runtime checks including bounds checks? In Swift there's -Ounchecked which I believe will turn off all runtime checks. For some types of software like games, the risk of crashes or segfaults may be an acceptable tradeoff for the boost to runtime performance. I'm curious to see how the story evolves here. :slight_smile:

This we know. It's been decided since long there will be no such option.

I guess I can understand the reasoning behind that, since if the option was there a lot of people might be tempted to use it, which brings us back to undefined behavior. I guess if someone -really- wanted to have it, they could also just fork the compiler and turn the checks off that way.

Or you could just use unsafe constructs that don't do any runtime checks. It's less convenient, but that's kinda the whole point.

If you need more speed, go for it, but Rust won't let you ignore that what you're doing is potentially unsafe.

4 Likes

I am still a Rust rookie. I am the developer of the Sappeur language and compiler, though.

Three points:

A) "30% slower" is in my opinion still very good. Rust is still in its infancy.

B) Micro-benchmarks are often not very useful.

C) There is more than one dimension in the IT industry. Too long we have focused on "performance and nothing but performance". If the price of much better security is a 30% performance reduction, we get an excellent deal.

Also see http://frankgerlach.d-n-s.name/Ansaetze.html

That's interesting! I would have expected index bounds checking to be the culprit.

As for filtering: I actually wrote some filtering code in Rust using SIMD by manually pipelining four biquads. That was pretty fast which was what I was going for (software-defined radio stuff usually means very high sampling rates). If anyone is interested, I'll try to find it again...

(biquad = second order IIR filter).

With fixed sized arrays and the static % I would expect that many of the checks get removed. And, even if they aren't, bounds checking is predicted very well and there's no much code in these microbenchmarks, so I would imagine there's no significant i-cache pressure.

Maybe compiler options that trade safety to performance (like turned off bounds checking) should have unsafe in their name? Like -Z unsafe-no-bounds-check?

what would be the switch that I need to activate on rust and C++ for compare code (and what other tools do you use for this type of analysis?)