How to speed up this rust code? I'm measuring a 30% slowdown versus the C++ version

loes · May 20, 2015, 10:10pm

A while ago I did a comparison between C and Rust to start learning the language and also to check out the relative performance between the two, since the main times I drop down to C are for performance reasons. I recently brought the Rust code up to compliance with 1.0 and re-ran the tests, and in my results, Rust is only able to reach 70% of the speed that I get with the C++ implementation.

Since both compile to native and Rust aims for zero-cost abstractions, ideally the Rust version should not be slower than the C++ version. I still consider myself a newbie to Rust so it's possible I'm doing a few things (or more than a few things!) that I shouldn't be. Could this be related to bounds check or something else?

Here's the code: https://gist.github.com/anonymous/fd213a7b8ffdcfd219dc

bluss · May 20, 2015, 10:56pm

Does it perform better if you use iterators?

for (input_ptr, output_ptr) in input.iter().zip(output) {
    push_sample(filter_state, *input_ptr);
    apply_lowpass_single(filter_state);
    *output_ptr = get_output_sample(filter_state);
}

I know the .zip() iterator is sadly not always compiling to perfect code, but give it a shot.

cbiffle · May 20, 2015, 11:31pm

I don't have any performance insights, but I can answer a question you posed in the source code.

In dsp.rs:55 you said, "Note: I didn't understand how to reference y."

The way that you've got your functions set up: you can't take a mutable reference to y at this point. I suspect you got a compiler error on line 60, reading something like "cannot borrow *filter_state as immutable because filter_state.input is also borrowed as mutable."

By taking your y reference, you've borrowed part of filter_state as mutable. Rust's borrowing rules are such that, if anyone has borrowed some or all of an object mutably, no immutable references can be created.

Interestingly, your x immutable reference works fine: Rust is smart enough to see that x and y reference different parts of filter_state.

So where's the immutable reference that offends the borrow checker? In the signature of get_offset:

fn get_offset(filter_state : &FilterState, relative_offset : i32) -> usize

Here, get_offset is saying "I want an immutable reference to an entire FilterState." But you can't do that with y alive.

However, if you read the body of get_offset, it actually only uses filter_state.current. If you switch it to accepting that u32 as a parameter in place of the entire FilterState, you should be good to go.

huon · May 21, 2015, 6:27am

This seems to be one of those weird stochastic deoptimisations.

I had a look at the asm, and the only major difference between the C++ and the Rust was that apply_lowpass wasn't inlined in C++ but it was in Rust (the actual instructions looked similar, but I didn't dig in deep), so I changed changed the #[inline] on apply_lowpass to #[inline(never)]:

$ ./before 
Beginning Rust tests...


Dummy:0
Iterations:442
Rust Results: 46275501 shorts per second.


$ ./after 
Beginning Rust tests...


Dummy:0
Iterations:645
Rust Results: 67533361 shorts per second.


$ ./dsp 
Beginning C tests...

Dummy:0
Iterations:628
C results: 65848017 shorts per second.

loes · May 21, 2015, 1:30pm

That was it! Now the rustc version is about 10% faster than clang++ on my machine. Thanks so much!

I tried compiling with the C++ version with -flto and I noticed the same slowdown there, so indeed it must be something related to the inlining. Interestingly enough, if I remove all of the inline tags I still get the slowdown, so the #[inline(never)] has to be there on apply_lowpass() for the perf to match the C++ version, which I suppose will not inline across files unless -flto is enabled.

This could also be a good test of backend perf. The last time I tried g++ I got much better perf than either rustc or clang++ on this test case, but then that's a comparison of backends rather than languages.

loes · May 21, 2015, 2:02pm

I was curious about the ASM since I'm wondering how Rust would actually manage to get more performance out of this even though the same backends are being used. There actually seem to be more instructions in the Rust ASM so I'm not sure what's happening. Could it be related to aliasing guarantees?

The Rust version seems to have some sort of a "stack protector"
where it does a comparison and calls __morestack if needed.
The C++ version has this but I'm not sure why: testl %ecx, %ecx
There's one bounds check in the Rust version, I believe at the line output[i as usize] = get_output_sample(filter_state);. I guess it's not able to statically see that this can't exceed bounds?
There are some differences in the inner loop in places. Not sure why if the algorithm is the same.
Using @bluss's suggestion of the iterator removed all of the bounds checks but also slowed down the code overall by nearly 10%. Again, I don't know why. I can see the differences by comparing the .s files and the iterator version does seem to result in a bunch more instructions at the end of the loop, which seem to be more costly than the bounds check.

Anyway I find this all interesting, and I'm very happy that Rust is doing so well here even with the bounds check.

loes · May 21, 2015, 3:45pm

Hi cbiffle,

Thanks for addressing this part; I had actually forgotten completely about that because that was an old comment from last year when I first looked at Rust, but it's very relevant because I didn't understand borrowing at all then, and probably still don't understand it very well now. I'll try your ideas out and see if I can get it working like that, and if not, maybe I'll learn more about borrowing along the way.

cbiffle · May 21, 2015, 4:00pm

As for the stack overflow checking in Rust, you can get a closer apples-to-apples comparison by adding -fstack-check to your C++ build (assuming you're using GCC).

This is unlikely to make C++ faster, though.

loes · May 21, 2015, 5:13pm

I actually went the other way and turned off the stack protection with rust, but I didn't see any difference outside of regular variance. So for this test case at least, the overhead of the protection is low enough that it's not noticeable in the tests. It would be fun to turn off bounds check (and perhaps unwinding support) as well but I don't think there's an option for that?

comex · May 21, 2015, 5:22pm

Just for reference, there's theoretically an alternative to stack protection prologs that maintains safety and has almost no overhead, but requires either aborting the whole process on overflow or (on platforms other than Windows?) having 'runtime' support (signal handler, etc.) to unwind properly. As described in this year old issue:

github.com/rust-lang/rust

Replace stack overflow checking with stack probes

opened 08:28PM - 26 Jul 14 UTC

closed 09:12PM - 14 Jul 17 UTC

Zoxc

A-codegen P-medium T-compiler I-unsound

We currently abuse LLVM segmented stack support to check for stack overflows. It… would be more efficient to use guard pages to detect these. We already have guard pages on all the stacks. However to ensure that the code doesn't skip the guard pages, we need to insert stack probes. LLVM already can generate code for that on x86 and ARM for Windows. We'd just need to expose that as an option on other platforms. It would be nice if support for stack probes could be added to MIPS in LLVM too, so we can get rid of the runtime support for the stack overflow checking. Using stack probes is also easy and desirable to support in freestanding mode.

I have no idea if anyone is planning to implement it, though.

edit: You can disable unwinding with -Z no-landing-pads. AFAIK this may slightly improve optimization.

pythonesque · May 22, 2015, 1:12am

I think a lot of it doesn't matter in microbenchmarks because icache isn't a constraint there, but the amount of bloat that the average Rust program's assembly output has is really unfortunate (among other things, it makes it harder to actually see the parts you care about). We should definitely switch to guard pages, I basically always compile using them instead when I'm optimizing just for my own sanity. Getting the options to abort instantly on panic would also tremendously reduce code bloat.

loes · May 22, 2015, 2:49pm

That's the kind of thing I'm worried about, too. I didn't notice any changes to perf with this test, but in a larger code base, I'd be worried about "death by a thousand papercuts". Being able to turn off unwinding and only having panics for critical errors would be really nice. From what I've seen the Rust team is on board with this, so it's just a matter of time.

I wonder how the Rust team feels about having an option to turn off all runtime checks including bounds checks? In Swift there's -Ounchecked which I believe will turn off all runtime checks. For some types of software like games, the risk of crashes or segfaults may be an acceptable tradeoff for the boost to runtime performance. I'm curious to see how the story evolves here.

bluss · May 22, 2015, 3:36pm

This we know. It's been decided since long there will be no such option.

loes · May 23, 2015, 1:47am

I guess I can understand the reasoning behind that, since if the option was there a lot of people might be tempted to use it, which brings us back to undefined behavior. I guess if someone -really- wanted to have it, they could also just fork the compiler and turn the checks off that way.

DanielKeep · May 23, 2015, 7:36am

Or you could just use unsafe constructs that don't do any runtime checks. It's less convenient, but that's kinda the whole point.

If you need more speed, go for it, but Rust won't let you ignore that what you're doing is potentially unsafe.

frank_gerlach · June 30, 2015, 1:54pm

I am still a Rust rookie. I am the developer of the Sappeur language and compiler, though.

Three points:

A) "30% slower" is in my opinion still very good. Rust is still in its infancy.

B) Micro-benchmarks are often not very useful.

C) There is more than one dimension in the IT industry. Too long we have focused on "performance and nothing but performance". If the price of much better security is a 30% performance reduction, we get an excellent deal.

Also see http://frankgerlach.d-n-s.name/Ansaetze.html

sellibitze · June 30, 2015, 2:52pm

That's interesting! I would have expected index bounds checking to be the culprit.

As for filtering: I actually wrote some filtering code in Rust using SIMD by manually pipelining four biquads. That was pretty fast which was what I was going for (software-defined radio stuff usually means very high sampling rates). If anyone is interested, I'll try to find it again...

(biquad = second order IIR filter).

huon · June 30, 2015, 4:24pm

With fixed sized arrays and the static % I would expect that many of the checks get removed. And, even if they aren't, bounds checking is predicted very well and there's no much code in these microbenchmarks, so I would imagine there's no significant i-cache pressure.

vi0 · July 13, 2015, 9:47am

Maybe compiler options that trade safety to performance (like turned off bounds checking) should have unsafe in their name? Like -Z unsafe-no-bounds-check?

tyoc213 · August 9, 2015, 11:31pm

what would be the switch that I need to activate on rust and C++ for compare code (and what other tools do you use for this type of analysis?)

Topic		Replies	Views
Performance questions code review	16	844	September 30, 2020
Simple Rust and C# performance comparison help	12	10216	September 19, 2020
Rust performance help (convolution) help	37	3440	November 2, 2020
Looking for help understanding Rust's performance vs C++ community	28	6971	November 1, 2019
M:N Threading, C, Rust and Java? help	7	2049	January 12, 2023

How to speed up this rust code? I'm measuring a 30% slowdown versus the C++ version

Related Topics