Decimator algorithm is too slow

Hello, I'm interested in digital signal processing, so I'm writing DSP primitives in Rust and C.
But I have a problem: my Rust code is too slow. The test platform is AArch64 (Odroid-C2).

Here is my decimator-code: https://github.com/hg2ecz/sdr-mixer-test

C code - basic:
Decimator(10) inspeed: 34.7 Msps
Decimator(10) outspeed: 3.5 Msps

C code - NEON optimized:
Decimator(10) inspeed: 51.2 Msps
Decimator(10) outspeed: 5.1 Msps

Rust: decimator-rust-basic
Decimator(10) inspeed: 12.4 Msps
Decimator(10) outspeed: 1.2 Msps

The decimator-function is here: https://github.com/hg2ecz/sdr-mixer-test/blob/master/decimator-rust-basic/src/decimator.rs#L90
It is too slow for me. How can I write 3…4x faster code in Rust?

Are you sure you’re running this with the --release flag enabled? Rust becomes a lot faster in release mode compared to debug mode.

I used cargo with the --release option (inspeed 12.4 Msps). Without this option the inspeed is only 0.2 Msps.

One thing is that your Rust code is allocating a new resvec on every call, whereas the C code receives a stack array from the caller.

You could at least create the Rust vector with Vec::with_capacity(usize) so the pushes won’t reallocate, but it would be more comparable if you also preallocated it in the caller, excluded from the timing measurements.
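A minimal sketch of that suggestion (the function name and the averaging kernel are made up for illustration; the thread's actual filter lives in decimator.rs): allocate the output buffer once in the caller, then reuse it across calls so allocation never shows up in the timing.

```rust
// Sketch (hypothetical signature): the caller owns the buffer,
// the function only clears and refills it.
fn decimate_into(out: &mut Vec<f32>, input: &[f32], factor: usize) {
    out.clear(); // keeps the existing capacity, frees nothing
    for block in input.chunks(factor) {
        // placeholder kernel: average each block of `factor` samples
        out.push(block.iter().sum::<f32>() / block.len() as f32);
    }
}

fn main() {
    let input: Vec<f32> = (0..20).map(|x| x as f32).collect();
    // allocate once, outside any timing loop
    let mut out = Vec::with_capacity(input.len() / 10);
    decimate_into(&mut out, &input, 10);
    assert_eq!(out, vec![4.5, 14.5]);
    println!("{:?}", out);
}
```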

5 Likes

The other obvious optimization would be to use iterators to avoid the bounds check on indexing. But first priority would be to avoid unnecessary allocations, since that is a design issue.
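To make the bounds-check point concrete, here is a small sketch (not the thread's actual inner loop) comparing indexed access with the zip form; since zip stops at the shorter slice, the compiler can drop the per-element checks.

```rust
// Indexed form: every a[j] and b[j] is bounds-checked.
fn dot_indexed(a: &[f32], b: &[f32]) -> f32 {
    let mut out = 0.0;
    for j in 0..b.len() {
        out += a[j] * b[j];
    }
    out
}

// Iterator form: zip truncates to the shorter slice, no checks needed.
fn dot_iter(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = [1.0f32, 2.0, 3.0];
    let b = [4.0f32, 5.0, 6.0];
    assert_eq!(dot_indexed(&a, &b), dot_iter(&a, &b));
    println!("{}", dot_iter(&a, &b));
}
```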

In the new test I use a preallocated variable for the output samples (resvec):

    pub fn decimator(&mut self, outsample: &mut Vec<Complex<f32>>, sample: &Vec<Complex<f32>>) {
        outsample.clear();
        ...

and I use unchecked indexing for the inputs:

    self.sampledecimmixbuf.push( unsafe { sample.get_unchecked(i) * self.oscillator } );
    ...
    ...
    for j in 0..self.coeff.len() { // instead of iterator
         //out += self.sampledecimmixbuf[i+j] * self.coeff[j];
         out += unsafe { self.sampledecimmixbuf.get_unchecked(i+j) * self.coeff.get_unchecked(j) };
    }
    outsample.push(out); // here should be an unsafe but fast trick

Decimator(10) inspeed: 14.0 Msps … 15 % faster than the basic example, but still much slower than the basic C code (34.7 Msps).

If I don’t use the mixer part (is_mix = false), the inspeed is 18.3 Msps. That’s still slow compared to basic C (53.5 Msps without the mixer).

Any idea?

@Zsolt your code is not very idiomatic. First, it takes a reference to a Vec rather than a slice.

Next, you are using index-based access into the vec instead of an iterator, e.g. for element in &sample. The same goes for the next loop.

Last, don’t use return as the last statement.

I know this won’t help with your performance problems, but it makes you a better Rust programmer :wink: (Btw, Clippy is a great tool which tells you exactly this and much more. You can install it with rustup component add clippy and then run cargo clippy in your project.)
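Putting those three points together in one sketch (the function and its body are illustrative, not the thread's actual code): take slices instead of &Vec, iterate by reference, and let the final expression be the return value.

```rust
// Idiomatic shape: &[f32] instead of &Vec<f32>, iterator instead of
// indexing, and an expression instead of `return`.
fn scale(samples: &[f32], gain: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(samples.len());
    for s in samples {   // instead of samples[i] with i in 0..len
        out.push(s * gain);
    }
    out                  // instead of `return out;`
}

fn main() {
    let out = scale(&[1.0, 2.0], 2.0);
    assert_eq!(out, vec![2.0, 4.0]);
    println!("{:?}", out);
}
```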

I’ve just been looking at your current code which is using iterators and saw one little place you could improve it:

    resvec.push(self.sampledecimmixbuf[i..i + self.coeff.len()].iter().zip(&self.coeff).map(|(sa, co)| sa * co).sum());

You can actually omit the stop point of the slice (i + self.coeff.len()), since zip will automatically truncate according to the shorter of the two slices. This removes one fiddly place to make a mistake and also reduces by one the number of bounds checks the compiler needs to make. Any place where I need to ask myself “is there an error here?” is a place where the compiler needs to add a runtime check to do the same. It’s not likely to affect your runtime, but it makes your code prettier.
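A standalone demonstration of the truncation point (buffer and coefficients are made-up values, not the thread's data): the open-ended slice gives the same sum, because zip stops after coeff.len() pairs.

```rust
fn main() {
    let buf = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
    let coeff = [0.5f32, 0.25, 0.25];
    let i = 2;

    // Explicit end point on the slice.
    let a: f32 = buf[i..i + coeff.len()].iter().zip(&coeff).map(|(s, c)| s * c).sum();

    // Open-ended slice: zip truncates to coeff's length anyway.
    let b: f32 = buf[i..].iter().zip(&coeff).map(|(s, c)| s * c).sum();

    assert_eq!(a, b);
    println!("{}", b);
}
```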

Thank you for cargo clippy. I corrected my code.

It looks like you could simplify your nested loop by using chunks to get the successive slices of self.sampledecimmixbuf. That could save some bounds checking, since chunks computes each slice’s bounds itself. But more importantly, it would express what your code is doing more clearly.

Amusingly, this would make irrelevant my previous suggestion, since you’d no longer be manually taking slices of this.

Bounds-checks and memory allocations are entirely irrelevant in this case; profiling it with perf shows that almost all time is spent performing floating math operations.

When I run the code, I get:

C basic
Decimator(10) inspeed: 146.5 Msps
Decimator(10) outspeed: 14.7 Msps

Rust
Decimator(10) inspeed: 97.8 Msps
Decimator(10) outspeed: 9.8 Msps

I notice a couple of things:

  • The C code returns real numbers while the Rust code returns complex numbers. This seems…unusual? This repository really needs an easy way to test that the programs are producing the same output. (Were you aware that the Rust output contains a NaN?)
  • The C code has -Ofast, which in turn enables -ffast-math, causing it to perform optimizations that assume float operations are associative and distributive. If I disable it, this severely hampers the performance:
CFLAGS=-Wall -Ofast -march=native -funroll-all-loops
Decimator(10) inspeed: 146.5 Msps
Decimator(10) outspeed: 14.7 Msps

CFLAGS=-Wall -O3 -march=native -funroll-all-loops
Decimator(10) inspeed: 49.6 Msps
Decimator(10) outspeed: 5.0 Msps
  • The C build also optimizes for the target architecture. Doing this for the Rust build does not seem to do much on my machine:
cargo run --release
Decimator(10) inspeed: 97.8 Msps
Decimator(10) outspeed: 9.8 Msps

RUSTFLAGS="-C target-cpu=native" cargo run --release
Decimator(10) inspeed: 98.5 Msps
Decimator(10) outspeed: 9.8 Msps
7 Likes

Kudos for proper profiling, rather than making guesses (guilty).

It would also be good to use CC=clang, so we can compare the languages with the same underlying codegen involved, rather than GCC vs. LLVM.

RE: -Ofast and -ffast-math, LLVM opt supports -fp-contract=fast, but I haven’t figured out how to pass that through rustc. At least, it doesn’t work in -Cllvm-args, but maybe we can figure out exactly what Clang is passing when it gets -ffast-math.
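One partial, source-level alternative (my suggestion, not something from the thread): f32::mul_add requests a fused multiply-add in the inner product. Note the caveats: it guarantees single rounding, so it is not the same relaxation -ffast-math applies, and on targets without hardware FMA it falls back to a libm call and can be slower.

```rust
// Dot product accumulated with fused multiply-adds via f32::mul_add.
fn dot_fma(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).fold(0.0f32, |acc, (x, y)| x.mul_add(*y, acc))
}

fn main() {
    let a = [1.0f32, 2.0, 3.0];
    let b = [4.0f32, 5.0, 6.0];
    let r = dot_fma(&a, &b);
    assert!((r - 32.0).abs() < 1e-6);
    println!("{}", r);
}
```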

1 Like

Good catch with gcc versus clang. Looks like clang does better under -O3.

CC=clang CFLAGS="-Wall -Ofast -march=native -funroll-all-loops"
Decimator(10) inspeed: 147.5 Msps
Decimator(10) outspeed: 14.8 Msps

CC=clang CFLAGS="-Wall -O3 -march=native -funroll-all-loops"
Decimator(10) inspeed: 97.1 Msps
Decimator(10) outspeed: 9.7 Msps

So the Rust code performs comparably to how the C code would under -O3.

6 Likes