Vec's overhead compared to C++ for a DSP application

Hello everyone!

To choose between Rust and C++ for a new project, I'm comparing both languages (and learning Rust in the process) with an IIR biquad filter implementation.

C++ DSP bench repo
Rust DSP bench repo
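
For context, the kernel being benchmarked is a biquad section; here is a minimal sketch of a Direct Form I implementation (struct and field names are illustrative, not necessarily those used in the repos):

// Direct Form I biquad:
// y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
struct Biquad {
    b0: f64, b1: f64, b2: f64, // feedforward coefficients
    a1: f64, a2: f64,          // feedback coefficients
    x1: f64, x2: f64,          // previous two input samples
    y1: f64, y2: f64,          // previous two output samples
}

impl Biquad {
    fn process(&mut self, x: f64) -> f64 {
        let y = self.b0 * x + self.b1 * self.x1 + self.b2 * self.x2
            - self.a1 * self.y1 - self.a2 * self.y2;
        self.x2 = self.x1;
        self.x1 = x;
        self.y2 = self.y1;
        self.y1 = y;
        y
    }
}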

##Benchmark results

###Raspberry Pi 3

####C++
#####clang 3.5.0

94482 ns per loop

#####gcc 4.9

91171 ns per loop

####Rust (stable 1.14)

iir_vec:                        166215 ns per loop
iir_vec_2:                      123959 ns per loop
iir_array (4096):               78659 ns per loop
iir_slice (array):              75342 ns per loop
iir_slice (4096 array):         78747 ns per loop
iir_slice_unsafe (4096 array):  78827 ns per loop
iir_slice (vec):                155377 ns per loop
iir_slice_unsafe (vec):         155396 ns per loop

###i7-2630QM CPU @ 2.00GHz laptop

####C++
#####clang 3.8.0

16184 ns per loop

#####gcc 5.4

17760 ns per loop

####Rust (stable 1.14)

iir_vec:                        31749 ns per loop
iir_vec_2:                      21435 ns per loop
iir_array (4096):               17156 ns per loop
iir_slice (array):              17152 ns per loop
iir_slice (4096 array):         15667 ns per loop
iir_slice_unsafe (4096 array):  15741 ns per loop
iir_slice (vec):                31589 ns per loop
iir_slice_unsafe (vec):         31613 ns per loop

##Observations

  • When working with a slice of a fixed-size array, Rust outperforms both GCC and clang C++ on ARM Cortex-A53 and on Intel :thumbsup:
  • A slice from an array performs great, but a slice from a Vec is twice as slow (and slower than C++).
  • This algorithm's performance with Vec depends a lot on how many reads and writes are performed: iir_vec_2 is almost identical to iir_vec but about 50% faster (see the sketch after this list).
    Another version, positioning output[i] = outval at the loop's end, was slower.
  • GCC beats clang on this ARM platform in this case, while clang is faster on Intel.
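
To make the read/write sensitivity concrete, here is roughly the shape of the two variants (a sketch built on the Biquad above, not the exact benchmark code; my reading is that iir_vec_2 accumulates in a local before storing):

// iir_vec-style: output[i] is written and then read back inside the
// loop, so each iteration performs several bounds-checked accesses.
fn iir_vec(input: &Vec<f64>, output: &mut Vec<f64>, bq: &mut Biquad) {
    for i in 0..input.len() {
        output[i] = bq.b0 * input[i] + bq.b1 * bq.x1 + bq.b2 * bq.x2
            - bq.a1 * bq.y1 - bq.a2 * bq.y2;
        bq.x2 = bq.x1;
        bq.x1 = input[i];
        bq.y2 = bq.y1;
        bq.y1 = output[i]; // reads output[i] back
    }
}

// iir_vec_2-style: accumulate in a local, store into output[i] once.
fn iir_vec_2(input: &Vec<f64>, output: &mut Vec<f64>, bq: &mut Biquad) {
    for i in 0..input.len() {
        let outval = bq.process(input[i]);
        output[i] = outval;
    }
}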

##Questions

  • Is it expected that the same algorithm with Vec runs at half the speed of an array slice?
  • How would you rewrite this algorithm?
  • In the real project, defining the buffer size at runtime will be required, but that doesn't seem to be possible with arrays in Rust.
    Would you recommend setting buffers to the maximum array size and slicing them to the desired size (performance is lower that way)?
  • Are there structures better suited to contain buffers in that case, like Box?
  • I'll try to SIMD this algorithm later by running the IIR filter on several channels simultaneously: would that change the type of data structures a lot?

Thanks :slight_smile:

10 Likes

One change you can make that might improve performance is to use the iterator methods on Vec. Indexing does a bounds check each time, so instead you could call input.iter().zip(output.iter_mut()).map(|(x, y)| { ... }) (untested, on mobile). The compiler is really good at optimizing code like this; it should dramatically reduce the amount of bounds checking being done.

1 Like

Thanks @Nashenas88, that makes sense; I'll try right away with zip() on Vec and arrays.

From the rust-beginners IRC channel, where Icefoz is experimenting as well:

Aha, interesting. In --release mode, the iir_slice() function gets inlined into main().
So yes, it can and does generate rather different code for the two.

Compare iir_slice_array · GitHub and iir_slice_vec · GitHub
The array version gets unrolled harder, it looks like, and it does indeed know exactly how long the array is.
(See the cmp $0x1000, %rcx on line 50)

On my machine there's no significant difference between iir_array and iir_slice being passed an array.
gist:3bb8fa75fa69b77b58238a5b350c855b · GitHub
I just checked with 1.14 and got the same result, and I also get the same result on an AMD APU system.

@Nashenas88 I now tried with zip(), but I had to use a for loop instead of map() since the routine does not return anything (and map on an iterator is lazy anyway).

i7-2630QM CPU @ 2.00GHz laptop, Rust (stable 1.14)

iir_vec:                        31870 ns per loop
iir_vec_zip:                    31678 ns per loop
iir_vec_zip_2:                  31504 ns per loop
iir_vec_2:                      21545 ns per loop
iir_array (4096):               17231 ns per loop
iir_slice (array):              17119 ns per loop
iir_slice (4096 array):         15782 ns per loop
iir_slice_unsafe (4096 array):  15680 ns per loop
iir_slice (vec):                31801 ns per loop
iir_slice_unsafe (vec):         31620 ns per loop

Unfortunately it provides no improvement compared to iir_vec.
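
For reference, the zip loop now looks roughly like this (a sketch, with bq.process standing in for the filter math):

// iter() yields &f64 and iter_mut() yields &mut f64, so the output
// sample is written through the mutable reference with *y.
fn iir_vec_zip(input: &Vec<f64>, output: &mut Vec<f64>, bq: &mut Biquad) {
    for (x, y) in input.iter().zip(output.iter_mut()) {
        *y = bq.process(*x);
    }
}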
Am I doing it correctly by using *y like that?

By the way, the C++ version is probably being slowed down by the compiler assuming the buffers of in and out (or even the vector objects themselves, depending on inlining) could alias. It might be interesting to compare a version that takes raw pointers using __restrict to indicate no aliasing.
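
For what it's worth, safe Rust encodes that no-alias guarantee in the signature itself; whether the compiler currently exploits it is another question, but the information is there without any annotation. A sketch:

// `output` is an exclusive (&mut) borrow, so by Rust's rules it cannot
// overlap `input`; this is what __restrict has to state by hand in C/C++.
fn iir(input: &[f64], output: &mut [f64], bq: &mut Biquad) {
    // ... filter body ...
}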

2 Likes

You might want to check out How to “zip” two slices efficiently, where tricks to enable the compiler to avoid bounds checks while indexing slices are explained.

It looks like a bit of extra work enables the compiler to prove that all indexing is within bounds.
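
Applied to this benchmark, the trick would look something like this (a sketch):

use std::cmp::min;

// Reslicing both buffers to a length the compiler can see they share
// lets it prove every index below `len` is in bounds, so the checks
// inside the loop can be elided.
fn iir(input: &[f64], output: &mut [f64], bq: &mut Biquad) {
    let len = min(input.len(), output.len());
    let input = &input[..len];
    let output = &mut output[..len];
    for i in 0..len {
        output[i] = bq.process(input[i]);
    }
}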

1 Like

There has been a lot of work on optimizing zip since that article.

> There has been a lot of work on optimizing zip since that article.

I didn't know. That's cool! But how come these optimizations don't seem to happen here? My interpretation of these benchmarks is that the best performance comes when the compiler is able to prove that all indexing accesses are in bounds, but in my opinion an efficient zip should result in similar performance.

On my machine, using zip or explicitly slicing the slices to the same length improves the performance of iir_slice (vec), but not to the level of iir_slice (array) and iir_slice (4096). I think this shows the effect of bounds checks.

My guess about the rest of the performance gap would be that the compiler does not see that the vector has a constant length, while it performs some aggressive optimizations based on the total size in the slice and array cases.

Alright, I followed your recommendations and extended the tests.
There's a lot to see.

###Benchmark results

####Raspberry Pi 3
#####C++
######clang 3.5.0

iir:    94665 ns per loop
iir_2:  90119 ns per loop

######gcc 4.9

iir:    90837 ns per loop
iir_2:  90292 ns per loop

#####Rust (stable 1.14)

iir_vec:                        159611 ns per loop
iir_vec_2:                      124729 ns per loop
iir_vec_3:                      111157 ns per loop
iir_vec_zip:                    166023 ns per loop
iir_vec_zip_2:                  156296 ns per loop
iir_vec_zip_3:                  110870 ns per loop
iir_vec_zip_4:                  107475 ns per loop
iir_vec_zip_5:                  106412 ns per loop
iir_slice (vec):                149067 ns per loop
iir_slice_zip (vec):            103591 ns per loop
iir_slice_zip_2 (vec):          103542 ns per loop
iir_slice_unsafe (vec):         149074 ns per loop
iir_array (4096):               79194 ns per loop
iir_slice (sliced array):       75384 ns per loop
iir_slice (whole array):        78987 ns per loop
iir_slice_zip (whole array):    79207 ns per loop
iir_slice_zip_2 (sliced array): 71970 ns per loop
iir_slice_zip_2 (whole array):  78833 ns per loop
iir_slice_unsafe (whole array): 78812 ns per loop

####i7-2630QM CPU @ 2.00GHz laptop
#####C++
######clang 3.8.0

iir:    16418 ns per loop
iir_2:  16387 ns per loop

######gcc 5.4

iir:    17834 ns per loop
iir_2:  17681 ns per loop

#####Rust (stable 1.14)

iir_vec:                        32107 ns per loop
iir_vec_2:                      21450 ns per loop
iir_vec_3:                      18679 ns per loop
iir_vec_zip:                    31697 ns per loop
iir_vec_zip_2:                  31651 ns per loop
iir_vec_zip_3:                  18572 ns per loop
iir_vec_zip_4:                  18684 ns per loop
iir_vec_zip_5:                  18587 ns per loop
iir_slice (vec):                31592 ns per loop
iir_slice_zip (vec):            18551 ns per loop
iir_slice_zip_2 (vec):          18832 ns per loop
iir_slice_unsafe (vec):         31482 ns per loop
iir_array (4096):               17197 ns per loop
iir_slice (sliced array):       17201 ns per loop
iir_slice (whole array):        15719 ns per loop
iir_slice_zip (whole array):    17101 ns per loop
iir_slice_zip_2 (sliced array): 19249 ns per loop
iir_slice_zip_2 (whole array):  17131 ns per loop
iir_slice_unsafe (whole array): 15655 ns per loop

###Observations

  • zip alone does not help
  • Other optimizations help, but in each case zip doesn't make much difference for Vecs or arrays compared to plain indexing.
  • for (&x, y) with zip is slightly faster than for (x, y), since the &x pattern binds x by value and the loop body needs no dereference (see the sketch after this list).
  • Some naive optimization of the C++ version provides marginally faster results. I also fixed a bug in the algorithm which had led me to believe C++ was much faster here. Thanks to stephaneyfx on rust-beginners IRC for identifying it!
  • On ARM / Raspberry Pi 3 the Rust array version remains faster. This is my target, so I'm happy with that.
  • On ARM / Pi 3, the fastest array version is still more than 40% faster than the fastest Vec one.
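
The pattern difference from the third observation, concretely (a sketch):

// `&x` destructures the &f64 yielded by iter(), binding x: f64 by copy,
// so the body needs no dereference; y stays a &mut f64 for the store.
fn iir_zip(input: &[f64], output: &mut [f64], bq: &mut Biquad) {
    for (&x, y) in input.iter().zip(output.iter_mut()) {
        *y = bq.process(x);
    }
}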

###Questions

  • In the real software, I can think of a hack with a few methods like fn get_buffer_32() -> [f64; 32] { [0.0; 32] } and so on for each possible buffer size (from 32 to 4096). It's not pretty, but I could live with it.
    Can you think of any alternative to Vecs and arrays, like an array whose size can be set at runtime?
  • Anything else you'd like to be tried?

Next up: try stereo with SIMD :slight_smile:

2 Likes

You can try generic-array to reduce the amount of boilerplate code. You can also try GenericArray::from_slice on a sliced vector; maybe that will allow the compiler to perform the same optimizations.

1 Like

I fixed a bug in the C++ version which led to invalid results. The benchmark results and the source code on GitHub are now updated.

2 Likes

Thanks @newpavlov, I tried but didn't manage to figure out how to use GenericArray::from_slice.
Would you be able to show an example in the context of the benchmark code?

Have you tried comparing the performance of ArrayVec from the arrayvec crate to a standard array? That's what I would be most interested in. It's a stack-allocated variant of Vec that cannot grow beyond its fixed capacity, like a regular array, but with Vec's convenience.
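
A sketch of what that could look like here (assuming the crate's array-generic API with From and truncate; worth checking against the arrayvec docs):

extern crate arrayvec;
use arrayvec::ArrayVec;

// Fixed capacity (4096, like the arrays above) on the stack, but with
// a runtime length; it derefs to a slice, so &buf[..] can be fed
// straight into the existing iir_slice benchmarks.
fn make_buffer(len: usize) -> ArrayVec<[f64; 4096]> {
    let mut buf = ArrayVec::from([0.0f64; 4096]); // starts full
    buf.truncate(len); // shrink to the runtime size
    buf
}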

1 Like

Also, there is work being done to create a Rust variant of alloca, which would allow you to create variable-sized arrays whose size is determined at runtime. It should be very useful for what you're trying to do.

1 Like

I had something like this in mind:

// This function is generic over the "length" of the generic array,
// which is represented using types from the typenum crate.
fn iir<N>(input: &GenericArray<f64, N>, output: &mut GenericArray<f64, N>, bq: &mut Biquad)
    where N: ArrayLength<f64>
{
    // ... filter body ...
}

let input_data = ...;  // let it be &[f64] or &Vec<f64>
let output_data = ...; // let it be &mut [f64] or &mut Vec<f64>

assert_eq!(input_data.len(), output_data.len());
// This dispatch can probably be written more nicely with a macro.
match input_data.len() {
    32 => {
        let input: &GenericArray<f64, U32> = GenericArray::from_slice(input_data);
        let output: &mut GenericArray<f64, U32> = GenericArray::from_mut_slice(output_data);
        iir(input, output, bq);
    },
    64 => {
        let input: &GenericArray<f64, U64> = GenericArray::from_slice(input_data);
        let output: &mut GenericArray<f64, U64> = GenericArray::from_mut_slice(output_data);
        iir(input, output, bq);
    },
    // ... one arm per supported size ...
    _ => panic!("unsupported length"),
}

I haven't tested it, but I hope the main idea is clear. Of course, monomorphization will give you a lot of essentially identical functions, which can be quite bad from a cache perspective.

Thanks everyone for all the suggestions, I'll try everything suggested :slight_smile:
At the moment I'm working through educational material on Rust to understand it better.

Once confident enough, I'll see whether the findings here should be filed as a bug about Vec's performance, or whether it's normal to have to go through alloca, unsafe, or other generic-array-based workarounds.