Rust performance help (convolution)

Hi!
I want to implement convolution in Rust, but it is very slow (compared to C).

The Rust and C source code:

fn re_re_conv(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    let outlen = sample.len() - coeff.len() + 1;
    let mut out = Vec::with_capacity(outlen);
    for i in 0..outlen {
        let mut acc: f32 = 0.;
        for j in 0..coeff.len() {
            acc += sample[i + j] * coeff[j];
        }
        out.push(acc);
    }
    out
}

const SAMPLELEN: usize = 20_000_000;
const COEFFLEN: usize = 500;

fn main() {
    let mut sample = Vec::with_capacity(SAMPLELEN);
    let mut coeff = Vec::with_capacity(COEFFLEN);
    // ugly, but no extra time in this test
    unsafe {
        sample.set_len(SAMPLELEN);
        coeff.set_len(COEFFLEN);
    }

    let result = re_re_conv(&sample, &coeff);
    println!("{}  {}", result[0], result[SAMPLELEN - COEFFLEN]);
}
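(A note on the setup: calling set_len on an uninitialized Vec is undefined behavior in Rust, even for plain f32. A zero-initializing setup is safe and costs almost nothing here; the helper name below is illustrative:)

```rust
// Safe alternative to the unsafe set_len dance: zero-initialize.
// Reading uninitialized memory is UB, even for f32.
const SAMPLELEN: usize = 20_000_000;
const COEFFLEN: usize = 500;

fn make_inputs() -> (Vec<f32>, Vec<f32>) {
    // vec![0.0; n] uses zeroed allocation, so this is cheap.
    (vec![0.0f32; SAMPLELEN], vec![0.0f32; COEFFLEN])
}
```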

And the C code:

void re_re_conv(float *out, int *out_length, const float *sample, int samplelen, const float *coeff, int coefflen) {
    int outlen = samplelen - coefflen + 1;
    for (int i=0; i<outlen; i++) {
        float acc = 0.;
        for (int j=0; j<coefflen; j++) {
            acc += sample[i + j] * coeff[j];
        }
        out[i] = acc;
    }
    *out_length = outlen;
}

#include <stdio.h>
#include <stdlib.h>

#define SAMPLELEN (20*1000*1000)
#define COEFFLEN  500

int main() {
    float *sample = malloc(SAMPLELEN*sizeof(float));
    float *coeff = malloc(COEFFLEN*sizeof(float));

    int result_len;
    float *result = malloc(SAMPLELEN*sizeof(float));

    re_re_conv(result, &result_len, sample, SAMPLELEN, coeff, COEFFLEN);
    printf("%f %f", result[0], result[SAMPLELEN - COEFFLEN]);
}

Compile:

$ rustc -O test_rs.rs

$ gcc -Ofast -march=native test_c.c -o test_c

And the running time on Odroid-C2 (ARM64):

$ time ./test_rs --> 72 seconds (real)

$ time ./test_c --> 14.3 seconds (real), 5.0x faster

x86_64 time:

$ time ./test_rs --> 10.95 seconds (real)

$ time ./test_c --> 1.61 seconds (real), 6.8x faster

Can you help me write 4-5x faster convolution code in Rust?
Target: ARM64 & x86_64.

Update#1: clang

On Odroid-C2 (ARM64): (clang does not support '-march=native')

$ clang -Ofast conv_c.c -o conv_c_clang

$ time ./conv_c_clang --> 15.8 sec, 4.5x faster than Rust.

On x86_64:

$ clang -Ofast -march=native conv_c.c -o conv_c_clang

$ time ./conv_c_clang --> 1.51 sec (x86_64), 7x faster than Rust.


Usually the key to optimizing code like this is to convert your code to use iterators to eliminate bounds checking. You could start by using get_unchecked to confirm that bounds checking is what is causing the slowdown.

I'd also suggest building with cargo.
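To make the get_unchecked experiment concrete, a hypothetical variant of the original function might look like this (the function name and SAFETY reasoning are mine, not from the thread):

```rust
// Sketch: bypass bounds checks to test whether they are the bottleneck.
fn conv_unchecked(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    let outlen = sample.len() - coeff.len() + 1;
    let mut out = Vec::with_capacity(outlen);
    for i in 0..outlen {
        let mut acc = 0.0f32;
        for j in 0..coeff.len() {
            // SAFETY: i <= outlen - 1 and j <= coeff.len() - 1,
            // so i + j <= sample.len() - 1 and j < coeff.len().
            acc += unsafe { sample.get_unchecked(i + j) * coeff.get_unchecked(j) };
        }
        out.push(acc);
    }
    out
}
```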


There is no autovectorization:

       │                 acc += sample[i + j] * coeff[j];
  0,21 │110:   movss  -0xc(%rbx,%rcx,4),%xmm0
  6,32 │       movss  -0x8(%rbx,%rcx,4),%xmm1
  0,36 │       mulss  -0xc(%r12,%rcx,4),%xmm0
       │       mulss  -0x8(%r12,%rcx,4),%xmm1
 18,66 │       addss  %xmm3,%xmm0
       │       movss  -0x4(%rbx,%rcx,4),%xmm2
  0,19 │       mulss  -0x4(%r12,%rcx,4),%xmm2
 23,70 │       addss  %xmm0,%xmm1
  0,28 │       movss  (%rbx,%rcx,4),%xmm3
  0,02 │       mulss  (%r12,%rcx,4),%xmm3
 24,72 │       addss  %xmm1,%xmm2
 24,80 │       addss  %xmm2,%xmm3

Using get_unchecked doesn't help.

The following halved the runtime:

fn re_re_conv(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    let outlen = sample.len() - coeff.len() + 1;
    let mut out = Vec::with_capacity(outlen);
    for i in 0..outlen {
        let mut acc: f32 = 0.;
        for (x, chunk) in coeff.chunks(8).enumerate() {
            acc += chunk.iter().enumerate().map(|(j, &c)| unsafe { sample.get_unchecked(i + x * 8 + j) } * c).sum::<f32>();
        }
        out.push(acc);
    }
    out
}

there is still no autovectorization though.
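One pattern that sometimes fares better than chunks is chunks_exact, since each chunk has a compile-time-known length and the remainder is split off into a separate scalar pass. A hypothetical variant along those lines (name and structure are mine):

```rust
fn re_re_conv_chunks(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    let outlen = sample.len() - coeff.len() + 1;
    (0..outlen)
        .map(|i| {
            let window = &sample[i..i + coeff.len()];
            let mut w = window.chunks_exact(8);
            let mut c = coeff.chunks_exact(8);
            let mut acc = 0.0f32;
            for (ws, cs) in (&mut w).zip(&mut c) {
                // Each ws/cs pair is exactly 8 elements long, which the
                // optimizer can in principle map onto SIMD lanes.
                acc += ws.iter().zip(cs).map(|(a, b)| a * b).sum::<f32>();
            }
            // Scalar tail for the < 8 leftover elements.
            acc + w.remainder().iter().zip(c.remainder()).map(|(a, b)| a * b).sum::<f32>()
        })
        .collect()
}
```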

With the following code:

fn re_re_conv(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    sample
        .windows(coeff.len())
        .map(|window| {
            window
                .iter()
                .zip(coeff.iter())
                .map(|(sample, coeff)| sample * coeff)
                .sum::<f32>()
        })
        .collect()
}

I obtain the same time as the C code compiled with clang. It looks like in this case GCC is able to generate much better vectorized code than LLVM.

Once you write the best code you can, you can only hope that the underlying optimizer does its work...


Try adding RUSTFLAGS="-C target-cpu=native" to enable vector instructions.


Auto-vectorization cannot be performed because the floating-point operations are not associative, i.e. (x1 + x2) + (x3 + x4) != ((x1 + x2) + x3) + x4.
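A quick demonstration of why the grouping matters for f32 (the values here are chosen by me to make the rounding visible):

```rust
// Grouping f32 additions differently changes the result, which is why
// the compiler must preserve the written evaluation order by default.
fn sums() -> (f32, f32) {
    let xs = [1.0f32, 1.0e8, -1.0e8, 1.0];
    // Strict left-to-right order, as the source specifies:
    // 1 + 1e8 rounds to 1e8, then -1e8 gives 0, then +1 gives 1.
    let seq = ((xs[0] + xs[1]) + xs[2]) + xs[3];
    // Pairwise order, as a vectorized reduction would compute it:
    // (1 + 1e8) rounds to 1e8, (-1e8 + 1) rounds to -1e8, sum is 0.
    let pair = (xs[0] + xs[1]) + (xs[2] + xs[3]);
    (seq, pair)
}
```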

Fortunately, Rust provides intrinsics for "fast" floating-point operations (nightly compiler only), namely fadd_fast and fmul_fast. Note the result can change slightly.

In addition to that, it is helpful to use a constant-size array type [f32; COEFFLEN] when possible, especially when the number of elements is not large. The compiler can then unroll the inner loop entirely.
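For illustration, a sketch with the coefficient length fixed at compile time (the tiny COEFFLEN and the function names are mine, purely for the example):

```rust
// Fixing the coefficient length lets the compiler fully unroll the
// inner dot product. Kept small here only for illustration.
const COEFFLEN: usize = 4;

fn dot_fixed(window: &[f32; COEFFLEN], coeff: &[f32; COEFFLEN]) -> f32 {
    window.iter().zip(coeff).map(|(s, c)| s * c).sum()
}

fn conv_fixed(sample: &[f32], coeff: &[f32; COEFFLEN]) -> Vec<f32> {
    sample
        .windows(COEFFLEN)
        // Each window is exactly COEFFLEN long, so try_into cannot fail.
        .map(|w| dot_fixed(w.try_into().unwrap(), coeff))
        .collect()
}
```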

See the compiled assembly linked below for each version.


TL;DR Once again procedural style outperforms the functional style by a wide margin whilst being an order of magnitude easier to understand :slight_smile:

This little problem throws up so many questions:

  1. Could someone explain how dodomorandi and pcpthm's example code works?

I'm lost in a maze of twisty little .windows.map.iter.iter.zip.map.fold.collect, lambdas and unsafes, all alike. Having stared at it for half a day I get an inkling as to what is going on. I don't think I have enough IQ points to ever actually write anything like that.

  2. What are fmul_fast and fadd_fast actually doing?

The documentation says nothing useful about them.

  3. What have fmul_fast and fadd_fast got to do with vectorization?

I kind of get pcpthm's explanation about the non-associativity of floating point operations. But the original C code surely has the same problem and as far as I can tell is being vectorized and unrolled. And hence pcpthm's Rust and the original C run at the same speed. Why does Rust need fmul_fast and fadd_fast to be able to vectorize?

  4. When you feel the need for speed, why not just write nice simple procedural code for a significant boost?

Being unable to fathom pcpthm's example I decided to have a go myself and came up with this solution:

pub fn re_re_conv_zicog(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    let mut out: Vec<f32> = vec![0.0; sample.len() - coeff.len() + 1];
    
    for i in 0..out.len() {
        let mut acc: f32 = 0.0;
        let window = &sample[i..i + coeff.len()];
        for j in 0..window.len() {
            unsafe {
                acc = fadd_fast(acc, fmul_fast(window[j], coeff[j]));
            } 
        }
        out[i] = acc;
    }
    out
}

There, much easier to comprehend. Not only that but 20% faster. I ran all the solutions presented here through cargo bench:

$ cargo bench
...
test tests::bench_re_re_conv_bjorn3      ... bench:      11,668 ns/iter (+/- 493)
test tests::bench_re_re_conv_dodomorandi ... bench:       8,443 ns/iter (+/- 377)
test tests::bench_re_re_conv_pcpthm      ... bench:       5,140 ns/iter (+/- 367)
test tests::bench_re_re_conv_zicog       ... bench:       4,104 ns/iter (+/- 1,406)
test tests::bench_re_re_conv_zso         ... bench:      26,398 ns/iter (+/- 999)

test result: ok. 0 passed; 0 failed; 6 ignored; 6 measured; 0 filtered out

However I must say that when run on the full twenty million sample size on my PC things are almost neck and neck:

zicog version:

$ time target/release/convolution

real    0m1.408s
user    0m1.281s
sys     0m0.125s

pcpthm version:

$ time target/release/convolution

real    0m1.428s
user    0m1.297s
sys     0m0.125s

But C is still ahead a bit:

$ clang -Ofast convolution.c -o convolution
$ time ./convolution

real    0m1.378s
user    0m1.266s
sys     0m0.094s

I can try! Here is @dodomorandi's code:

fn re_re_conv(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    sample
        .windows(coeff.len())
        .map(|window| {
            window
                .iter()
                .zip(coeff.iter())
                .map(|(sample, coeff)| sample * coeff)
                .sum::<f32>()
        })
        .collect()
}

It uses iterators, but I'll explain it as operations that take a list and output a list. The sample.windows(coeff.len()) operation creates the following list:

let len = coeff.len();
[&sample[0..len+0], &sample[1..len+1], &sample[2..len+2], ... ]
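A quick, runnable check of what windows produces (the helper function here is just for demonstration):

```rust
/// Collects the overlapping windows of `sample`, mirroring the list
/// [&sample[0..len], &sample[1..len+1], ...] described above.
fn windows_of(sample: &[f32], len: usize) -> Vec<&[f32]> {
    sample.windows(len).collect()
}
```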

Note that the slices in the list above overlap. The map operation takes each slice in the list above, calls the provided closure with that slice as the value of window, and produces a new list using the return values of the closure. Let us take a look at what the closure does when given the slice &sample[k..len+k] for each integer k.

The window.iter() expression turns the slice &sample[k..len+k] into an iterator, so with our list analogy, it does nothing as the slice is already a list. The .zip(coeff.iter()) operation takes the two lists &sample[k..len + k] and coeff, pairing up each item in the lists (note that the two lists have the same length). This means that after zipping we now have the list:

[(sample[k], coeff[0]), (sample[k+1], coeff[1]), ..., (sample[k+len-1], coeff[len-1])]

The next .map operation takes the list above, and replaces each pair with its product:

[sample[k] * coeff[0], sample[k+1] * coeff[1], ..., sample[k+len-1] * coeff[len-1]]

Finally, the .sum call computes the sum of the list above.

sample[k] * coeff[0]
 + sample[k+1] * coeff[1]
 + ...
 + sample[k+len-1] * coeff[len-1]

Going back a bit, we are inside the .map(|window| ...) closure, and the final value in this closure is the sum described above. This means that each window from the first list is replaced with that sum, resulting in the list:

// nw is the number of windows
let nw = sample.len() - coeff.len() + 1;
[
    sample[0] * coeff[0] + sample[1] * coeff[1] + ... + sample[len-1] * coeff[len-1],
    sample[1] * coeff[0] + sample[2] * coeff[1] + ... + sample[1+len-1] * coeff[len-1],
    ...
    sample[nw-1] * coeff[0] + sample[nw] * coeff[1] + ... + sample[nw-1+len-1] * coeff[len-1],
]

which is exactly the formula for convolution. In short, the closure in .map(|window| { ... }) computes the dot product of window and coeff. It is perhaps easier to read like this:

// window and coeff must have the same length
fn dot_product(window: &[f32], coeff: &[f32]) -> f32 {
    let list_of_pairs = window.iter().zip(coeff.iter());
    
    let list_of_products = list_of_pairs.map(|(sample, coeff)| sample * coeff);
    
    list_of_products.sum::<f32>()
}

fn re_re_conv(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    sample
        .windows(coeff.len())
        .map(|window| dot_product(window, coeff))
        .collect()
}

They are a hint to the optimizer that it is allowed to optimize the multiplication/addition in a way that changes the behavior of the code, as long as those changes are only due to floating point being inaccurate.


A big thank you alice. That is more than I have a right to ask for. Your dot_product version certainly pulls things apart and lays them bare to see.

To show off my new-found understanding I added my interpretation as comments in the code and put in the fadd_fast and fmul_fast:

fn dot_product(window: &[f32], coeff: &[f32]) -> f32 {
    // Match up elements from windows vector with those from the coeff vector.
    let list_of_pairs = window.iter().zip(coeff.iter());

    // Multiply sample and coeff from each pair 
    let list_of_products =
        list_of_pairs.map(|(sample, coeff)| unsafe { fmul_fast(*sample, *coeff) });

    // Fold acts on each item in the list and accumulates a single result, in this case the sum.    
    list_of_products.fold(0f32, |acc, value| unsafe { fadd_fast(acc, value) })
}

fn re_re_conv_alice(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    sample
        // Get sequence of windows that "slide" over the sample data
        .windows(coeff.len())

        // Form the dot product of every window in the sequence
        .map(|window| dot_product(window, coeff))

        // Map produces an iterator so we have to assemble a vector from that.
        .collect()
}

On adding it to the bench I find it performs almost as well as the best so far:

test tests::bench_re_re_conv_alice       ... bench:       5,402 ns/iter (+/- 160)
test tests::bench_re_re_conv_bjorn3      ... bench:      11,851 ns/iter (+/- 433)
test tests::bench_re_re_conv_dodomorandi ... bench:       8,484 ns/iter (+/- 777)
test tests::bench_re_re_conv_pcpthm      ... bench:       5,158 ns/iter (+/- 234)
test tests::bench_re_re_conv_zicog       ... bench:       5,170 ns/iter (+/- 395)
test tests::bench_re_re_conv_zicog_safe  ... bench:       9,103 ns/iter (+/- 697)
test tests::bench_re_re_conv_zso         ... bench:      26,915 ns/iter (+/- 1,593)

Which brings me back to my long-standing question: Why would anyone want to wrap such a simple formula in such an obfuscated manner?

As it stands the problem stated and solutions presented in this thread are troublesome. I would be embarrassed to show any of it to my C/C++ using friends and colleagues.

They would be totally bewildered by the functional style. They would say that it complicates and obfuscates the simple calculations they have in mind. Of course programmers coming from other languages might be used to it, and perhaps anyone can soak it up eventually, but it's not a good place to start for hard-core C/C++ folks.

They would point and jeer and say Rust is terribly slow. Which it is in this case. Unless...

One can use "unsafe" operations to get things up to speed. But then one will hear the common refrain "If I have to sprinkle "unsafe" all over my code I might as well continue using C"

Ah, so when the docs say "Float addition that allows optimizations based on algebraic rules." it just means the order of operations may be changed around, which may give different results due to the nature of our floating-point maths?

I'm just wondering what C does about this. Does it keep the order of operations when optimizing or does it change things around potentially producing different results?

Finally, a bench marking oddity:

You might notice that in the results here with alice's function included my conv_zicog function shows 5,170 ns. Previously it was down at 4,104 ns.

Turns out that simply adding alice's code slows the existing code down by 20% !

This is repeatable. I have tried it a few times now.

You’re using clang -Ofast, which enables a variety of optimizations at the expense of accuracy and safety... I’m not sure if the unsafe intrinsics in nightly achieve the same compromise. Maybe not? It’d be interesting to compare with and without -Ofast (which is similar to gcc’s -ffast-math).


Oh boy, what?!

I had not paid much attention to that -Ofast. I just naively assumed it was like -O3. It most certainly is not!

$ clang -Ofast convolution.c -o convolution
$ time ./convolution
119.044846

real    0m1.353s
user    0m1.219s
sys     0m0.094s
$ clang -O3 convolution.c -o convolution
$ time ./convolution
119.044891

real    0m8.126s
user    0m8.000s
sys     0m0.125s

Note the slight change in output there.

Meanwhile for GCC:

$ gcc -O3 -ffast-math convolution.c -o convolution
$ time ./convolution
119.044891

real    0m2.291s
user    0m2.188s
sys     0m0.094s
$ gcc -O3 convolution.c -o convolution
$ time ./convolution
119.044891

real    0m2.275s
user    0m2.094s
sys     0m0.172s

-ffast-math makes no difference.

And the current fastest Rust effort:

$ time target/release/convolution
119.04485321044921875000000000000000

real    0m1.461s
user    0m1.297s
sys     0m0.156s

So it turns out that the main point of this thread is about how to get Rust to behave like clang with the -Ofast option. This had not occurred to me.

As such I take back what I said about the use of "unsafe" above.

It's probably time for me to find a new bone to chew on, but I just found a magical crate that simplifies this whole problem and is as fast as a fast thing can be: the fast_floats crate.

With that in place I almost start to like the functional style. For example alice's solution boils down to this:

use fast_floats::*;

pub fn dot_product(xs: &[f32], ys: &[f32]) -> f32 {
    xs.iter()
        .zip(ys)
        .fold(Fast(0.), | acc, (&x, &y) | acc + Fast(x) * Fast(y))
        .get()
}

pub fn re_re_conv_alice(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    sample
        .windows(coeff.len())
        .map(|window| dot_product(window, coeff))
        .collect()
}

Which is short and sweet. It pretty much spells out what you want to do as clearly as possible. Pure poetry compared to the line noise of other functional-style solutions we have seen here. And significantly there is no "unsafe" to be seen. It's even hard to find an "unsafe" in the fast_floats crate.

However...

When you feel the need for speed go procedural style:

pub fn re_re_conv_zicog_fast(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    let mut out: Vec<f32> = vec![0.0; sample.len() - coeff.len() + 1];

    for i in 0..out.len() {
        let mut acc: FF32 = Fast(0.0);
        let window = &sample[i..i + coeff.len()];
        for j in 0..window.len() {
            acc += Fast(window[j]) * Fast(coeff[j]);
        }
        out[i] = acc.get();
    }
    out
}

The benchmark results:

test tests::bench_re_re_conv_alice       ... bench:       5,171 ns/iter (+/- 159)
test tests::bench_re_re_conv_bjorn3      ... bench:      11,773 ns/iter (+/- 364)
test tests::bench_re_re_conv_dodomorandi ... bench:       8,519 ns/iter (+/- 775)
test tests::bench_re_re_conv_pcpthm      ... bench:       5,113 ns/iter (+/- 166)
test tests::bench_re_re_conv_zicog       ... bench:       4,339 ns/iter (+/- 150)
test tests::bench_re_re_conv_zicog_fast  ... bench:       4,324 ns/iter (+/- 130)
test tests::bench_re_re_conv_zicog_safe  ... bench:       7,923 ns/iter (+/- 296)
test tests::bench_re_re_conv_zso         ... bench:      26,877 ns/iter (+/- 820)

And for the 20 million sample set on my PC:

$ clang -Ofast convolution.c -o convolution
$ time ./convolution
119.044846

real    0m1.339s
user    0m1.266s
sys     0m0.078s
$ gcc -Ofast convolution.c -o convolution
$ time ./convolution
119.044891

real    0m2.305s
user    0m2.141s
sys     0m0.141s
$ cargo build --release
    Finished release [optimized] target(s) in 0.02s
$ time ./target/release/convolution
119.04485321044921875000000000000000

real    0m1.461s
user    0m1.313s
sys     0m0.141s

@Zso pointed out that the code doesn't always vectorise, and even if it does, we cannot enable it in some places and not in others (in either C++ or Rust, the -Ofast or -ffast-math flag is a blanket all-or-nothing, which usually is not what you want in an application).

For almost any obvious mathematical problem, my go-to solution is to recode this in Simdeez for Rust or Agner Fog's Vector Class Library in C++. You can explicitly vectorise and not leave it up to chance. Rust is all about explicitness.

Interesting, would you like to volunteer a solution written using Simdeez?


Hi @ZiCog, I thought I might give it a go. It's not as clear in terms of code but it does appear to produce the same values (just about)

#[cfg(test)]
mod tests {
use simdeez::avx2::*;
use simdeez::scalar::*;
use simdeez::sse2::*;
use simdeez::sse41::*;
// Original version
fn re_re_conv(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    let outlen = sample.len() - coeff.len() + 1;
    let mut out = Vec::with_capacity(outlen);
    for i in 0..outlen {
        let mut acc: f32 = 0.;
        for j in 0..coeff.len() {
            acc += sample[i + j] * coeff[j];
            // println!("{} {} = {}", sample[i + j], coeff[j], acc);
        }
        out.push(acc);
    }
    out
}

simd_compiletime_generate!(
    pub fn re_re_conv_f32(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
        let len = sample.len() - coeff.len() + 1;
        let mut result: Vec<f32> = Vec::with_capacity(len);
        for i in (0..len).step_by(S::VF32_WIDTH) {
            let mut acc = S::set1_ps(0.0);
            for j in (0..coeff.len()).step_by(S::VF32_WIDTH) {
                let s = S::loadu_ps(&sample[i + j]);
                let c = S::loadu_ps(&coeff[j]);
                acc = S::fmadd_ps(s, c, acc);
            }
            let sum = S::horizontal_add_ps(acc);
            result.push(sum);
        }
        // Remaining values
        for i in (len + 1) - len % S::VF32_WIDTH..len {
            let sum = (0..coeff.len()).map(|j| sample[i + j] * coeff[j]).
            result.push(sum);
            }
        }
        result
    }
);
const SAMPLELEN: usize = 20_000_000;
const COEFFLEN: usize = 500;

#[test]
fn test_convolution() {
    use std::time::*;
    let sample = (0..SAMPLELEN).map(|v| v as f32 * 0.01).collect::<Vec<_>>(); //vec![0.0f32; SAMPLELEN];
    let coeff = (0..COEFFLEN).map(|v| v as f32 * 0.01).collect::<Vec<_>>(); //vec![0.0f32; COEFFLEN];

    let now = Instant::now();
    let result = re_re_conv(&sample, &coeff);
    println!("{}  {}", result[0], result[SAMPLELEN - COEFFLEN]);
    println!("Duration {}", now.elapsed().as_millis());

    let now = Instant::now();
    let result = re_re_conv_f32_compiletime(&sample, &coeff);
    println!("{}  {}", result[0], result.last().unwrap());
    println!("Duration {}", now.elapsed().as_millis());
    }
}

Running this prints

running 1 test
4154.1743  249497950
Duration 11701
4154.175   249497900
Duration 1032
test tests::test_convolution ... ok

Not quite the same values but seems a lot faster. Maybe there's a bug if someone can see it?

Wow, great!

Unfortunately I cannot get it to compile:

$ cargo new rusty_ron --lib
$ cd rusty_ron/
$ vim src/lib.rs
... copy and paste code here... 

$ cat Cargo.toml
[package]
name = "rusty_ron"
version = "0.1.0"
authors = ["zicog <zicog@example.com>"]
edition = "2018"
[dependencies]
simdeez = "1.0.6"
$ cargo test
...
error: unexpected closing delimiter: `)`
  --> src/lib.rs:45:1
   |
39 |             let sum = (0..coeff.len()).map(|j| sample[i + j] * coeff[j]).
   |                                    -- block is empty, you might have not meant to close it
...
45 | );
   | ^ unexpected closing delimiter

error: mismatched closing delimiter: `}`
  --> src/lib.rs:44:5
   |
23 | simd_compiletime_generate!(
   |                           - unclosed delimiter
...
44 |     }
   |     ^ mismatched closing delimiter

error: aborting due to 2 previous errors

error: could not compile `rusty_ron`.

To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...

@rusty_ron I managed to get the simdeez code to at least compile and run after much head scratching, but I don't get correct results out of it:

    use simdeez::avx2::*;
    use simdeez::scalar::*;
    use simdeez::sse2::*;
    use simdeez::sse41::*;

    simd_compiletime_generate!(
        pub fn re_re_conv_f32(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
            let len = sample.len() - coeff.len() + 1;
            let mut result: Vec<f32> = Vec::with_capacity(len);

            for i in (0..len).step_by(S::VF32_WIDTH) {
                let mut acc = S::set1_ps(0.0);

                for j in (0..coeff.len()).step_by(S::VF32_WIDTH) {
                    let s = S::loadu_ps(&sample[i + j]);
                    let c = S::loadu_ps(&coeff[j]);
                    acc = S::fmadd_ps(s, c, acc);
                }

                let sum = S::horizontal_add_ps(acc);
                result.push(sum);
            }

            // Remaining values
            for i in (len + 1) - len % S::VF32_WIDTH..len {
                   let sum = (0..coeff.len()).map(|j| sample[i + j] * coeff[j]).sum();
                   result.push(sum);
            }
            result
        }
    );

Problem is the "for i" loop does not go around enough times because it is stepping by S::VF32_WIDTH every time and so produces a result vector that is far too short.

If I change the loop to this: "for i in (0..len) {..." then I do get correct results. But the performance is terrible:

test tests::bench_re_re_conv_alice       ... bench:       6,085 ns/iter (+/- 296)
test tests::bench_re_re_conv_bjorn3      ... bench:      11,827 ns/iter (+/- 277)
test tests::bench_re_re_conv_dodomorandi ... bench:       9,332 ns/iter (+/- 187)
test tests::bench_re_re_conv_pcpthm      ... bench:       6,060 ns/iter (+/- 286)
test tests::bench_re_re_conv_rusty_ron   ... bench:      12,840 ns/iter (+/- 293)
test tests::bench_re_re_conv_zicog       ... bench:       5,076 ns/iter (+/- 317)
test tests::bench_re_re_conv_zicog_fast  ... bench:       5,034 ns/iter (+/- 453)
test tests::bench_re_re_conv_zicog_safe  ... bench:       9,150 ns/iter (+/- 263)
test tests::bench_re_re_conv_zso         ... bench:      27,308 ns/iter (+/- 834)

Any ideas?
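One way to keep the output length correct is to step the outer loop by 1 and vectorize only the inner loop. A sketch in plain Rust without simdeez (the function name and lane count are mine), using fixed-size lane accumulators the optimizer can map onto SIMD registers:

```rust
const LANES: usize = 8;

pub fn re_re_conv_lanes(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    let outlen = sample.len() - coeff.len() + 1;
    let mut out = Vec::with_capacity(outlen);
    // Outer loop steps by 1, so we get exactly `outlen` results.
    for i in 0..outlen {
        let window = &sample[i..i + coeff.len()];
        let mut acc = [0.0f32; LANES];
        let mut w = window.chunks_exact(LANES);
        let mut c = coeff.chunks_exact(LANES);
        for (ws, cs) in (&mut w).zip(&mut c) {
            // Accumulate per lane; each iteration handles LANES products.
            for l in 0..LANES {
                acc[l] += ws[l] * cs[l];
            }
        }
        // Scalar tail for the < LANES leftover elements.
        let tail: f32 = w
            .remainder()
            .iter()
            .zip(c.remainder())
            .map(|(a, b)| a * b)
            .sum();
        // Horizontal reduction of the lane accumulators.
        out.push(acc.iter().sum::<f32>() + tail);
    }
    out
}
```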

rusty_ron's code does not run on ARM, thanks to the lack of NEON support in simdeez. But as our OP posted some ARM timings I thought I'd run all these convolutions on a Jetson Nano:

test tests::bench_re_re_conv_alice       ... bench:      23,919 ns/iter (+/- 123)
test tests::bench_re_re_conv_bjorn3      ... bench:      48,014 ns/iter (+/- 124)
test tests::bench_re_re_conv_dodomorandi ... bench:      57,550 ns/iter (+/- 810)
test tests::bench_re_re_conv_pcpthm      ... bench:      24,320 ns/iter (+/- 97)
test tests::bench_re_re_conv_zicog       ... bench:      23,473 ns/iter (+/- 92)
test tests::bench_re_re_conv_zicog_fast  ... bench:      23,438 ns/iter (+/- 77)
test tests::bench_re_re_conv_zicog_safe  ... bench:      57,850 ns/iter (+/- 226)
test tests::bench_re_re_conv_zso         ... bench:      62,344 ns/iter (+/- 184)

We are a little slower than clang on the ARM:

$ time target/release/convolution
    119.044853

real    0m6.742s
user    0m6.532s
sys     0m0.120s
dlinano@jetson-nano:~/convolution$ time ./convolution
119.044853

real    0m5.739s
user    0m5.596s
sys     0m0.112s