Rust performance help (convolution)

@rusty_ron I managed to get the simdeez version to at least compile and run after much head-scratching, but I don't get correct results out of it:

    use simdeez::avx2::*;
    use simdeez::scalar::*;
    use simdeez::sse2::*;
    use simdeez::sse41::*;

    simd_compiletime_generate!(
        pub fn re_re_conv_f32(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
            let len = sample.len() - coeff.len() + 1;
            let mut result: Vec<f32> = Vec::with_capacity(len);

            for i in (0..len).step_by(S::VF32_WIDTH) {
                let mut acc = S::set1_ps(0.0);

                for j in (0..coeff.len()).step_by(S::VF32_WIDTH) {
                    let s = S::loadu_ps(&sample[i + j]);
                    let c = S::loadu_ps(&coeff[j]);
                    acc = S::fmadd_ps(s, c, acc);
                }

                let sum = S::horizontal_add_ps(acc);
                result.push(sum);
            }

            // Remaining values
            for i in (len + 1) - len % S::VF32_WIDTH..len {
                let sum = (0..coeff.len()).map(|j| sample[i + j] * coeff[j]).sum();
                result.push(sum);
            }
            result
        }
    );

The problem is that the "for i" loop does not iterate enough times: it steps by S::VF32_WIDTH each time but only pushes one result per iteration, so the result vector it produces is far too short.
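
For reference, a plain scalar version (my own sketch, not code from this thread) shows the expected output shape: one value per window position, i.e. len values in total:

    // Hypothetical scalar reference, useful only for checking the SIMD results against.
    fn re_re_conv_scalar(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
        let len = sample.len() - coeff.len() + 1;
        (0..len)
            .map(|i| (0..coeff.len()).map(|j| sample[i + j] * coeff[j]).sum())
            .collect()
    }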

If I change the loop to "for i in 0..len {" then I do get correct results, but the performance is terrible:

test tests::bench_re_re_conv_alice       ... bench:       6,085 ns/iter (+/- 296)
test tests::bench_re_re_conv_bjorn3      ... bench:      11,827 ns/iter (+/- 277)
test tests::bench_re_re_conv_dodomorandi ... bench:       9,332 ns/iter (+/- 187)
test tests::bench_re_re_conv_pcpthm      ... bench:       6,060 ns/iter (+/- 286)
test tests::bench_re_re_conv_rusty_ron   ... bench:      12,840 ns/iter (+/- 293)
test tests::bench_re_re_conv_zicog       ... bench:       5,076 ns/iter (+/- 317)
test tests::bench_re_re_conv_zicog_fast  ... bench:       5,034 ns/iter (+/- 453)
test tests::bench_re_re_conv_zicog_safe  ... bench:       9,150 ns/iter (+/- 263)
test tests::bench_re_re_conv_zso         ... bench:      27,308 ns/iter (+/- 834)

Any ideas?

rusty_ron's version does not run on ARM thanks to the lack of NEON support in simdeez. But as our OP posted some ARM timings I thought I'd run all these convolutions on a Jetson Nano:

test tests::bench_re_re_conv_alice       ... bench:      23,919 ns/iter (+/- 123)
test tests::bench_re_re_conv_bjorn3      ... bench:      48,014 ns/iter (+/- 124)
test tests::bench_re_re_conv_dodomorandi ... bench:      57,550 ns/iter (+/- 810)
test tests::bench_re_re_conv_pcpthm      ... bench:      24,320 ns/iter (+/- 97)
test tests::bench_re_re_conv_zicog       ... bench:      23,473 ns/iter (+/- 92)
test tests::bench_re_re_conv_zicog_fast  ... bench:      23,438 ns/iter (+/- 77)
test tests::bench_re_re_conv_zicog_safe  ... bench:      57,850 ns/iter (+/- 226)
test tests::bench_re_re_conv_zso         ... bench:      62,344 ns/iter (+/- 184)

We are a little slower than clang on the ARM:

$ time target/release/convolution
    119.044853

real    0m6.742s
user    0m6.532s
sys     0m0.120s
dlinano@jetson-nano:~/convolution$ time ./convolution
119.044853

real    0m5.739s
user    0m5.596s
sys     0m0.112s

You're right about the vector lengths. Fixed that. I get around 3x the performance by using SIMD on the inner loop. The next thing I would try is ISPC to see if that gives a further improvement, but that definitely won't help on ARM.

simd_compiletime_generate!(
    pub fn re_re_conv_f32(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
        let len = sample.len() - coeff.len() + 1;
        let mut result: Vec<f32> = Vec::with_capacity(len);
        for i in 0..len {
            let mut acc = S::set1_ps(0.0);
            if coeff.len() % S::VF32_WIDTH == 0 {
                for j in (0..coeff.len()).step_by(S::VF32_WIDTH) {
                    let s = S::loadu_ps(&sample[i + j]);
                    let c = S::loadu_ps(&coeff[j]);
                    acc = S::fmadd_ps(s, c, acc);
                }
                let sum = S::horizontal_add_ps(acc);
                result.push(sum);
            } else {
                // Scalar fallback: the kernel length is not a multiple of the vector
                // width, so compute this output element without SIMD.
                let sum = (0..coeff.len()).map(|j| sample[i + j] * coeff[j]).sum();
                result.push(sum);
            }
        }
        result
    }
);


#[test]
fn test_convolution() {
    use std::time::*;
    let sample = (0..SAMPLELEN).map(|v| v as f32 * 0.01).collect::<Vec<_>>(); //vec![0.0f32; SAMPLELEN];
    let coeff = (0..COEFFLEN).map(|v| v as f32 * 0.01).collect::<Vec<_>>(); //vec![0.0f32; COEFFLEN];

    let now = Instant::now();
    let result = re_re_conv(&sample, &coeff);
    println!("Duration {}", now.elapsed().as_millis());
    println!("{}  {}", result[0], result[SAMPLELEN - COEFFLEN]);

    let now = Instant::now();
    let result2 = re_re_conv_f32_compiletime(&sample, &coeff);
    assert_eq!(result.len(), result2.len());
    println!("Duration {}", now.elapsed().as_millis());
    println!("{}  {}", result2[0], result2[SAMPLELEN - COEFFLEN]);
}

The lengths of the arrays are the same.

Running this:

Duration 13480
4154.1743  249497950
Duration 4810
4154.175   249497900
test tests::test_convolution ... ok

Ah thanks. That checks out nicely.

Unfortunately, while it's about 3.7 times faster than zso's original, it is 5 times slower than the fastest solutions we have so far:

Using your timing code on a 20 million sample set and a kernel of 500, both filled with random numbers between 0.0 and 1.0:

zicog: Duration 1278
119.04485  127.62263
rusty-ron: Duration 6489
119.04483  127.62262
zso: Duration 23875
119.04489  127.622635
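
For reference, the random test data mentioned above could be generated with something like this (a hypothetical setup using the rand crate; the exact code isn't shown in the thread):

use rand::Rng; // assumes the rand crate in Cargo.toml

// Hypothetical test-data setup: 20 million samples and a 500-tap kernel,
// both filled with uniform random values in [0.0, 1.0).
fn make_test_data() -> (Vec<f32>, Vec<f32>) {
    let mut rng = rand::thread_rng();
    let sample = (0..20_000_000).map(|_| rng.gen::<f32>()).collect();
    let coeff = (0..500).map(|_| rng.gen::<f32>()).collect();
    (sample, coeff)
}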

And benchmarking on a much smaller job:

test tests::bench_re_re_conv_alice       ... bench:       5,981 ns/iter (+/- 203)
test tests::bench_re_re_conv_bjorn3      ... bench:      11,758 ns/iter (+/- 732)
test tests::bench_re_re_conv_dodomorandi ... bench:       9,781 ns/iter (+/- 209)
test tests::bench_re_re_conv_pcpthm      ... bench:       6,063 ns/iter (+/- 96)
test tests::bench_re_re_conv_rusty_ron   ... bench:      13,246 ns/iter (+/- 376)
test tests::bench_re_re_conv_zicog       ... bench:       5,167 ns/iter (+/- 201)
test tests::bench_re_re_conv_zicog_fast  ... bench:       5,176 ns/iter (+/- 169)
test tests::bench_re_re_conv_zicog_safe  ... bench:       9,040 ns/iter (+/- 231)
test tests::bench_re_re_conv_zso         ... bench:      27,234 ns/iter (+/- 1,246)

Great benches!
Here is a code snippet for benchmarking C from Rust (FFI):

Cargo.toml:

[build-dependencies]
cc = "1.0.*"

build.rs:

fn main() {
    cc::Build::new()
        .file("src/conv.c")
        .flag("-Ofast")
        .flag("-march=native")
        .flag("-funroll-all-loops")
        .compile("libconv.a");
}

src/conv.c:

void c_re_re_conv(float *out, int *out_length, const float *sample, int samplelen, const float *coeff, int coefflen) {
    int outlen = samplelen - coefflen + 1;
    for (int i=0; i<outlen; i++) {
        float acc = 0.;
        for (int j=0; j<coefflen; j++) {
            acc += sample[i + j] * coeff[j];
        }
        out[i] = acc;
    }
    *out_length = outlen;
}

And the Rust code:

#[link(name = "conv", kind = "static")]
extern "C" {
    fn c_re_re_conv(output: *mut f32, outlen: *mut i32, sample: *const f32, samplelen: i32, coeff: *const f32, coefflen: i32);
}

fn re_re_conv_c(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    let mut output: Vec<f32> = Vec::with_capacity(sample.len());
    let mut outlen: i32 = 0;
    unsafe {
        c_re_re_conv(output.as_mut_ptr(), &mut outlen, sample.as_ptr(), sample.len() as i32, coeff.as_ptr(), coeff.len() as i32);
        output.set_len(outlen as usize);
    }
    output
}
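
A minimal usage sketch for the wrapper above (my own example, assuming re_re_conv_c is in scope). Note that set_len is only sound here because the buffer's capacity, sample.len(), is at least as large as the outlen reported back by the C code:

fn main() {
    // Small deterministic test arrays, in the same style as the test code earlier in the thread.
    let sample: Vec<f32> = (0..1_000).map(|v| v as f32 * 0.01).collect();
    let coeff: Vec<f32> = (0..16).map(|v| v as f32 * 0.01).collect();

    let out = re_re_conv_c(&sample, &coeff);
    assert_eq!(out.len(), sample.len() - coeff.len() + 1);
    println!("{}  {}", out[0], out[out.len() - 1]);
}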

@Zso That is a nice FFI example. Here are the results when it is integrated into my benchmark:

test tests::bench_re_re_conv_alice       ... bench:       5,044 ns/iter (+/- 448)
test tests::bench_re_re_conv_bjorn3      ... bench:      11,632 ns/iter (+/- 438)
test tests::bench_re_re_conv_dodomorandi ... bench:       8,493 ns/iter (+/- 796)
test tests::bench_re_re_conv_pcpthm      ... bench:       4,869 ns/iter (+/- 383)
test tests::bench_re_re_conv_rusty_ron   ... bench:      13,122 ns/iter (+/- 1,001)
test tests::bench_re_re_conv_zicog       ... bench:       4,300 ns/iter (+/- 237)
test tests::bench_re_re_conv_zicog_fast  ... bench:       4,298 ns/iter (+/- 132)
test tests::bench_re_re_conv_zicog_safe  ... bench:       7,972 ns/iter (+/- 240)
test tests::bench_re_re_conv_zso         ... bench:      26,381 ns/iter (+/- 833)
test tests::bench_re_re_conv_zso_ffi     ... bench:       4,577 ns/iter (+/- 156)

Interestingly Rust comes out faster than C at the end of the day!

This is all a bit pedestrian. When you really feel the need for speed you need to get all your cores on the job.

Enter rayon. If we take alice's pure poetry of a functional style solution above and add only four characters we get this:

    pub fn convolution_parallel(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
        sample
            .par_windows(coeff.len())
            .map(|window| dot_product(window, coeff))
            .collect()
    }
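
The dot_product helper and the serial version it was adapted from aren't quoted above, but presumably look roughly like this (my reconstruction); the parallel version differs only in windows vs. par_windows, which needs use rayon::prelude::*; and the rayon crate in Cargo.toml:

    use rayon::prelude::*; // brings par_windows into scope for the parallel version

    // Hypothetical helper assumed by the snippet above.
    fn dot_product(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    }

    // Serial version for comparison: identical apart from windows vs. par_windows.
    pub fn convolution_serial(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
        sample
            .windows(coeff.len())
            .map(|window| dot_product(window, coeff))
            .collect()
    }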

This parallel version performs much better on the 20 million sample set:

zso - c:          Duration 1335ms
119.04485  127.62263
zicog:            Duration 1270ms
119.04485  127.62263
rusty-ron:        Duration 6431ms
119.04483  127.62262
zso:              Duration 23753ms
119.04489  127.622635
alice - serial:   Duration 1330ms
119.04485  127.62263
alice - parallel: Duration 340ms
119.04485  127.62263

This is on my old 4-core x86-64 box. An almost perfect scaling of performance with the number of cores.

70 times faster than Zso's original C code!

Not bad.

By way of completeness, and as Zso asked for a solution that works on ARM (aarch64), here are the results on my Jetson Nano:

The original C code:

$ clang -Ofast  convolution.c -o convolution
$ time ./convolution
119.044853

real    0m5.629s
user    0m5.584s
sys     0m0.028s

A selection of the Rust solutions discussed in this thread:

zso - c:          Duration 7301ms
119.04483  127.62262
zicog:            Duration 7049ms
119.04485  127.62263
zso:              Duration 34187ms
119.04489  127.622635
alice - serial:   Duration 7586ms
119.04485  127.62263
alice - parallel: Duration 2578ms
119.04485  127.62263

For some reason the "zso - c" code running under Rust/FFI is not as quick as running the compiled C directly. Perhaps we can put this down to different clang versions? Thinking about it... actually it might be because my test initializes the sample and coeff arrays with random numbers between 0.0 and 1.0.

My "naive" procedural Rust solution out runs the original C nicely :slight_smile:

Sadly, going parallel with rayon on ARM does not scale as well as on x86_64: a speed-up of only about 14 over four cores rather than 70 :frowning:

I have noticed before that going parallel on ARM does not scale as well as on x86. Is that a Rust thing or a Rayon thing? Is there anything we can do about that?
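
One thing that might be worth checking (just a guess on my part, not something we established here) is how many worker threads Rayon actually spins up on the ARM board. It defaults to the number of logical CPUs, but it can be pinned explicitly:

    use rayon::ThreadPoolBuilder;

    fn main() {
        // Pin Rayon's global pool to an explicit thread count. This must run before
        // any parallel work; by default Rayon uses the number of logical CPUs.
        ThreadPoolBuilder::new()
            .num_threads(4)
            .build_global()
            .expect("global thread pool already initialized");

        // Confirm how many worker threads the pool ended up with.
        println!("rayon threads: {}", rayon::current_num_threads());
    }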

Given that alice's solution was such poetry, performed well, and could be parallelized with such amazing ease, I'm finally starting to be sold on the functional programming style :slight_smile:

That's huge! :slight_smile:

I hope I can keep up with the latest programming trends in the next couple of decades as well as you have here - but I doubt it!
Good effort, and I hope you enjoy a lot more functional Rust code parallelized with Rayon. It almost seems too good to be true (when it compiles :smiley:)

I'm glad you liked it :slight_smile:

Yes, it is huge. It has taken a whole year of Rusting and discussion here for me to start to warm up to the functional style. Thanks to alice's persistence, and for showing it can be done in a concise, readable manner, which nobody else managed to do.

Keep at it. If I can do it I'm pretty sure anyone can.

I certainly like it.

I just wonder why all the other proponents of the functional style who have popped up with suggestions on my threads could not produce such elegant solutions. They were not good at selling the idea with their verbose contortions.

The killer feature is of course practical utility, never mind programming style, aesthetics and syntactical preferences. Being able to get all that multi-core performance so easily is amazing.

Previously I might have attempted such things in C, parallelizing "for" loops with OpenMP, say, which is kind of messy.

This reminds me of the quote...

I didn't have time to write a short letter, so I wrote a long one instead.

Sometimes it's hard to eloquently phrase what you're trying to do using functional programming, so it turns into an ugly mass of chained ".windows.map.iter.iter.zip.map.fold.collect", lambdas and unsafes.

Procedural code has similar issues, where you'll throw more conditions at a problem or do funny contortions with index variables or mutation of temporaries, but remember that you've been writing procedural code for decades and have learned to read the underlying intent or spot better ways of doing things.

Trying out new paradigms is always mind-bending stuff. You get stuck in the Blub paradox thinking it's got all these unnecessary weird constructs, and then it starts to click and you realise that certain problems lend themselves well to a particular way of thinking/coding... At least that's what I felt when learning functional programming (Haskell) or traditional OO (C#), anyway.

I don't buy the "blub paradox" thing.

I get the idea that new, improved languages can have features that people have never seen, and hence they don't see the utility of them and dismiss it all.

On the other hand I'm not convinced it's a given that all new languages and/or their new features are necessarily an improvement. See Java.

If anyone is interested in seeing all the solutions to our OP's question proposed here, put together and benchmarked, I have put up a repository for it here: https://github.com/ZiCog/rust_convolution

If anyone has any suggestions, for performance or clarity, that would be great.

Thank you for the GitHub repository. I sent some pull requests. The results:

AArch64 (odroid-c2):

 zso::convolution:             Duration 101840ms
 122.28877  127.02979
 zso::convolution_ffi:         Duration 13666ms
 122.28871  127.029785
 zso::convolution_ffi_vreg:    Duration 7481ms   <-- register optimized SSE/NEON
 122.28871  127.029785
 bjorn3::convolution:          Duration 82392ms
 122.2887  127.029785
 dodomorandi::convolution:     Duration 94015ms
 122.28877  127.02979
 pcpthm::convolution:          Duration 17688ms
 122.28871  127.029785
 zicog::convolution:           Duration 15706ms
 122.28871  127.029785
 zicog::convolution_fast:      Duration 15705ms
 122.28871  127.029785
 zicog::convolution_safe:      Duration 70595ms
 122.28877  127.02979
 alice::convolution_serial:    Duration 17688ms
 122.28871  127.029785
 alice::convolution_parallel:  Duration 4571ms
 122.28871  127.029785

x86_64 (i5-3337u):

zso::convolution:             Duration 33089ms
122.28877  127.02979
zso::convolution_ffi:         Duration 1610ms
122.28871  127.029785
zso::convolution_ffi_vreg:    Duration 1187ms   <-- register optimized SSE/NEON
122.28871  127.029785
zso::convolution_ffi_avx:     Duration 1065ms   <-- register optimized AVX
122.28871  127.02979
bjorn3::convolution:          Duration 8472ms
122.2887  127.029785
dodomorandi::convolution:     Duration 11011ms
122.28877  127.02979
pcpthm::convolution:          Duration 1593ms
122.28871  127.029785
zicog::convolution:           Duration 1581ms
122.28871  127.029785
zicog::convolution_fast:      Duration 1569ms
122.28871  127.029785
zicog::convolution_safe:      Duration 11010ms
122.28877  127.02979
alice::convolution_serial:    Duration 1598ms
122.28871  127.029785
alice::convolution_parallel:  Duration 762ms
122.28871  127.029785

Thanks for the interesting pull requests. I have some results as well:

Nvidia Jetson Nano:

zso::convolution:             Duration 34442ms
122.28877  127.02979
zso::convolution_ffi:         Duration 7292ms
122.28871  127.029785
zso::convolution_ffi_vreg:    Duration 2793ms
122.28871  127.029785
bjorn3::convolution:          Duration 30239ms
122.2887  127.029785
dodomorandi::convolution:     Duration 34301ms
122.28877  127.02979
pcpthm::convolution:          Duration 7375ms
122.28871  127.029785
zicog::convolution:           Duration 6944ms
122.28871  127.029785
zicog::convolution_fast:      Duration 6945ms
122.28871  127.029785
zicog::convolution_safe:      Duration 34321ms
122.28877  127.02979
alice::convolution_serial:    Duration 7375ms
122.28871  127.029785
alice::convolution_parallel:  Duration 1904ms
122.28871  127.029785

My x86_64 PC (Intel Core i7-2600K)

zso::convolution:             Duration 24407ms
122.28877  127.02979
zso::convolution_ffi:         Duration 1328ms
122.28871  127.02979
zso::convolution_ffi_vreg:    Duration 886ms
122.28871  127.029785
zso::convolution_ffi_avx:     Duration 1033ms
122.28871  127.02979
bjorn3::convolution:          Duration 6778ms
122.2887  127.029785
dodomorandi::convolution:     Duration 8261ms
122.28877  127.02979
pcpthm::convolution:          Duration 1284ms
122.28871  127.029785
zicog::convolution:           Duration 1268ms
122.28871  127.029785
zicog::convolution_fast:      Duration 1263ms
122.28871  127.029785
zicog::convolution_safe:      Duration 8240ms
122.28877  127.02979
alice::convolution_serial:    Duration 1284ms
122.28871  127.029785
alice::convolution_parallel:  Duration 335ms
122.28871  127.029785

These numbers are all over the place. I think we need some nice graphic, a bar chart, to see what is going on.

Why is the multi-core threading so bad on ARM?

My experience with ARM processors:

  • The ALU is fast ... NEON is faster, and AArch64 has double the number of NEON registers compared with classic 32-bit ARM. Out-of-order cores have at least triple the number of ALUs working in the background.
  • The cache is OK, but if you need data from/to RAM it is much slower (cache-to-RAM time ratio) than on an Intel CPU. For example, the biggest improvement from the Cortex-A53 to the Cortex-A55 is the speed of RAM access. See: https://bit.ly/31UqKtc
  • Heat ... the cooling is OK for an average application (the RPi 4 needs an external heat sink), but if you run this convolution continuously on all threads, the CPU will throttle its clock back.

I'm still looking for quick tricks for Rust. The latest test is on a Raspberry Pi 4, 64-bit Ubuntu (AArch64), one CPU core, reordered by running time:

zso::convolution:             Duration 27359ms # original, simple Rust
122.28877  127.02979
zicog::convolution_slow:      Duration 27314ms
122.28877  127.02979
bjorn3::convolution:          Duration 26469ms
122.2887  127.029785
zicog::convolution_safe:      Duration 20591ms
122.28877  127.02979
dodomorandi::convolution:     Duration 20585ms
122.28877  127.02979
alice::convolution_serial:    Duration 6514ms
122.28871  127.029785
pcpthm::convolution:          Duration 6514ms
122.28871  127.029785
zicog::convolution:           Duration 6169ms
122.28871  127.029785
zicog::convolution_fast:      Duration 6169ms
122.28871  127.029785
zso::convolution_ffi:         Duration 5369ms # FFI C code (simple code)
122.28871  127.029785
zso::convolution_ffi_vreg:    Duration 2697ms # FFI C code (optimized code)
122.28871  127.029785

Source codes: https://github.com/ZiCog/rust_convolution
