You're right about the vector lengths; that's fixed now. I get around 3x performance using SIMD on the inner loop. The next thing I would try is ISPC, to see if that gives a further improvement; but that is definitely not an option on ARM.
```rust
simd_compiletime_generate!(
    pub fn re_re_conv_f32(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
        let len = sample.len() - coeff.len() + 1;
        let mut result: Vec<f32> = Vec::with_capacity(len);
        for i in 0..len {
            if coeff.len() % S::VF32_WIDTH == 0 {
                // SIMD path: coefficient count is a multiple of the vector width,
                // so we can fused-multiply-add one vector of taps at a time.
                let mut acc = S::set1_ps(0.0);
                for j in (0..coeff.len()).step_by(S::VF32_WIDTH) {
                    let s = S::loadu_ps(&sample[i + j]);
                    let c = S::loadu_ps(&coeff[j]);
                    acc = S::fmadd_ps(s, c, acc);
                }
                result.push(S::horizontal_add_ps(acc));
            } else {
                // Scalar fallback for coefficient counts that don't divide evenly.
                let sum: f32 = (0..coeff.len()).map(|j| sample[i + j] * coeff[j]).sum();
                result.push(sum);
            }
        }
        result
    }
);
```
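As an aside, the divisible/non-divisible split could be avoided entirely by processing the bulk of the taps a vector-width at a time and finishing each output sample with a scalar tail. A plain-Rust sketch of that shape, with a scalar inner block standing in for the intrinsics and a hypothetical constant `W` standing in for `S::VF32_WIDTH` (the fixed-width bulk loop is also a friendly shape for the autovectorizer):

```rust
fn re_re_conv_tail(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    const W: usize = 8; // stand-in for S::VF32_WIDTH
    let len = sample.len() - coeff.len() + 1;
    (0..len)
        .map(|i| {
            let mut acc = 0.0f32;
            let mut j = 0;
            // Bulk: W taps at a time (this is where the SIMD FMA would go).
            while j + W <= coeff.len() {
                for k in 0..W {
                    acc += sample[i + j + k] * coeff[j + k];
                }
                j += W;
            }
            // Scalar tail: the remaining coeff.len() % W taps.
            while j < coeff.len() {
                acc += sample[i + j] * coeff[j];
                j += 1;
            }
            acc
        })
        .collect()
}
```

That way the fast path is taken for every filter length, not only ones that happen to be a multiple of the vector width.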
```rust
#[test]
fn test_convolution() {
    use std::time::*;
    let sample = (0..SAMPLELEN).map(|v| v as f32 * 0.01).collect::<Vec<_>>(); //vec![0.0f32; SAMPLELEN];
    let coeff = (0..COEFFLEN).map(|v| v as f32 * 0.01).collect::<Vec<_>>(); //vec![0.0f32; COEFFLEN];

    let now = Instant::now();
    let result = re_re_conv(&sample, &coeff);
    println!("Duration {}", now.elapsed().as_millis());
    println!("{} {}", result[0], result[SAMPLELEN - COEFFLEN]);

    let now = Instant::now();
    let result2 = re_re_conv_f32_compiletime(&sample, &coeff);
    assert_eq!(result.len(), result2.len());
    println!("Duration {}", now.elapsed().as_millis());
    println!("{} {}", result2[0], result2[SAMPLELEN - COEFFLEN]);
}
```
The lengths of the arrays are the same. Running this gives:

```
Duration 13480
4154.1743 249497950
Duration 4810
4154.175 249497900
test tests::test_convolution ... ok
```
For some reason the "zso - c" version running under Rust FFI is not as quick as the compiled C run directly. Perhaps we can put that down to different clang versions? Thinking about it, it might actually be because my test initializes the sample and coeff arrays with random numbers between 0.0 and 1.0.
My "naive" procedural Rust solution outruns the original C nicely.
Sadly, going parallel with Rayon on ARM does not scale as well as on x86_64: a speed-up of only about 14 across the cores, rather than the 70 I see on x86_64.
I have noticed before that going parallel on ARM does not scale as well as on x86. Is that a Rust thing or a Rayon thing? Is there anything we can do about it?
Given that alice's solution was such poetry, performed well, and could be parallelized with such amazing ease, I'm finally starting to be sold on the functional programming style.
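For anyone finding this thread later, the functional formulation under discussion looks roughly like this: slide a window over the sample, dot each window with the coefficients, and collect. This is my reconstruction of the shape, not necessarily alice's exact code:

```rust
// Convolution as a pipeline: one window per output sample,
// each reduced to a dot product with the coefficients.
fn re_re_conv(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    sample
        .windows(coeff.len())
        .map(|w| w.iter().zip(coeff).map(|(s, c)| s * c).sum())
        .collect()
}
```

Because each output element is computed independently, this is exactly the kind of iterator chain Rayon can parallelize by swapping in `par_windows`.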
I hope I can keep up with the latest programming trends in the next couple of decades as well as you have here - but I doubt it!
Good effort; I hope you enjoy a lot more functional Rust code parallelized with Rayon. It almost seems too good to be true (when it compiles).
Yes, it is huge. It has taken a whole year of Rusting and discussion here for me to start warming up to the functional style. Thanks to alice's persistence and for showing it can be done in a concise, readable manner, which nobody else managed to do.
Keep at it. If I can do it, I'm pretty sure anyone can.
I just wonder how come all the other proponents of the functional style that have popped up with suggestions on my threads could not produce such elegant solutions. They were not good at selling the idea with their verbose contortions.
The killer feature is of course practical utility, never mind programming style, aesthetics and syntactical preferences. Being able to get all that multi-core performance so easily is amazing.
Previously I might have attempted such things in C, parallelizing "for" loops with OpenMP, say, which is kind of messy.
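To make that "messy" comparison concrete: without Rayon, the OpenMP-style approach means carving the output into chunks yourself and spawning a thread per chunk. A sketch of that manual version using only std scoped threads (the function name and `workers` parameter are mine, purely for illustration; it assumes non-degenerate inputs, i.e. `sample` at least as long as `coeff`):

```rust
use std::thread;

// Hand-rolled data parallelism: split the output range into one chunk per
// worker and convolve each chunk on its own OS thread -- roughly what
// OpenMP's `parallel for` does for you behind the scenes.
fn conv_chunked(sample: &[f32], coeff: &[f32], workers: usize) -> Vec<f32> {
    let len = sample.len() - coeff.len() + 1;
    let mut result = vec![0.0f32; len];
    let chunk = (len + workers - 1) / workers; // ceiling division
    thread::scope(|s| {
        for (w, out) in result.chunks_mut(chunk).enumerate() {
            let start = w * chunk; // index of this chunk's first output sample
            s.spawn(move || {
                for (k, r) in out.iter_mut().enumerate() {
                    let i = start + k;
                    *r = coeff.iter().zip(&sample[i..]).map(|(c, s)| c * s).sum();
                }
            });
        }
    });
    result
}
```

With Rayon the whole scaffolding collapses to swapping an iterator for its `par_` counterpart, which is why the functional formulation parallelizes so painlessly.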
I didn’t have time to write a short letter, so I wrote a long one instead.
Sometimes it's hard to eloquently phrase what you're trying to do using functional programming, so it turns into an ugly mass of chained ".windows.map.iter.iter.zip.map.fold.collect, lambdas and unsafes".
Procedural code has similar issues where you'll throw more conditions at a problem or do funny contortions with i or mutation of temporaries, but remember you've been writing procedural code for decades and have learned to read the underlying intent or better ways of doing things.
Trying out new paradigms is always mind-bending stuff. You get stuck in the Blub paradox thinking it's got all these unnecessary weird constructs, and then it starts to click and you realise that certain problems lend themselves well to a particular way of thinking/coding... At least that's what I felt when learning functional programming (Haskell) or traditional OO (C#), anyway.
ALU is fast ... NEON is faster, and AArch64 has double the number of NEON registers compared with classic 32-bit ARM. An out-of-order CPU also has at least triple the number of ALUs working in the background.
Cache is OK, but if you need data from or to RAM, it is much slower (in cache-to-RAM time ratio) than an Intel CPU. For example, going from the ARM Cortex-A53 to the Cortex-A55, the biggest improvement is the speed of RAM. See: https://bit.ly/31UqKtc
Heat ... the cooling is OK for an average application (the RPi 4 needs an external heat sink). But if you run this convolution continuously on all threads, the CPU will throttle its clock back.