@rusty_ron I managed to get the simdeez version to at least compile and run after much head-scratching, but I don't get correct results out of it:
use simdeez::*;
use simdeez::avx2::*;
use simdeez::scalar::*;
use simdeez::sse2::*;
use simdeez::sse41::*;
simd_compiletime_generate!(
    pub fn re_re_conv_f32(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
        let len = sample.len() - coeff.len() + 1;
        let mut result: Vec<f32> = Vec::with_capacity(len);
        // Outer loop over output positions; this is the loop that
        // steps by the SIMD width, as discussed below.
        for i in (0..len).step_by(S::VF32_WIDTH) {
            let mut acc = S::set1_ps(0.0);
            // Inner loop: SIMD multiply-accumulate over the coefficients.
            for j in (0..coeff.len()).step_by(S::VF32_WIDTH) {
                let s = S::loadu_ps(&sample[i + j]);
                let c = S::loadu_ps(&coeff[j]);
                acc = S::fmadd_ps(s, c, acc);
            }
            let sum = S::horizontal_add_ps(acc);
            result.push(sum);
        }
        // Remaining values, done in scalar.
        for i in (len + 1) - len % S::VF32_WIDTH..len {
            let sum = (0..coeff.len()).map(|j| sample[i + j] * coeff[j]).sum();
            result.push(sum);
        }
        result
    }
);
The problem is that the "for i" loop does not go around enough times: it steps by S::VF32_WIDTH each iteration but pushes only one result, so it produces a result vector that is far too short.
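To make the mismatch concrete, here is a minimal standalone sketch (the 1000 and 8 are purely illustrative numbers, assuming AVX2's eight f32 lanes):

fn main() {
    // Hypothetical sizes for illustration: 1000 output samples and
    // an AVX2 S::VF32_WIDTH of 8 f32 lanes.
    let len = 1000;
    let width = 8;

    // The outer loop steps by the SIMD width but pushes only one
    // result per iteration, so it produces len/width results:
    let pushed = (0..len).step_by(width).count();
    assert_eq!(pushed, 125); // we actually need `len` (1000) results
}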
If I change the loop to this: "for i in (0..len) {..." then I do get correct results, but the performance is terrible:
test tests::bench_re_re_conv_alice ... bench: 6,085 ns/iter (+/- 296)
test tests::bench_re_re_conv_bjorn3 ... bench: 11,827 ns/iter (+/- 277)
test tests::bench_re_re_conv_dodomorandi ... bench: 9,332 ns/iter (+/- 187)
test tests::bench_re_re_conv_pcpthm ... bench: 6,060 ns/iter (+/- 286)
test tests::bench_re_re_conv_rusty_ron ... bench: 12,840 ns/iter (+/- 293)
test tests::bench_re_re_conv_zicog ... bench: 5,076 ns/iter (+/- 317)
test tests::bench_re_re_conv_zicog_fast ... bench: 5,034 ns/iter (+/- 453)
test tests::bench_re_re_conv_zicog_safe ... bench: 9,150 ns/iter (+/- 263)
test tests::bench_re_re_conv_zso ... bench: 27,308 ns/iter (+/- 834)
rusty_ron's version does not run on ARM thanks to the lack of NEON support in simdeez. But as our OP posted some ARM timings, I thought I'd run all these convolutions on a Jetson Nano:
test tests::bench_re_re_conv_alice ... bench: 23,919 ns/iter (+/- 123)
test tests::bench_re_re_conv_bjorn3 ... bench: 48,014 ns/iter (+/- 124)
test tests::bench_re_re_conv_dodomorandi ... bench: 57,550 ns/iter (+/- 810)
test tests::bench_re_re_conv_pcpthm ... bench: 24,320 ns/iter (+/- 97)
test tests::bench_re_re_conv_zicog ... bench: 23,473 ns/iter (+/- 92)
test tests::bench_re_re_conv_zicog_fast ... bench: 23,438 ns/iter (+/- 77)
test tests::bench_re_re_conv_zicog_safe ... bench: 57,850 ns/iter (+/- 226)
test tests::bench_re_re_conv_zso ... bench: 62,344 ns/iter (+/- 184)
We are a little slower than clang on the ARM (the Rust build first, then the clang-compiled C):
$ time target/release/convolution
119.044853
real 0m6.742s
user 0m6.532s
sys 0m0.120s
dlinano@jetson-nano:~/convolution$ time ./convolution
119.044853
real 0m5.739s
user 0m5.596s
sys 0m0.112s
You're right about the vector lengths. Fixed that. I get around a 3x performance improvement using SIMD on the inner loop. The next thing I would try is ISPC, to see if that gives a further improvement, but that is definitely not an option on ARM.
simd_compiletime_generate!(
    pub fn re_re_conv_f32(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
        let len = sample.len() - coeff.len() + 1;
        let mut result: Vec<f32> = Vec::with_capacity(len);
        for i in 0..len {
            if coeff.len() % S::VF32_WIDTH == 0 {
                // Coefficient count is a whole number of SIMD vectors:
                // multiply-accumulate a vector at a time.
                let mut acc = S::set1_ps(0.0);
                for j in (0..coeff.len()).step_by(S::VF32_WIDTH) {
                    let s = S::loadu_ps(&sample[i + j]);
                    let c = S::loadu_ps(&coeff[j]);
                    acc = S::fmadd_ps(s, c, acc);
                }
                result.push(S::horizontal_add_ps(acc));
            } else {
                // Fall back to a scalar dot product when coeff.len()
                // is not a multiple of the SIMD width.
                let sum = (0..coeff.len()).map(|j| sample[i + j] * coeff[j]).sum();
                result.push(sum);
            }
        }
        result
    }
);
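For what it's worth, when coeff.len() is not a multiple of S::VF32_WIDTH the whole dot product drops back to scalar code. A variant that keeps SIMD for the bulk of the coefficients and handles only the leftover taps in scalar might look like this (an untested sketch using the same simdeez calls as above; the name re_re_conv_f32_tail is purely illustrative):

simd_compiletime_generate!(
    pub fn re_re_conv_f32_tail(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
        let len = sample.len() - coeff.len() + 1;
        // Largest multiple of the SIMD width that fits in coeff.
        let chunks = coeff.len() - coeff.len() % S::VF32_WIDTH;
        let mut result: Vec<f32> = Vec::with_capacity(len);
        for i in 0..len {
            // SIMD multiply-accumulate over the full-width chunks...
            let mut acc = S::set1_ps(0.0);
            for j in (0..chunks).step_by(S::VF32_WIDTH) {
                let s = S::loadu_ps(&sample[i + j]);
                let c = S::loadu_ps(&coeff[j]);
                acc = S::fmadd_ps(s, c, acc);
            }
            // ...then a scalar tail for the remaining coefficients.
            let mut sum = S::horizontal_add_ps(acc);
            for j in chunks..coeff.len() {
                sum += sample[i + j] * coeff[j];
            }
            result.push(sum);
        }
        result
    }
);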
#[test]
fn test_convolution() {
    use std::time::*;
    // SAMPLELEN, COEFFLEN and the scalar reference re_re_conv are
    // defined elsewhere in the crate.
    let sample = (0..SAMPLELEN).map(|v| v as f32 * 0.01).collect::<Vec<_>>();
    let coeff = (0..COEFFLEN).map(|v| v as f32 * 0.01).collect::<Vec<_>>();

    // Time the scalar reference implementation.
    let now = Instant::now();
    let result = re_re_conv(&sample, &coeff);
    println!("Duration {}", now.elapsed().as_millis());
    println!("{} {}", result[0], result[SAMPLELEN - COEFFLEN]);

    // Time the simdeez compile-time-dispatched version.
    let now = Instant::now();
    let result2 = re_re_conv_f32_compiletime(&sample, &coeff);
    println!("Duration {}", now.elapsed().as_millis());
    println!("{} {}", result2[0], result2[SAMPLELEN - COEFFLEN]);

    assert_eq!(result.len(), result2.len());
}
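For reference, SAMPLELEN, COEFFLEN and the scalar baseline re_re_conv are defined elsewhere in my crate, roughly like this (the sizes below are placeholders, not the ones used for the timings, and the baseline shown is just a plain windows/zip/sum sketch that may differ in detail from my actual one):

// Placeholder sizes for illustration only.
const SAMPLELEN: usize = 1_000_000;
const COEFFLEN: usize = 1_000;

// Scalar reference convolution (a sketch of the baseline).
fn re_re_conv(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    sample
        .windows(coeff.len())
        .map(|win| win.iter().zip(coeff).map(|(s, c)| s * c).sum())
        .collect()
}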
The lengths of the two result arrays are the same.
Running this gives:
Duration 13480
4154.1743 249497950
Duration 4810
4154.175 249497900
test tests::test_convolution ... ok
For some reason the "zso - c" version running under Rust/FFI is not as quick as running the compiled C directly. Perhaps we can put this down to different clang versions? Thinking about it, it might actually be because my test initializes the sample and coeff arrays with random numbers between 0.0 and 1.0.
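That is, the timed arrays are set up with something like this (a sketch using the rand crate; random_inputs is just an illustrative name, and the directly-compiled C test presumably uses different data):

use rand::Rng;

// Illustrative helper: fill sample and coeff with uniform random
// values in [0.0, 1.0). All-zero or ramp data can time differently.
fn random_inputs() -> (Vec<f32>, Vec<f32>) {
    let mut rng = rand::thread_rng();
    let sample = (0..SAMPLELEN).map(|_| rng.gen::<f32>()).collect();
    let coeff = (0..COEFFLEN).map(|_| rng.gen::<f32>()).collect();
    (sample, coeff)
}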
My "naive" procedural Rust solution out runs the original C nicely
Sadly, going parallel with rayon on ARM does not scale as well as on x86_64: only a speed-up of about 1.4 over four cores, rather than 7.0.
I have noticed before that going parallel on ARM does not scale as well as on x86. Is that a Rust thing or a Rayon thing? Is there anything we can do about it?
Given that alice's solution was such poetry, performed well, and could be parallelized with amazing ease, I'm finally starting to be sold on the functional programming style.
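For the record, parallelizing that functional style with Rayon is essentially a one-word change; a minimal sketch, assuming the windows/zip/sum formulation from this thread (conv_par is just an illustrative name):

use rayon::prelude::*;

// The serial version uses slice::windows; swapping in Rayon's
// par_windows distributes the output samples across cores.
pub fn conv_par(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    sample
        .par_windows(coeff.len())
        .map(|win| win.iter().zip(coeff).map(|(s, c)| s * c).sum())
        .collect()
}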
I hope I can keep up with the latest programming trends in the next couple of decades as well as you have here - but I doubt it!
Good effort; I hope you enjoy a lot more functional Rust code parallelized with Rayon. It almost seems too good to be true (when it compiles).
Yes, it is huge. It has taken a whole year of Rusting and discussion here for me to start to warm up to the functional style. Thanks to alice's persistence, and for showing it can be done in a concise, readable manner, which nobody else managed to do.
Keep at it. If I can do it, I'm pretty sure anyone can.
I just wonder why all the other proponents of the functional style who have popped up with suggestions on my threads could not produce such elegant solutions. They were not good at selling the idea with their verbose contortions.
The killer feature is of course practical utility, never mind programming style, aesthetics, or syntactic preferences. Being able to get all that multi-core performance so easily is amazing.
Previously I might have attempted such things in C, parallelizing "for" loops with OpenMP, say, which is kind of messy.
I didn't have time to write a short letter, so I wrote a long one instead.
Sometimes it's hard to eloquently phrase what you're trying to do using functional programming, so it turns into an ugly mass of chained ".windows.map.iter.iter.zip.map.fold.collect", lambdas, and unsafes.
Procedural code has similar issues: you'll throw more conditions at a problem, or do funny contortions with loop indices and mutation of temporaries. But remember, you've been writing procedural code for decades and have learned to read the underlying intent and to spot better ways of doing things.
Trying out new paradigms is always mind-bending stuff. You get stuck in the Blub paradox, thinking the new paradigm is full of unnecessary weird constructs, and then it starts to click and you realise that certain problems lend themselves well to a particular way of thinking/coding... At least that's what I felt when learning functional programming (Haskell) and traditional OO (C#), anyway.
ALU: the ALU is fast, but NEON is faster; AArch64 has double the number of NEON registers compared with classic 32-bit ARM, and an out-of-order CPU has at least triple the number of ALUs working in the background.
Cache: the cache is OK, but if you need data to or from RAM it is much slower (in terms of the cache-to-RAM time ratio) than on an Intel CPU. For example, from the ARM Cortex-A53 to the Cortex-A55 the biggest improvement is the speed of RAM access. See: https://bit.ly/31UqKtc
Heat: the cooling is OK for an average application (the RPi 4 needs an external heat sink), but if you run this convolution continuously on all threads, the CPU will throttle its clock back.