```rust
use std::ops::Div;

#[inline(always)]
pub fn div_array<const N: usize, T>(a: &[T; N], b: &[T; N]) -> [T; N]
where
    T: Default + Copy + Div<Output = T>,
{
    let mut p = [T::default(); N];
    for i in 0..N {
        unsafe {
            *p.get_unchecked_mut(i) = *a.get_unchecked(i) / *b.get_unchecked(i);
        }
    }
    p
}

fn main() {
    let a = [1.0; 100];
    let b = [2.0; 100];
    let _c = div_array(&a, &b);
}
```
It's very fast when the array is small: when `len(a) == 10` it costs 4.4 ns, but when `len(a) == 100` it costs 91.375 ns on my CPU.
Test code (a criterion benchmark):

```rust
let array1 = [2.0; 10];
let array2 = [1.0; 10];
// `c` is a `criterion::Criterion`; `black_box` keeps the inputs opaque
c.bench_function("div", |b| {
    b.iter(move || div_array(black_box(&array1), black_box(&array2)))
});
```
What is up with all the `unsafe`? Please only use `unsafe` when there are measurable performance benefits. In this case, as far as I can tell, the function compiles to the exact same assembly if the safe `p[i] = a[i] / b[i];` is used, so obviously you could not have had any measurable performance benefit that would justify the unsafety.
Compiler optimizations will do compiler optimizations, and in this case they are apparently able to eliminate the bounds checks for an index `i` in a loop over `0..N` that indexes fixed-size arrays of length `N`.
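For reference, the safe version being referred to is just (a sketch, same signature as above):

```rust
use std::ops::Div;

#[inline(always)]
pub fn div_array<const N: usize, T>(a: &[T; N], b: &[T; N]) -> [T; N]
where
    T: Default + Copy + Div<Output = T>,
{
    let mut p = [T::default(); N];
    for i in 0..N {
        p[i] = a[i] / b[i]; // safe indexing; the bounds checks get elided
    }
    p
}
```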
Besides this, in some other cases there's also the possibility of eliminating indexing bounds checks by using iterators; in this case, for example, iterators could be used with `zip`:
```rust
let mut p = [T::default(); N];
p.iter_mut().zip(a).zip(b).for_each(|((z, x), y)| *z = *x / *y);
p
```
which seems to result in identical assembly yet again, in this case.
Without being much of a SIMD expert myself, the assembly already looks pretty good to me: it does a bit of loop unrolling and uses SIMD instructions at all, so that each iteration can do 4 divisions using 2 SIMD division instructions, each operating on 2 numbers in parallel.
Speaking of parallel, if you want to divide a lot of numbers quickly, parallel computing could be a useful direction to expand into. 100 is not a lot in this context, I assume, but if an actual use-case of yours is a lot larger (or involves lots of those 100-element arrays), looking into rayon’s parallel iterators could help.
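As a rough sketch of what that could look like (the flat-slice shape and the function name are my own assumptions; `rayon` is assumed as a dependency):

```rust
use rayon::prelude::*;

// Element-wise division spread across a thread pool. Only worthwhile when
// the inputs are large enough to amortize the cost of distributing work.
fn par_div(a: &[f64], b: &[f64], out: &mut [f64]) {
    out.par_iter_mut()
        .zip(a.par_iter().zip(b.par_iter()))
        .for_each(|(z, (x, y))| *z = x / y);
}
```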
The length will not exceed 100. I think I have to try another way to speed up the whole calculation.
```rust
let mut p = [T::default(); N];
```

Maybe this can be reused; there's no need to allocate it on every function call.
I tried easy SIMD with `f64x8`, but it didn't work faster than the loop.
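If you want to try the buffer-reuse idea, a minimal sketch (the naming is mine, not from the thread) is to let the caller own the output and write into it:

```rust
use std::ops::Div;

// Writes a[i] / b[i] into a caller-provided buffer, so the output array
// can be reused across calls instead of being re-created each time.
pub fn div_array_into<const N: usize, T>(a: &[T; N], b: &[T; N], out: &mut [T; N])
where
    T: Copy + Div<Output = T>,
{
    for i in 0..N {
        out[i] = a[i] / b[i]; // bounds checks are elided, as discussed above
    }
}
```

That said, as the replies below point out, the `[T::default(); N]` lives on the stack and is essentially free, so don't expect much from this on its own.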
To get the compiler to automatically generate code using wider SIMD registers, you could compile with non-cross-platform optimizations, e.g. by using the compilation flag `-C target-cpu=native`. That seems to switch to some kind of 256-bit registers on godbolt (and then also unrolls the whole loop for the size-100 case).
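For example, assuming a cargo project, the flag can be passed via the `RUSTFLAGS` environment variable:

```
RUSTFLAGS="-C target-cpu=native" cargo bench
```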
Honestly, this function is so straightforward that it may very well simply be the case that the compiler-optimized solution is already essentially optimal.
How did you get the idea that `[0.0; 100]` is allocated? Stack allocation is no more than a single in-register integer addition, while 90 ns is a reasonable timescale for a single heap allocation. Are you boxing the returned array, or using some sanitizers?
That's interesting, since ABI-wise returning a large non-primitive type is not much different from taking a reference and modifying it.
You can see that within the function the only difference is the single `mov` at the top, which will likely be a no-op due to register renaming. At the call site you need to initialize the value before passing a reference, while that isn't needed to receive a return value; but aside from the call to `memset` for zero-initialization, they're the same.
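In other words, the two shapes being compared look roughly like this (my own reconstruction for godbolt, not code from the thread):

```rust
// Return by value: the caller passes a hidden out-pointer (sret), and the
// single extra mov at the top copies it into the return register.
pub fn by_value(a: &[f64; 100], b: &[f64; 100]) -> [f64; 100] {
    let mut p = [0.0; 100];
    for i in 0..100 {
        p[i] = a[i] / b[i];
    }
    p
}

// Caller-provided reference: the caller must initialize `p` before the call.
pub fn by_ref(a: &[f64; 100], b: &[f64; 100], p: &mut [f64; 100]) {
    for i in 0..100 {
        p[i] = a[i] / b[i];
    }
}
```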
AFAICS, the small case isn't even dividing at all, just filling in the end result directly. So the reason that 100 is more than 10x slower is that it isn't optimized away at compile time.