How can I speed up the division?

I want to compute a div b element-wise and get a new array.

use std::ops::Div;

#[inline(always)]
pub fn div_array<const N: usize, T>(a: &[T; N], b: &[T; N]) -> [T; N]
where
    T: Default + Copy + Div<Output = T>,
{
    let mut p = [T::default(); N];
    for i in 0..N {
        // Skip the bounds checks on every element access.
        unsafe {
            *(p.get_unchecked_mut(i)) = *(a.get_unchecked(i)) / *(b.get_unchecked(i));
        }
    }
    p
}

fn main() {
    let a = [1.0; 100];
    let b = [2.0; 100];
    let _c = div_array(&a, &b);
}

It's very fast when the array is small: when len(a) == 10 it costs 4.4 ns, but when len(a) == 100 it costs 91.375 ns on my CPU.

Test code:

let array1 = [2.0; 10];
let array2 = [1.0; 10];
c.bench_function("div", |b| {
    b.iter(move || {
        div_array(black_box(&array1), black_box(&array2))
    })
});
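For completeness, this snippet is a Criterion benchmark; a self-contained harness around it might look like the following (assuming criterion as a dev-dependency and div_array in scope; bench_div is just a name for this sketch):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Benchmarks the element-wise division; black_box keeps the optimizer
// from treating the inputs as known constants.
fn bench_div(c: &mut Criterion) {
    let array1 = [2.0; 10];
    let array2 = [1.0; 10];
    c.bench_function("div", |b| {
        b.iter(move || div_array(black_box(&array1), black_box(&array2)))
    });
}

criterion_group!(benches, bench_div);
criterion_main!(benches);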

Are there any ways to speed up this div_array?

What is up with all the unsafe? Please only use unsafe when there are measurable performance benefits. In this case, as far as I can tell, the function compiles to the exact same assembly when the safe p[i] = a[i] / b[i]; is used, so obviously you could not have had any measurable performance benefit that would justify the unsafety.


p[i] = a[i] / b[i] will do a bounds check, won't it?
Or is that my misunderstanding?

Compiler optimizations will do compiler optimizations, and in this case they are apparently able to eliminate the bounds checks for an index i in a loop over 0..N indexing fixed-size arrays of length N.
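For concreteness, this is the same function with the safe body in question; the compiler can prove every access is in bounds, so no checks are emitted:

use std::ops::Div;

// Safe version: `i` ranges over 0..N and the arrays have length N,
// so the optimizer removes all bounds checks.
pub fn div_array<const N: usize, T>(a: &[T; N], b: &[T; N]) -> [T; N]
where
    T: Default + Copy + Div<Output = T>,
{
    let mut p = [T::default(); N];
    for i in 0..N {
        p[i] = a[i] / b[i];
    }
    p
}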

Besides this, in some other cases there's also the possibility of eliminating indexing bounds checks by using iterators; e.g. in this case, iterators could be used with zip:

let mut p = [T::default(); N];
p.iter_mut().zip(a).zip(b).for_each(|((z, x), y)| *z = *x / *y);
p

which seems to result in identical assembly yet again, in this case.


I made a test; they are equivalent. Why...

Got it...

Without being much of a SIMD expert myself, the assembly already looks pretty good to me: it does a bit of loop unrolling and does use SIMD instructions, so that each iteration can do 4 divisions, using 2 SIMD division instructions, each operating on 2 numbers in parallel.

Speaking of parallel, if you want to divide a lot of numbers quickly, parallel computing could be a useful direction to expand into. 100 is not a lot in this context, I assume, but if an actual use-case of yours is a lot larger (or involves lots of those 100-element arrays), looking into rayon’s parallel iterators could help.
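A minimal sketch of that direction, assuming rayon is added as a dependency (the function name and input shape here are just illustrative):

use rayon::prelude::*;

// Hypothetical workload: many independent pairs of 100-element arrays,
// divided element-wise across rayon's thread pool.
fn div_many(pairs: &[([f64; 100], [f64; 100])]) -> Vec<[f64; 100]> {
    pairs
        .par_iter()
        .map(|(a, b)| {
            let mut p = [0.0; 100];
            for i in 0..100 {
                p[i] = a[i] / b[i];
            }
            p
        })
        .collect()
}

For a single 100-element array the thread overhead would dominate, though; this only pays off across many arrays.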

The length will not exceed 100; I think I have to try another way to speed up the whole calculation.

let mut p = [T::default(); N];

This maybe can be reused; there's no need to allocate it in every function call.

I tried simple SIMD with f64x8, but it does not work faster than the loop.

To get the compiler to automatically generate code using wider SIMD registers, you could compile with non-cross-platform optimizations, as enabled e.g. by the compilation flag -C target-cpu=native. That seems to switch to some kind of 256-bit registers on Godbolt (and it also unrolled the whole loop for size 100).
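For example, it can be set through the RUSTFLAGS environment variable for a bench run (note the resulting binary will only run on CPUs with the same feature set):

RUSTFLAGS="-C target-cpu=native" cargo bench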

I already tried this, lol.

Ah, never mind then if you're already using it :)

Honestly, this function is so straightforward that it may very well simply be the case that the compiler-optimized solution is already essentially optimal.

Allocating [0.0; 100] takes up most of the time.

The whole calculation now costs 1.6+ µs, and allocating a [0.0; 100] array costs 90+ ns. Honestly, it's a really painful optimization process.

You can avoid that by passing a third array into the function as &mut [T; N].

Though there really isn't much to optimize here in the first place, especially if there are never going to be more than 100 items.
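A minimal sketch of that signature (it matches the benchmark call further down; Default is no longer needed since the caller owns the buffer):

use std::ops::Div;

// Out-parameter variant: the caller provides the output buffer, so the
// function creates (and zero-initializes) nothing on each call.
pub fn div_array<const N: usize, T>(out: &mut [T; N], a: &[T; N], b: &[T; N])
where
    T: Copy + Div<Output = T>,
{
    for i in 0..N {
        out[i] = a[i] / b[i];
    }
}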

Yeah, I tried this now, and it works in some cases.

How did you arrive at the conclusion that allocating [0.0; 100] is what takes the time? Stack allocation is no more than a single in-register integer addition, while 90 ns is a reasonable timescale for a single heap allocation. Are you boxing the returned array, or using some sanitizers?

My description here was ambiguous: the whole function costs 90 ns when I allocate the array inside the function, but when I pass a &mut array instead, it only costs 44 ns.

That's interesting, since ABI-wise, returning a large non-primitive type is not much different from taking a reference and modifying it: the return value is written through a hidden out-pointer supplied by the caller anyway.

You can see that within the function the only difference is a single mov at the top, which will likely be a no-op due to register renaming. At the call site you need to initialize the value before passing the reference, while that's not needed to receive a return value; but besides the call to memset for the zero-initialization, they're the same.
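For anyone who wants to compare on Godbolt, a minimal pair like this shows the two shapes being discussed (hypothetical names, f64 fixed for simplicity):

// Return by value: the result is still written through a hidden
// out-pointer (sret) supplied by the caller.
pub fn by_return(a: &[f64; 100], b: &[f64; 100]) -> [f64; 100] {
    let mut p = [0.0; 100];
    for i in 0..100 {
        p[i] = a[i] / b[i];
    }
    p
}

// Write through a caller-provided buffer: no fresh array is created
// inside the function.
pub fn by_out(p: &mut [f64; 100], a: &[f64; 100], b: &[f64; 100]) {
    for i in 0..100 {
        p[i] = a[i] / b[i];
    }
}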

AFAICS, the small case isn't even dividing at all, just filling in the end result directly. So the reason that 100 is more than 10x slower is that the computation isn't optimized away at compile time.
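If the inputs are compile-time constants (as in the standalone main earlier), the optimizer can precompute the whole result; a sketch of guarding against that with std::hint::black_box, assuming the div_array from above:

use std::hint::black_box;

fn main() {
    // black_box makes the inputs opaque to the optimizer, so the
    // division is actually performed at run time.
    let a = black_box([2.0f64; 10]);
    let b = black_box([1.0f64; 10]);
    let c = div_array(&a, &b);
    black_box(c);
}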

Here is my testing code:

let array1 = [2.0; 100];
let array2 = [1.0; 100];
let mut res = [0.0; 100];
c.bench_function("div", |b| {
    b.iter(move || {
        div_array(black_box(&mut res), black_box(&array1), black_box(&array2))
    })
});

I'm also curious about the assembly code; I have a feeling that the compiler does an equivalent optimization on this.

Wow! Is it possible to remind the compiler to optimize?