How can I speed up the division?

I want to compute a div b element-wise and get a new array.

use std::ops::Div;

#[inline(always)]
pub fn div_array<const N: usize, T>(a: &[T; N], b: &[T; N]) -> [T; N]
where
    T: Default + Copy + Div<Output = T>,
{
    let mut p = [T::default(); N];
    for i in 0..N {
        // Skip the bounds checks on every element access.
        unsafe {
            *(p.get_unchecked_mut(i)) = *(a.get_unchecked(i)) / *(b.get_unchecked(i));
        }
    }
    p
}

fn main() {
    let a = [1.0; 100];
    let b = [2.0; 100];
    let _c = div_array(&a, &b);
}

It's very fast when the array is small: when len(a) == 10 it costs 4.4 ns, but when len(a) == 100 it costs 91.375 ns on my CPU.

Test code:

let array1 = [2.0; 10];
let array2 = [1.0; 10];
c.bench_function("div", |b| {
    b.iter(move || {
        div_array(black_box(&array1), black_box(&array2))
    })
});
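For completeness, this snippet is a Criterion benchmark; a self-contained harness around it might look like the following (assuming criterion as a dev-dependency and div_array in scope; bench_div is just a name for this sketch):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Benchmarks the element-wise division; black_box keeps the optimizer
// from treating the inputs as known constants.
fn bench_div(c: &mut Criterion) {
    let array1 = [2.0; 10];
    let array2 = [1.0; 10];
    c.bench_function("div", |b| {
        b.iter(move || div_array(black_box(&array1), black_box(&array2)))
    });
}

criterion_group!(benches, bench_div);
criterion_main!(benches);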

Are there any ways to speed up this div_array?

What is up with all the unsafe? Please only use unsafe when there are measurable performance benefits. In this case, as far as I can tell, the function compiles to the exact same assembly when the safe p[i] = a[i] / b[i]; is used, so obviously you could not have had any measurable performance benefit that would justify the unsafety.


p[i] = a[i] / b[i] will do a bounds check, won't it?
Or is that my misunderstanding?

Compiler optimizations will do compiler optimizations, and in this case they are apparently able to eliminate the bounds checks for an index i in a loop over 0..N indexing fixed-size arrays of length N.
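For concreteness, this is the same function with the safe body in question; the compiler can prove every access is in bounds, so no checks are emitted:

use std::ops::Div;

// Safe version: `i` ranges over 0..N and the arrays have length N,
// so the optimizer removes all bounds checks.
pub fn div_array<const N: usize, T>(a: &[T; N], b: &[T; N]) -> [T; N]
where
    T: Default + Copy + Div<Output = T>,
{
    let mut p = [T::default(); N];
    for i in 0..N {
        p[i] = a[i] / b[i];
    }
    p
}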

Besides this, in some other cases there's also the possibility of eliminating indexing bounds checks by using iterators; e.g. in this case, iterators could be used with zip:

let mut p = [T::default(); N];
p.iter_mut().zip(a).zip(b).for_each(|((z, x), y)| *z = *x / *y);
p

which seems to result in identical assembly yet again, in this case.


I made a test; they are equivalent. Why...

Got it...

Without being much of a SIMD expert myself, the assembly already looks pretty good to me: it does a bit of loop unrolling and does use SIMD instructions, so that each iteration can do 4 divisions, using 2 SIMD division instructions, each operating on 2 numbers in parallel.

Speaking of parallel, if you want to divide a lot of numbers quickly, parallel computing could be a useful direction to expand into. 100 is not a lot in this context, I assume, but if an actual use-case of yours is a lot larger (or involves lots of those 100-element arrays), looking into rayon’s parallel iterators could help.
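A minimal sketch of that direction, assuming rayon is added as a dependency (the function name and input shape here are just illustrative):

use rayon::prelude::*;

// Hypothetical workload: many independent pairs of 100-element arrays,
// divided element-wise across rayon's thread pool.
fn div_many(pairs: &[([f64; 100], [f64; 100])]) -> Vec<[f64; 100]> {
    pairs
        .par_iter()
        .map(|(a, b)| {
            let mut p = [0.0; 100];
            for i in 0..100 {
                p[i] = a[i] / b[i];
            }
            p
        })
        .collect()
}

For a single 100-element array the thread overhead would dominate, though; this only pays off across many arrays.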

The length will not exceed 100; I think I have to try another way to speed up the whole calculation.

let mut p = [T::default(); N];

This maybe can be reused; there's no need to allocate it in every function call.

I tried simple SIMD with f64x8, but it does not work faster than the loop.

To get the compiler to automatically generate code using wider SIMD registers, you could compile with non-cross-platform optimizations, as enabled e.g. by the compilation flag -C target-cpu=native. That seems to switch to some kind of 256-bit registers on Godbolt (and it also unrolled the whole loop for size 100).
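For example, it can be set through the RUSTFLAGS environment variable for a bench run (note the resulting binary will only run on CPUs with the same feature set):

RUSTFLAGS="-C target-cpu=native" cargo bench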

I already tried this, lol.

Ah, never mind then if you're already using it :)

Honestly, this function is so straightforward that it may very well simply be the case that the compiler-optimized solution is already essentially optimal.

Allocating [0.0; 100] takes up most of the time.

The whole calculation now costs 1.6+ µs, and allocating a [0.0; 100] array costs 90+ ns. Honestly, it's a really painful optimization process.

You can avoid that by passing a third array into the function as &mut [T; N].

Though there really isn't much to optimize here in the first place, especially if there are never going to be more than 100 items.
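A minimal sketch of that signature (it matches the benchmark call further down; Default is no longer needed since the caller owns the buffer):

use std::ops::Div;

// Out-parameter variant: the caller provides the output buffer, so the
// function creates (and zero-initializes) nothing on each call.
pub fn div_array<const N: usize, T>(out: &mut [T; N], a: &[T; N], b: &[T; N])
where
    T: Copy + Div<Output = T>,
{
    for i in 0..N {
        out[i] = a[i] / b[i];
    }
}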

Yeah, I tried this now, and it works in some cases.

How did you arrive at the conclusion that allocating [0.0; 100] is what takes the time? Stack allocation is no more than a single in-register integer addition, while 90 ns is a reasonable timescale for a single heap allocation. Are you boxing the returned array, or using some sanitizers?

My description here was ambiguous: the whole function costs 90 ns when I allocate the array inside the function, but when I pass a &mut array instead, it only costs 44 ns.

That's interesting, since ABI-wise, returning a large non-primitive type is not much different from taking a reference and modifying it: the return value is written through a hidden out-pointer supplied by the caller anyway.

You can see that within the function the only difference is a single mov at the top, which will likely be a no-op due to register renaming. At the call site you need to initialize the value before passing the reference, while that's not needed to receive a return value; but besides the call to memset for the zero-initialization, they're the same.
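For anyone who wants to compare on Godbolt, a minimal pair like this shows the two shapes being discussed (hypothetical names, f64 fixed for simplicity):

// Return by value: the result is still written through a hidden
// out-pointer (sret) supplied by the caller.
pub fn by_return(a: &[f64; 100], b: &[f64; 100]) -> [f64; 100] {
    let mut p = [0.0; 100];
    for i in 0..100 {
        p[i] = a[i] / b[i];
    }
    p
}

// Write through a caller-provided buffer: no fresh array is created
// inside the function.
pub fn by_out(p: &mut [f64; 100], a: &[f64; 100], b: &[f64; 100]) {
    for i in 0..100 {
        p[i] = a[i] / b[i];
    }
}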

AFAICS, the small case isn't even dividing at all, just filling in the end result directly. So the reason that 100 is more than 10x slower is that the computation isn't optimized away at compile time.
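If the inputs are compile-time constants (as in the standalone main earlier), the optimizer can precompute the whole result; a sketch of guarding against that with std::hint::black_box, assuming the div_array from above:

use std::hint::black_box;

fn main() {
    // black_box makes the inputs opaque to the optimizer, so the
    // division is actually performed at run time.
    let a = black_box([2.0f64; 10]);
    let b = black_box([1.0f64; 10]);
    let c = div_array(&a, &b);
    black_box(c);
}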

Here is my testing code:

let array1 = [2.0; 100];
let array2 = [1.0; 100];
let mut res = [0.0; 100];
c.bench_function("div", |b| {
    b.iter(move || {
        div_array(black_box(&mut res), black_box(&array1), black_box(&array2))
    })
});

I'm also curious about the assembly code; I have a feeling that the compiler does an equivalent optimization on this.

Wow! Is it possible to remind the compiler to optimize?