Sorry for my carelessness.
I first checked the version that uses iterators, which does not get optimized into AVX code:
#[inline(never)]
fn dot_product(t: &[i16], s: &[i16]) -> i32 {
    t.chunks(2)
        .zip(s.chunks(2))
        .map(|(t, s)| (t[0] as i32 * s[0] as i32 + t[1] as i32 * s[1] as i32) >> 8)
        .sum()
}
If we use a for-loop rather than map, the auto-vectorization is performed.
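A minimal sketch of the for-loop shape I mean (same per-pair semantics as the iterator version above: widening multiply, add, then `>> 8`; any odd trailing element is simply ignored here):

```rust
// For-loop variant with the same per-pair arithmetic as the
// iterator version: widen to i32, multiply, add, shift right by 8.
#[inline(never)]
fn dot_product_loop(t: &[i16], s: &[i16]) -> i32 {
    let n = t.len().min(s.len()) / 2;
    let mut sum = 0i32;
    for i in 0..n {
        sum += (t[2 * i] as i32 * s[2 * i] as i32
            + t[2 * i + 1] as i32 * s[2 * i + 1] as i32)
            >> 8;
    }
    sum
}

fn main() {
    // (16*16 + 16*16) >> 8 == 2 per pair; two pairs -> 4.
    let t = [16i16; 4];
    let s = [16i16; 4];
    assert_eq!(dot_product_loop(&t, &s), 4);
}
```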
I had assumed both versions would generate the same code, since Rust has zero-cost abstractions.
Actually, they don't.
Suppose you're calculating sum(a[i]*b[i]).
A perfect solution is to use the pmaddwd intrinsic:
#![feature(stdsimd)]
fn dot(a: [i16; 32], b: [i16; 32]) -> i32 {
    // Suppose the data never overflows, or we could sacrifice accuracy to ensure it.
    let c: [i32; 16] = unsafe {
        core::mem::transmute(core::arch::x86_64::_mm512_madd_epi16(
            core::mem::transmute(a),
            core::mem::transmute(b),
        ))
    };
    c.into_iter().sum()
}

fn main() {
    let mut a: [i16; 32] = [0; 32];
    let mut b: [i16; 32] = a;
    (0..32).for_each(|x| {
        a[x] = x as i16;
        b[x] = x as i16 + 1;
    });
    assert!(31 * 32 * 33 / 3 == dot(a, b));
    a[0] = 32767;
    b[0] = 2;
    // May overflow with i16*i16 if i32 is not used to store the multiplication result.
    assert!(31 * 32 * 33 / 3 + 32767 * 2 == dot(a, b));
}
This needs AVX-512 support, a nightly build, AND quite unsafe code (transmute!).
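Not a complete answer, but one way to drop the nightly requirement and most of the transmutes (a sketch; the helper names `dot_avx2` and `dot_scalar` are mine) is to use AVX2's `_mm256_madd_epi16`, which is stable, together with unaligned loads and runtime feature detection:

```rust
// Sketch: stable Rust, AVX2 (vpmaddwd at 256-bit width) instead of
// AVX-512, unaligned loads instead of transmute. Falls back to scalar
// code on non-x86_64 targets or when AVX2 is absent.
fn dot(a: &[i16; 32], b: &[i16; 32]) -> i32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Sound: we just checked that the CPU supports AVX2.
            return unsafe { dot_avx2(a, b) };
        }
    }
    dot_scalar(a, b)
}

fn dot_scalar(a: &[i16; 32], b: &[i16; 32]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(a: &[i16; 32], b: &[i16; 32]) -> i32 {
    use core::arch::x86_64::*;
    let mut acc = _mm256_setzero_si256();
    for i in 0..2 {
        // Each __m256i holds 16 i16 lanes; the arrays hold 32 elements.
        let va = _mm256_loadu_si256(a.as_ptr().add(i * 16) as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr().add(i * 16) as *const __m256i);
        // vpmaddwd: widening i16*i16 multiply, then pairwise i32 add.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    let mut lanes = [0i32; 8];
    _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, acc);
    lanes.iter().sum()
}

fn main() {
    let mut a = [0i16; 32];
    let mut b = [0i16; 32];
    for x in 0..32 {
        a[x] = x as i16;
        b[x] = x as i16 + 1;
    }
    assert_eq!(dot(&a, &b), 31 * 32 * 33 / 3);
    a[0] = 32767;
    b[0] = 2;
    assert_eq!(dot(&a, &b), 31 * 32 * 33 / 3 + 32767 * 2);
}
```

This trades the 512-bit width for a stable toolchain; pmaddwd is the same operation, just over 16 lanes per instruction instead of 32.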
Is there a better way to write proper code that calls *madd_epi16 automatically?
The original version could be:
fn dot(a: [i16; 32], b: [i16; 32]) -> i32 {
    let mut c: [i32; 16] = [0; 16];
    for i in 0..16 {
        c[i] = (a[i * 2] as i32 * b[i * 2] as i32 + a[i * 2 + 1] as i32 * b[i * 2 + 1] as i32) as i32;
        // c[i] >>= CONST; // if a sacrifice of accuracy is needed.
    }
    // Suppose the data never overflows, or we could sacrifice accuracy to ensure it.
    c.into_iter().sum()
}
Dropping all the `as i32` casts except the last one does make the compiler generate code that uses madd_epi16, but without the casts it cannot pass the second assert, since i16*i16 overflows.
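The overflow is easy to demonstrate in isolation:

```rust
// Why the casts matter: 32767 * 2 does not fit in i16.
fn main() {
    let a: i16 = 32767;
    let b: i16 = 2;
    // In i16 arithmetic the product wraps modulo 2^16
    // (a plain `a * b` would panic in a debug build):
    assert_eq!(a.wrapping_mul(b), -2);
    // The widening multiply, as pmaddwd performs it, keeps the exact value:
    assert_eq!(a as i32 * b as i32, 65534);
}
```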
Any good ideas?
Or, since this usage of dot products is quite rare, is there no way other than writing the transmute manually?