How to auto-convert `pmaddwd`-shaped code (e.g., `sum(a[i]*b[i])`) into AVX-accelerated SIMD `pmaddwd` code?

Suppose that you're calculating `sum(a[i]*b[i])`.

A perfect solution is using the `pmaddwd` intrinsic:

#![feature(stdsimd)]
fn dot(a: [i16; 32], b: [i16; 32]) -> i32 {
    // Suppose that the data never overflows, or we could sacrifice accuracy to ensure it.
    let c: [i32; 16] = unsafe {
        core::mem::transmute(core::arch::x86_64::_mm512_madd_epi16(
            core::mem::transmute(a),
            core::mem::transmute(b),
        ))
    };
    c.into_iter().sum()
}
fn main() {
    let mut a: [i16; 32] = [0; 32];
    let mut b: [i16; 32] = a;
    (0..32).for_each(|x| { a[x] = x as i16; b[x] = x as i16 + 1; });
    assert!(31 * 32 * 33 / 3 == dot(a, b));
    a[0] = 32767;
    b[0] = 2;
    // 32767 * 2 would overflow if the multiply were done in i16 instead of i32.
    assert!(31 * 32 * 33 / 3 + 32767 * 2 == dot(a, b));
}

This needs AVX-512 support, a nightly toolchain, AND quite unsafe code (`transmute`!).

Is there any better method to write proper code that calls `*madd_epi16` automatically?

the original version could be

fn dot(a: [i16; 32], b: [i16; 32]) -> i32 {
    let mut c: [i32; 16] = [0; 16];
    for i in 0..16 {
        c[i] = a[i * 2] as i32 * b[i * 2] as i32 + a[i * 2 + 1] as i32 * b[i * 2 + 1] as i32;
//      c[i] >>= CONST; // if a sacrifice of accuracy is needed.
    }
    // Suppose that the data never overflows, or we could sacrifice accuracy to ensure it.
    c.into_iter().sum()
}

Dropping all the `as i32` casts except the last one still generates code that uses `madd_epi16`, but without the widening casts the second assert fails: the multiplication overflows in `i16`.
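To make the overflow concrete, here is a minimal sketch (the function names are mine, for illustration) contrasting the widening multiply with a multiply done entirely in `i16`:

```rust
// Widen to i32 first, then multiply: the full product fits.
fn widened_mul(a: i16, b: i16) -> i32 {
    a as i32 * b as i32
}

// Multiply in i16 first: the product wraps before the widening cast.
// (A plain `a * b` would panic in debug builds; wrapping_mul shows the wrap.)
fn wrapped_mul(a: i16, b: i16) -> i32 {
    a.wrapping_mul(b) as i32
}
```

With `a = 32767` and `b = 2`, `widened_mul` gives 65534, while `wrapped_mul` wraps to -2 — which is why the second assert needs the casts.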

Is there any good idea?
Or, since the usage of dot product is quite rare, is there no way other than writing the `transmute` manually?

The version using unsafe does a lot of copies and generates roughly 3.4x more instructions than the completely safe for-loop version:

The latter version uses AVX-512 instructions when the appropriate compiler codegen flags are provided. The auto-vectorizer is pretty good!

It can probably be optimized more, but I just copy-pasted your code verbatim (sans a minor typo with a missing semicolon) and these are the results I see.

edit: Actually, removing the -C target-feature=avx512bw,avx512vl flag makes both versions produce the same code. And it uses the zmm registers instead of ymm, resulting in even fewer instructions:


It's worth noting that instead of manually reimplementing such a function (for dot product in particular), you have the DPPD/VDPPD, DPPS/VDPPS, VDPBF16PS, VP4DPWSSDS, and VP4DPWSSD instructions. Not sure if all of them are implemented in Rust (they should be, though), but their associated intrinsics are _mm_dp_pd, _mm_dp_ps/_mm256_dp_ps, _mm_dpbf16_ps/_mm_mask_dpbf16_ps/_mm_maskz_dpbf16_ps/_mm256_dpbf16_ps/_mm256_mask_dpbf16_ps/_mm256_maskz_dpbf16_ps/_mm512_dpbf16_ps/_mm512_mask_dpbf16_ps/_mm512_maskz_dpbf16_ps, _mm512_4dpwssds_epi32/_mm512_mask_4dpwssds_epi32/_mm512_maskz_4dpwssds_epi32, and _mm512_4dpwssd_epi32/_mm512_mask_4dpwssd_epi32/_mm512_maskz_4dpwssd_epi32. At least on x86.

I'm not sure how to properly use the AVX-512 instructions, but looking at the documentation for DPPD/VDPPD and DPPS/VDPPS, it looks as though you simply set a bit mask to tell the instruction how to calculate the dot product. For DPPD (and its vectored version), you can set either bit 4 or 5, which specifies which packed doubles to perform the operation on, and then use bits 0 and 1 to tell the processor where to store the final (64-bit) result. For DPPS, you use bits 4-7 for the floats to include, and bits 0-3 to tell the processor where to place the final result. You could broadcast the result to the entire vector by setting all of bits 0-3, but that would be a bit redundant. Either way, once you've finished that operation you just extract the result, which would be an f32.

I don't know how fast those instructions are compared to the normal (more portable) version, but I thought I'd note that instructions for these operations do exist. (Then again, x86 has a ton of instructions that people seem to not use, which is kinda weird...)
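As a sketch of the DPPS path described above (not code from the thread, and only for the f32 case): the stable `_mm_dp_ps` intrinsic takes the imm8 mask as a const generic. With `0xF1`, bits 4-7 select all four lanes for the multiply and bit 0 stores the sum in lane 0. The fallback branch keeps it runnable without SSE4.1 or off x86_64.

```rust
fn dot4(a: [f32; 4], b: [f32; 4]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.1") {
            use core::arch::x86_64::*;
            unsafe {
                let va = _mm_loadu_ps(a.as_ptr());
                let vb = _mm_loadu_ps(b.as_ptr());
                // imm8 = 0xF1: multiply all four lanes, sum into lane 0.
                let r = _mm_dp_ps::<0xF1>(va, vb);
                return _mm_cvtss_f32(r);
            }
        }
    }
    // Portable scalar fallback.
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}
```

Whether DPPS is actually faster than a mul + horizontal add sequence varies by microarchitecture, so measure before committing to it.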

Thanks for your check, and sorry for my carelessness.

I first checked the version that uses an iterator, which could not be optimized into AVX code:

#[inline(never)]
fn dot_product(t: &[i16], s: &[i16]) -> i32 {
    t.chunks(2)
        .zip(s.chunks(2))
        .map(|(t, s)| (t[0] as i32 * s[0] as i32 + t[1] as i32 * s[1] as i32) >> 8)
        .sum()
}

If we use a for-loop rather than map, the auto-vectorization is performed.

I had assumed both versions would generate the same code, since Rust has zero-cost abstractions.

Actually, they don't.
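The thread doesn't show the for-loop slice version that was compared, so this is only a guess at its shape — a hypothetical sketch where computing the trip count up front lets LLVM hoist the bounds checks and vectorize:

```rust
// Hypothetical reconstruction of the for-loop version over slices.
// Indexing against a precomputed `n` (pairs fully inside both slices)
// gives the optimizer a predictable pattern to vectorize.
#[inline(never)]
fn dot_product_loop(t: &[i16], s: &[i16]) -> i32 {
    let n = t.len().min(s.len()) / 2;
    let mut sum = 0i32;
    for i in 0..n {
        sum += (t[2 * i] as i32 * s[2 * i] as i32
            + t[2 * i + 1] as i32 * s[2 * i + 1] as i32)
            >> 8;
    }
    sum
}
```

Note this silently ignores a trailing odd element, which is exactly the semantic difference raised in the reply below about uneven lengths.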

The thing you may have missed is that nothing says the inputs are of even length. So that iterator version can panic in the closure. I suspect the for loop version was ignoring the last item if the lengths were uneven.

If you want to also ignore the last item in the iterator version in that scenario, you might be interested in https://doc.rust-lang.org/nightly/std/primitive.slice.html#method.chunks_exact.
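Applying that suggestion to the iterator version from above is a one-word change — `chunks` becomes `chunks_exact` — and the closure can no longer panic on odd-length input, because any trailing remainder is skipped:

```rust
// Iterator version rewritten with `chunks_exact`, as suggested:
// a trailing odd element is ignored rather than causing an
// out-of-bounds panic inside the closure.
#[inline(never)]
fn dot_product(t: &[i16], s: &[i16]) -> i32 {
    t.chunks_exact(2)
        .zip(s.chunks_exact(2))
        .map(|(t, s)| (t[0] as i32 * s[0] as i32 + t[1] as i32 * s[1] as i32) >> 8)
        .sum()
}
```

Because `chunks_exact` guarantees every chunk has length 2, the compiler can also elide the `t[0]`/`t[1]` bounds checks, which tends to help vectorization.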

(Also, what's the for-loop version you're comparing to? The one I see is fn dot(a:[i16;32], b:[i16;32]), which is completely different from one taking two slices, since the lengths are completely known.)
