What code does packed_simd generate for the unsupported vector width?

Disclaimer: I just started to dive into SIMD.

I think my MacBook Pro doesn't support AVX-512:

$ sysctl -a
...
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
...
hw.optional.avx512f: 0
hw.optional.avx512cd: 0
hw.optional.avx512dq: 0
hw.optional.avx512bw: 0
hw.optional.avx512vl: 0
hw.optional.avx512ifma: 0
hw.optional.avx512vbmi: 0
...

However, the following code builds and runs fine:

use packed_simd::f64x8;
fn main() {
    let a = f64x8::splat(0.0);
    let b = f64x8::splat(0.0);
    println!("{}", (a + b).sum());
}
$ cat .cargo/config
[build]
rustflags = ["-C", "target-cpu=native"]

I wonder if it's compiled down to narrower vectors or to scalars? In other words, when using packed_simd, can I just pick the width most suitable for my task without worrying, or should I still care about target features to get the most out of it?

Rust doesn't use SIMD (other than the baseline SSE2 on x86-64) unless you enable it with the target-cpu or target-feature flags. So you currently probably get no AVX at all.
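If you'd rather not bake the CPU into the binary with compile-time flags, you can also probe features at runtime with std's is_x86_feature_detected! macro. A sketch (the lane-count heuristic here is my own illustration, not something packed_simd provides):

```rust
// Pick a plausible f64 lane count based on runtime CPU feature detection.
// is_x86_feature_detected! only exists on x86/x86_64, hence the cfg split.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn widest_f64_lanes() -> usize {
    if is_x86_feature_detected!("avx512f") {
        8 // 512-bit zmm registers available
    } else if is_x86_feature_detected!("avx") {
        4 // 256-bit ymm registers available
    } else {
        2 // SSE2 is the x86-64 baseline
    }
}

#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn widest_f64_lanes() -> usize {
    2 // conservative default on other architectures
}

fn main() {
    println!("widest f64 lane count here: {}", widest_f64_lanes());
}
```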

Thanks for the heads up, I'll update the post to mention that I use -C target-cpu=native.

In that case the compiler knows your CPU doesn't have AVX-512 and won't use it.

You can see the generated assembly at https://rust.godbolt.org/ (be sure to set -O and a specific target-cpu flag).

What I'm trying to understand is how exactly the compiler avoids using it: by falling back to the widest available vectors (256-bit, 128-bit, whatever), or by going straight to scalars?

Rust's higher-level SIMD translates to LLVM intrinsics, and then you get whatever LLVM thinks is the optimal implementation. It's best to check the actual assembly on godbolt to be sure.


I haven't found a way to use packed_simd on godbolt, so I compiled it locally like this instead:

cargo +nightly rustc --release -- --emit asm

I also had to change the test program a bit to keep the overly smart compiler from doing the computation at compile time:

#![feature(test)]

use packed_simd::f64x8;

fn main() {
    let a = core::hint::black_box(f64x8::new(0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0));
    let b = core::hint::black_box(f64x8::new(8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0));
    println!("{:?}", (a + b));
}

And indeed, it seems the vector width is downgraded to the next available width rather than SIMD being abandoned altogether:

...
        vmovaps LCPI5_0(%rip), %ymm0
        vmovaps %ymm0, 96(%rsp)
        vmovaps LCPI5_1(%rip), %ymm0
        vmovaps %ymm0, 64(%rsp)
        leaq    64(%rsp), %rax
        ## InlineAsm Start
        ## InlineAsm End
        vmovapd 64(%rsp), %ymm0
        vmovapd 96(%rsp), %ymm1
        vmovaps LCPI5_2(%rip), %ymm2
        vmovaps %ymm2, 160(%rsp)
        vmovaps LCPI5_3(%rip), %ymm2
        vmovaps %ymm2, 128(%rsp)
        leaq    128(%rsp), %rax
        ## InlineAsm Start
        ## InlineAsm End
        vaddpd  128(%rsp), %ymm0, %ymm0
        vaddpd  160(%rsp), %ymm1, %ymm1
        vmovapd %ymm1, 224(%rsp)
        vmovapd %ymm0, 192(%rsp)
...
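So the f64x8 addition is split into two 256-bit vaddpd instructions on ymm registers. For reference, here is the lane-wise result the type guarantees regardless of how LLVM lowers it (a plain scalar sketch, not packed_simd code):

```rust
// Scalar reference for what the f64x8 addition computes, independent of
// whether LLVM emits zmm, ymm, xmm, or scalar instructions for it.
fn add8(a: [f64; 8], b: [f64; 8]) -> [f64; 8] {
    let mut out = [0.0; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}

fn main() {
    // Same values as in the test program above.
    let a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0];
    let b = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0];
    println!("{:?}", add8(a, b));
    // prints [8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0]
}
```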