How to get good code-gen with core::arch intrinsics?

i'm trying to implement a simd wrapper, because core::simd doesn't work for me.
i need precise control over which specific instructions get generated. i can't have min generate five instructions, it has to be minps and minps only.

the problem is that code-gen is pretty bad for my F32x2.
F32x2 is only 8 bytes, not a full sse vector. meaning, i need to load into an xmm register and store back to perform an operation.

use core::arch::x86_64::*;

#[derive(Clone, Copy, Debug)]
#[repr(transparent)]
pub struct B32x2 (u64);

#[derive(Clone, Copy, Debug)]
#[repr(transparent)]
pub struct F32x2 (f64);

impl F32x2 {
    #[inline(always)]
    pub fn eq(self, other: F32x2) -> B32x2 {
        unsafe {
            // _mm_load_sd loads the low 8 bytes and zeroes the upper lanes
            let a = _mm_castpd_ps(_mm_load_sd(&self.0));
            let b = _mm_castpd_ps(_mm_load_sd(&other.0));
            let r = _mm_cmpeq_ps(a, b);
            let r = _mm_castps_pd(r);
            // pull the low 8 bytes of the result back out of the register
            let r = _mm_cvtsd_f64(r);
            B32x2(core::mem::transmute(r))
        }
    }
}

impl core::ops::Add<F32x2> for F32x2 {
    type Output = F32x2;

    #[inline(always)]
    fn add(self, other: F32x2) -> F32x2 {
        unsafe {
            let a = _mm_castpd_ps(_mm_load_sd(&self.0));
            let b = _mm_castpd_ps(_mm_load_sd(&other.0));
            let r = _mm_add_ps(a, b);
            let r = _mm_castps_pd(r);
            let r = _mm_cvtsd_f64(r);
            F32x2(r)
        }
    }
}

sadly llvm gets confused and generates unnecessary moves:

#[inline(never)]
fn test(a: simd::F32x2, b: simd::F32x2) -> simd::B32x2 {
    (a + a).eq(b)
}

generates:

00007FF73AC31B50 0F 58 C0             addps       xmm0,xmm0  
00007FF73AC31B53 F3 0F 7E C0          movq        xmm0,xmm0  ; nop
00007FF73AC31B57 F3 0F 7E C9          movq        xmm1,xmm1  ; nop
00007FF73AC31B5B 0F C2 C8 00          cmpeqps     xmm1,xmm0  
00007FF73AC31B5F 66 48 0F 7E C8       movq        rax,xmm1  
00007FF73AC31B64 C3                   ret

a simple dead code elim pass would solve the issue, but i can't really change the compiler.

so, any ideas for what i could do?

i have tried repr(simd) and some other stuff (like removing the transmute and instead storing an i64 and using _mm_castps_si128 with _mm_cvtsi128_si64); none of that worked. pretty sure the problem comes from the _mm_load_sd instructions.
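for reference, the i64 variant looked roughly like this (reconstructed from memory, so details may differ):

use core::arch::x86_64::*;

#[inline(always)]
pub unsafe fn eq_via_i64(a: f64, b: f64) -> i64 {
    let a = _mm_castpd_ps(_mm_load_sd(&a));
    let b = _mm_castpd_ps(_mm_load_sd(&b));
    let r = _mm_cmpeq_ps(a, b);
    // leave the register file through the integer domain instead of
    // transmuting an f64
    _mm_cvtsi128_si64(_mm_castps_si128(r))
}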

and please remember, i need precise control; core::simd is not an option. overriding only the functions that don't do what i want (eg: min) isn't an option either, because 1) i get the same problem with the redundant movqs, as i've outlined above for my F32x2, and 2) there is no guarantee that newer versions of the compiler will generate the same code for the functions that don't currently cause problems.


i have also tried using an array - same issue.
and i'd prefer using u64/f64 because those get passed in registers.

the solution really seems to be as simple as having a nop-remove pass.
maybe i could have rust generate object files and then perform link time optimization? though i don't know how i'd do that.

#[inline(never)]
fn test(a: simd::F32x2, b: simd::F32x2) -> simd::B32x2 {
    (a + a).eq(b + b + a + b)
}

generates:

00007FF6D5701B50 0F 28 D1             movaps      xmm2,xmm1  
00007FF6D5701B53 0F 58 D1             addps       xmm2,xmm1  
00007FF6D5701B56 0F 58 D0             addps       xmm2,xmm0  
00007FF6D5701B59 0F 58 C0             addps       xmm0,xmm0  
00007FF6D5701B5C 0F 58 D1             addps       xmm2,xmm1  
00007FF6D5701B5F F3 0F 7E C0          movq        xmm0,xmm0  
00007FF6D5701B63 F3 0F 7E CA          movq        xmm1,xmm2  
00007FF6D5701B67 0F C2 C8 00          cmpeqps     xmm1,xmm0  
00007FF6D5701B6B 66 48 0F 7E C8       movq        rax,xmm1  
00007FF6D5701B70 C3                   ret

so it only seems to happen in eq (also happens for just a.eq(b)).

movq xmm0, xmm0 also isn't really a nop, because it does clear the high lanes (iirc).
but it would still be safe to remove it here, because the high lanes are never used.

i don't know how llvm handles FP exceptions. that could be another reason why it clears out the upper lanes.
ideally i'd like to tell the compiler: "i don't care about the high lanes"

i guess another non-ideal solution would be to use F32x4 and always load & store manually.
but core::simd doesn't have the problem, so i know this is theoretically possible.

If you need such precise control over the instructions, wouldn’t inline assembly be a better option? Even if you eventually goad LLVM into emitting the right instructions, there’s zero guarantee it will continue to do so in future.


well, i don't want to do register allocation by hand.
when you're using ISA intrinsics, rustc had better guarantee that it actually generates those instructions.

i've found a solution, i think:

        use core::mem::MaybeUninit;
        unsafe {
            // pad to 16 bytes without initializing the upper half
            let mut m: [MaybeUninit<F32x2>; 2] = [MaybeUninit::uninit(); 2];
            m[0].write(self);
            core::mem::transmute(m)
        }

this gets rid of the unnecessary movqs. the high lanes really seem to be the issue.

problem is just that it technically has undefined behavior.
can anyone from the compiler team perhaps tell me how bad the UB is in this case? would it be fine to ship this?

You may need to enable higher levels of SSE support in the compiler, because it defaults to the lowest common denominator, e.g. in bash:

RUSTFLAGS="-C target-cpu=native" cargo build --release

or -C target-feature=+sse3,+avx.

no this is SSE2, which is on by default.

my post, which is currently held by the system because it thinks it's spam, has my latest insights.

namely that the high lanes confuse llvm. and i've found a workaround using MaybeUninit, but that technically has UB.

You can always get rustc to do register allocation for you with the inline asm syntax, by specifying {some_register} in the string and using some_register = out(xmm_reg) _ (for an out register, that is). See "Inline assembly" in The Rust Reference.
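For example, a quick sketch (untested; the f64s stand in for your register-passed F32x2, carrying two packed f32 lanes):

use core::arch::asm;

#[inline(always)]
pub fn add_ps_asm(a: f64, b: f64) -> f64 {
    let out: f64;
    unsafe {
        asm!(
            "addps {x}, {y}",
            x = inout(xmm_reg) a => out, // compiler picks the xmm registers
            y = in(xmm_reg) b,
            options(pure, nomem, nostack),
        );
    }
    out
}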


that's pretty cool, thanks for the tip!

but this really isn't a use case for inline asm. i'm just writing a vector abstraction. for F32x4 things are totally fine (i get exactly the instructions i want).
the issue is F32x2, because the high lanes get zeroed by _mm_load_sd, but i don't actually want that.

wait, never mind. i think it just lost that post somehow.

this code works, but has UB:

use core::arch::x86_64::*;
use core::mem::MaybeUninit;


#[derive(Clone, Copy, Debug)]
#[repr(transparent)]
pub struct B32x2 (f64);

#[derive(Clone, Copy, Debug)]
#[repr(transparent)]
pub struct F32x2 (f64);


impl F32x2 {
    #[inline(always)]
    unsafe fn load(self) -> __m128 {
        // write self into the low half and leave the upper half uninitialized,
        // so llvm isn't obliged to zero the high lanes
        let mut m: [MaybeUninit<F32x2>; 2] = [MaybeUninit::uninit(); 2];
        m[0].write(self);
        core::mem::transmute(m)
    }

    #[inline(always)]
    pub fn eq(self, other: F32x2) -> B32x2 {
        unsafe {
            let r = _mm_cmpeq_ps(self.load(), other.load());
            let r = _mm_castps_pd(r);
            let r = _mm_cvtsd_f64(r);
            B32x2(r)
        }
    }
}

impl core::ops::Add<F32x2> for F32x2 {
    type Output = F32x2;

    #[inline(always)]
    fn add(self, other: F32x2) -> F32x2 {
        unsafe {
            let r = _mm_add_ps(self.load(), other.load());
            let r = _mm_castps_pd(r);
            let r = _mm_cvtsd_f64(r);
            F32x2(r)
        }
    }
}

here's the example again:

#[inline(never)]
fn test(a: simd::F32x2, b: simd::F32x2) -> simd::B32x2 {
    (a + a).eq(b)
}

generates:

00007FF6F6051B50 0F 58 C0             addps       xmm0,xmm0  
00007FF6F6051B53 0F C2 C1 00          cmpeqps     xmm0,xmm1  
00007FF6F6051B57 C3                   ret

@compiler-team, is this ok?

probably not, because the uninit values could cause FP exceptions. i do mask the exceptions, but i guess UB lets the compiler do whatever it wants.
i'm just wondering to what extent that matters in this particular case.
while the UB list does say that producing an invalid value from uninitialized memory is UB, there isn't really an invalid bit pattern for f32.

Undefined values are useful because they indicate to the compiler that the program is well defined no matter what value is used. This gives the compiler more freedom to optimize.

this part from the LLVM Language Reference Manual sounds to me like doing this would be fine.
of course the rust compiler people can't guarantee that this will always work, because they might change the backend and whatever.
but realistically, can this code cause real problems?

actually, it's probably fine, because #![feature(portable_simd)] has the same UB.

this code: core::simd::f32x2::splat(1.0) / core::simd::f32x2::splat(1.0)
actually divides by zero in the padding lanes and causes a floating point exception (if not masked).

it isn't UB at the LLVM level though, because core::simd uses LLVM's f32x2.
i guess an f32x4 with the high lanes being undef is equivalent to an LLVM f32x2?

I believe I've found a workaround: you have to convince LLVM that the upper floats don't matter for the comparison, otherwise it zeroes them out (godbolt).

use std::arch::x86_64::{
    __m128, _mm_add_ps, _mm_castpd_ps, _mm_castps_pd, _mm_cmpeq_ps, _mm_cvtsd_f64,
    _mm_loadl_pd, _mm_undefined_pd,
};

#[derive(Clone, Copy)]
#[repr(transparent)]
pub struct B32x2(u64);

#[derive(Clone, Copy)]
#[repr(transparent)]
pub struct F32x2(f64);

impl F32x2 {
    #[inline(always)]
    fn load(&self) -> __m128 {
        // _mm_loadl_pd writes only the low half; _mm_undefined_pd tells LLVM
        // the upper half can be anything, so no zeroing movq is required
        unsafe { _mm_castpd_ps(_mm_loadl_pd(_mm_undefined_pd(), &self.0)) }
    }

    #[inline(always)]
    pub fn eq(self, other: F32x2) -> B32x2 {
        let (a, b) = (self.load(), other.load());

        unsafe {
            let r = _mm_cmpeq_ps(a, b);
            let r = _mm_castps_pd(r);
            let r = _mm_cvtsd_f64(r);
            B32x2(core::mem::transmute(r))
        }
    }
}

impl core::ops::Add<F32x2> for F32x2 {
    type Output = F32x2;

    #[inline(always)]
    fn add(self, other: F32x2) -> F32x2 {
        let (a, b) = (self.load(), other.load());

        unsafe {
            let r = _mm_add_ps(a, b);
            let r = _mm_castps_pd(r);
            let r = _mm_cvtsd_f64(r);
            F32x2(r)
        }
    }
}

pub fn add(a: F32x2, _b: F32x2) -> F32x2 {
    a + a
}

pub fn eq(a: F32x2, b: F32x2) -> B32x2 {
    a.eq(b)
}

pub fn add_eq(a: F32x2, b: F32x2) -> B32x2 {
    (a + a).eq(b)
}

example::add:
        addps   xmm0, xmm0
        ret

example::eq:
        cmpeqps xmm0, xmm1
        movq    rax, xmm0
        ret

example::add_eq:
        addps   xmm0, xmm0
        cmpeqps xmm0, xmm1
        movq    rax, xmm0
        ret

But this may also be UB due to the same FP exceptions. Honestly, I think this may just be unavoidable: the intrinsics (_mm_load_sd() specifically) are doing exactly what they're supposed to do. If you truly need control over each and every instruction generated, you pretty much have to use inline assembly, since otherwise the compiler is free to do with your code as it wills. In the grand scheme of things, two moves probably don't actually matter with regard to performance, so I'd honestly just stick this particular thing into a GitHub issue on your repo and work on the rest of your library. I find that I often fixate on little details of a library that don't really matter in the grander scope.


hey Kixiron, thanks for pointing out _mm_undefined_pd!
this looks like the "perfect" solution, because all the "UB" is at intrinsics/llvm level. _mm_undefined_pd clearly isn't an instruction, but seems to be a way to tell the compiler that certain values can be arbitrary.

sadly, wasm & aarch64 don't have an equivalent intrinsic. so i'll use my solution with MaybeUninit. that does suck a bit, because it is UB at rust level. and in rust, simply "producing" an uninitialized value is undefined behavior. i'll just have to rely on tests, which is a good idea anyway.

a few more things:

1

yes, they do: if you think about it, it's weird that there are only two, because every operation has the _mm_load_sd, so you'd expect to see four movqs: two for the add, two for the compare. and in fact that does happen for other intrinsics (like sqrtps). the reason it doesn't happen for _mm_add_ps is that it gets translated into llvm's f32x2 addition instruction. so any intrinsic that doesn't have an llvm equivalent introduces one move for every operand!

2
i don't think FP exceptions are actually part of the UB model, because if the _MM_EXCEPT_INEXACT exception isn't masked, most FP ops would trigger UB (eg: something as simple as 1.0 / 10.0).
using a MaybeUninit::uninit() value is undefined behavior, which is sad. i really wish it would just "poison" other values (in a sane way), such that you could do something like what i'm doing here "safely".

3
rust really doesn't seem to take "performance reliability" seriously, which is also sad.
using an SSE3 intrinsic without an SSE3 target doesn't cause a compile error, like you would expect, but rather just generates a call to a "polyfill"!
similarly, core::simd has all kinds of pitfalls: min handles NaNs by default, which you generally don't want when writing intrinsics-style code; at least provide a min_fast version that doesn't waste as many cycles (sketch below). to_int_unchecked::<u32> on x86 generates all kinds of extra instructions, because only to_int_unchecked::<i32> can be mapped to an instruction directly; on other platforms you'd probably not have that issue. calling that "unchecked" is very misleading.
these things cause real, measurable problems for portable high performance code.
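the kind of min_fast i mean is just a direct mapping to minps, something like this sketch:

use core::arch::x86_64::*;

// maps straight to minps; if either lane is NaN, minps returns the second
// operand for that lane, instead of doing extra work to propagate NaNs
#[inline(always)]
pub fn min_fast(a: __m128, b: __m128) -> __m128 {
    unsafe { _mm_min_ps(a, b) }
}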

i know you're not responsible for these problems, but i felt like complaining. maybe i should open a discussion on the rust internals forum.


oh, and:
4

no, inline assembly is not the right solution. i'm just trying to create a portable & "reliable" simd wrapper, which rust makes quite hard. reliable meaning: i know that +, min, to_int_unchecked, etc. all get mapped to exactly one instruction, and for every platform i know which one that is.
what i need is a simd abstraction so i don't have to duplicate my rendering code once for every platform - nothing crazy.

More specifically, Rust assumes the default FP exception state. In C terms, that means that #pragma STDC FENV_ACCESS is always OFF and cannot be set to ON.


While it's nightly-only right now, if you're trying to do that then you probably just want f32x2 from core::simd.
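e.g. (nightly only; instruction selection is of course still up to the backend):

#![feature(portable_simd)]
use core::simd::f32x2;

pub fn add(a: f32x2, b: f32x2) -> f32x2 {
    a + b
}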

come on, bro.
this is literally the first thing i wrote:

If you need precise control over which instructions get generated, you will need to use inline assembly. Anything else will let LLVM (or any other compiler backend) generate whatever it thinks is the most optimal code that behaves the same as what you wrote.
