i'm trying to implement a simd wrapper, because core::simd doesn't work for me. i need precise control over which specific instructions get generated. i can't have min generate five instructions; it has to be minps and minps only.
the problem is: code-gen is pretty bad for my F32x2. F32x2 is obviously only 8 bytes, not an sse vector, meaning i need to load & store to perform an operation.
use core::arch::x86_64::*;

// mask type: two 32-bit lanes stored in a u64
#[derive(Clone, Copy, Debug)]
#[repr(transparent)]
pub struct B32x2(u64);

// two f32 lanes stored in the 8 bytes of an f64
#[derive(Clone, Copy, Debug)]
#[repr(transparent)]
pub struct F32x2(f64);

impl F32x2 {
    #[inline(always)]
    pub fn eq(self, other: F32x2) -> B32x2 {
        unsafe {
            // load the 8 bytes into the low half of an xmm register
            let a = _mm_castpd_ps(_mm_load_sd(&self.0));
            let b = _mm_castpd_ps(_mm_load_sd(&other.0));
            let r = _mm_cmpeq_ps(a, b);
            let r = _mm_castps_pd(r);
            let r = _mm_cvtsd_f64(r);
            B32x2(core::mem::transmute(r))
        }
    }
}
impl core::ops::Add<F32x2> for F32x2 {
    type Output = F32x2;

    #[inline(always)]
    fn add(self, other: F32x2) -> F32x2 {
        unsafe {
            let a = _mm_castpd_ps(_mm_load_sd(&self.0));
            let b = _mm_castpd_ps(_mm_load_sd(&other.0));
            let r = _mm_add_ps(a, b);
            let r = _mm_castps_pd(r);
            let r = _mm_cvtsd_f64(r);
            F32x2(r)
        }
    }
}
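for reference, here is a minimal sketch of how the two f32 lanes sit inside the 8-byte container, and of the add above running on them. from_lanes and to_lanes are hypothetical helpers introduced only for this illustration (they are not part of the wrapper shown above), and the lane order assumes little-endian x86_64.

```rust
use core::arch::x86_64::*;

#[derive(Clone, Copy, Debug)]
#[repr(transparent)]
pub struct F32x2(f64);

impl F32x2 {
    /// hypothetical helper: pack two f32 lanes into the 8-byte container
    /// (lane 0 in the low 32 bits, lane 1 in the high 32 bits).
    pub fn from_lanes(x: f32, y: f32) -> F32x2 {
        let bits = (x.to_bits() as u64) | ((y.to_bits() as u64) << 32);
        F32x2(f64::from_bits(bits))
    }

    /// hypothetical helper: unpack the two f32 lanes again.
    pub fn to_lanes(self) -> [f32; 2] {
        let bits = self.0.to_bits();
        [f32::from_bits(bits as u32), f32::from_bits((bits >> 32) as u32)]
    }
}

impl core::ops::Add<F32x2> for F32x2 {
    type Output = F32x2;

    #[inline(always)]
    fn add(self, other: F32x2) -> F32x2 {
        unsafe {
            // same load / addps / extract sequence as in the question
            let a = _mm_castpd_ps(_mm_load_sd(&self.0));
            let b = _mm_castpd_ps(_mm_load_sd(&other.0));
            F32x2(_mm_cvtsd_f64(_mm_castps_pd(_mm_add_ps(a, b))))
        }
    }
}
```

with these helpers, adding from_lanes(1.0, 2.0) and from_lanes(3.0, 4.0) and calling to_lanes() on the result gives [4.0, 6.0], and core::mem::size_of::<F32x2>() is 8.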
sadly llvm gets confused and generates unnecessary moves:
#[inline(never)]
fn test(a: simd::F32x2, b: simd::F32x2) -> simd::B32x2 {
    (a + a).eq(b)
}
generates:
00007FF73AC31B50 0F 58 C0 addps xmm0,xmm0
00007FF73AC31B53 F3 0F 7E C0 movq xmm0,xmm0 ; nop
00007FF73AC31B57 F3 0F 7E C9 movq xmm1,xmm1 ; nop
00007FF73AC31B5B 0F C2 C8 00 cmpeqps xmm1,xmm0
00007FF73AC31B5F 66 48 0F 7E C8 movq rax,xmm1
00007FF73AC31B64 C3 ret
a simple dead code elim pass would solve the issue, but i can't really change the compiler.
so, any ideas for what i could do?
i have tried repr(simd) and some other stuff (like removing the transmute and instead storing an i64 and using castps_si128 with cvtsi128_si64); none of that worked. i'm pretty sure the problem comes from the _mm_load_sd instructions.
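for reference, the i64 variant mentioned above could look roughly like the sketch below. this is a reconstruction under assumptions, not the exact code that was tried: it assumes the storage is changed to i64, that _mm_cvtsi64_si128 is used to get the value into a register, and that _mm_castps_si128 with _mm_cvtsi128_si64 is used to get it back out. new is a hypothetical helper for building test values.

```rust
use core::arch::x86_64::*;

#[derive(Clone, Copy, Debug)]
#[repr(transparent)]
pub struct B32x2(pub i64);

#[derive(Clone, Copy, Debug)]
#[repr(transparent)]
pub struct F32x2(pub i64);

impl F32x2 {
    /// hypothetical helper: pack two f32 lanes into the i64 storage.
    pub fn new(x: f32, y: f32) -> F32x2 {
        let bits = (x.to_bits() as u64) | ((y.to_bits() as u64) << 32);
        F32x2(bits as i64)
    }

    #[inline(always)]
    pub fn eq(self, other: F32x2) -> B32x2 {
        unsafe {
            // integer -> xmm (assumed intrinsic choice), then reinterpret as 4 x f32
            let a = _mm_castsi128_ps(_mm_cvtsi64_si128(self.0));
            let b = _mm_castsi128_ps(_mm_cvtsi64_si128(other.0));
            let r = _mm_cmpeq_ps(a, b);
            // reinterpret the mask as integers and extract the low 64 bits
            B32x2(_mm_cvtsi128_si64(_mm_castps_si128(r)))
        }
    }
}
```

comparing new(1.0, 2.0) with new(1.0, 3.0) sets only the low lane of the mask, so the result's .0 is 0xFFFF_FFFF. the sketch only shows the shape of the attempt; it says nothing about which instructions a given compiler version emits for it.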
and please remember, i need precise control; core::simd is not an option. overriding only those functions that don't do what i want (e.g. min) isn't an option either, because 1) i get the same problem with the redundant movqs as i've outlined above for my F32x2, and 2) there is no guarantee that newer versions of the compiler will generate the same code for those functions that don't currently cause problems.