for context, i'm working on a simd abstraction (i can't use `core::simd`, cause it's unstable). the issue is, x86 doesn't have 2-wide vectors. so the question becomes: how do you get the 8 bytes of an `F32x2` into a 16-byte `__m128`?
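(by `F32x2` i just mean, roughly, a newtype over `[f32; 2]`; the real wrapper has more to it, and the examples below just use `[f32; 2]` directly:)

```rust
/// 8 bytes of payload, but the narrowest x86 vector register (`__m128`) is 16 bytes.
#[derive(Clone, Copy)]
#[repr(transparent)]
pub struct F32x2(pub [f32; 2]);
```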
one way to do that is using `_mm_load_sd`. the problem is, `_mm_load_sd` clears the upper half of the vector to zero. and that's a problem, cause when chaining operations like `a + b + c`, the temporary `a + b` is stored to memory and loaded again using `_mm_load_sd`. the optimizer obviously gets rid of the memory loads/stores, but it's not smart enough to get rid of the zeroing of the upper half of the vector, even though the final store to memory doesn't use the upper half (perhaps because it can affect flags or exceptions, idk). in practice, this results in `movq xmm{i}, xmm{i}` littered throughout the code, which clears the upper half.
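for reference, the `_mm_load_sd` version i'm talking about looks roughly like this (just a sketch, the name `f32_add_load_sd` is made up; the explicit store/load pair is what the optimizer is supposed to clean up when these calls get chained):

```rust
use std::arch::x86_64::*;

pub fn f32_add_load_sd(a: [f32; 2], b: [f32; 2]) -> [f32; 2] { unsafe {
    // load 8 bytes into the low half of an xmm register; the upper half gets zeroed
    let a = _mm_castpd_ps(_mm_load_sd(a.as_ptr() as *const f64));
    let b = _mm_castpd_ps(_mm_load_sd(b.as_ptr() as *const f64));
    let r = _mm_add_ps(a, b);
    // store only the low 8 bytes back out
    let mut out = [0.0f32; 2];
    _mm_store_sd(out.as_mut_ptr() as *mut f64, _mm_castps_pd(r));
    out
}}
```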
another way to get the `F32x2` into the `__m128` is using `MaybeUninit` + `transmute`:
```rust
use std::arch::x86_64::*;

pub fn f32_add(a: [f32; 2], b: [f32; 2]) -> [f32; 2] { unsafe {
    // pad each [f32; 2] out to 16 bytes with an uninitialized second half
    let a = [a, std::mem::MaybeUninit::uninit().assume_init()];
    let a: __m128 = std::mem::transmute(a);
    let b = [b, std::mem::MaybeUninit::uninit().assume_init()];
    let b: __m128 = std::mem::transmute(b);
    let r = _mm_add_ps(a, b);
    // read back only the low 8 bytes of the result
    std::mem::transmute(_mm_cvtsd_f64(_mm_castps_pd(r)))
}}
```
this works.
```rust
pub fn f32_add3(a: [f32; 2], b: [f32; 2], c: [f32; 2]) -> [f32; 2] {
    f32_add(f32_add(a, b), c)
}

/* generates
example::f32_add3:
        movq    xmm0, rdi
        movq    xmm1, rsi
        addps   xmm1, xmm0
        movq    xmm0, rdx
        addps   xmm0, xmm1    // no `movq xmm1, xmm1` before this add.
        movq    rax, xmm0
        ret
*/
```
now, the question is: is this kind of usage of `MaybeUninit` valid? since i'm working with intrinsics directly, which aren't really defined by the compiler, maybe some of the UB constraints don't hold? although, to be fair, i'm constructing an `[f32; 2]` using that `assume_init()`.