Is `MaybeUninit::uninit().assume_init()` UB when used with intrinsics?

for context, i'm working on a simd abstraction (i can't use core::simd, cause it's unstable).

the issue is, x86 doesn't have 2-wide vectors.
so the question becomes: how do you get the 8 bytes of an F32x2 into a 16-byte __m128?
one way to do that is using _mm_load_sd.
the problem is, _mm_load_sd clears the upper half of the vector to zero.

and that's a problem, cause when chaining operations like a + b + c,
the temporary a + b is stored to memory and loaded again using _mm_load_sd.
the optimizer obviously gets rid of the memory loads/stores,
but it's not smart enough to get rid of the zeroing of the upper half of the vector,
even though the final store to memory doesn't use the upper half. (perhaps because it can affect flags or exceptions, idk)
in practice, this results in movq xmm{i}, xmm{i} littered throughout the code, which clears the upper half.
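the `_mm_load_sd` route described above can be sketched like this (x86_64 only; the function names `load_low`/`store_low`/`f32_add_load_sd` are illustrative, not from any API):

```rust
// a minimal sketch of the `_mm_load_sd` approach (x86_64 only);
// `load_low`/`store_low` are illustrative names, not std APIs
use core::arch::x86_64::{
    __m128, _mm_add_ps, _mm_castpd_ps, _mm_castps_pd, _mm_cvtsd_f64, _mm_load_sd,
};

fn load_low(a: [f32; 2]) -> __m128 {
    // reinterpret the 8 bytes as an f64 so `_mm_load_sd` can load them;
    // the load zeroes the upper half of the vector
    let bits: f64 = unsafe { std::mem::transmute(a) };
    unsafe { _mm_castpd_ps(_mm_load_sd(&bits)) }
}

fn store_low(v: __m128) -> [f32; 2] {
    // read back only the low 8 bytes
    unsafe { std::mem::transmute(_mm_cvtsd_f64(_mm_castps_pd(v))) }
}

pub fn f32_add_load_sd(a: [f32; 2], b: [f32; 2]) -> [f32; 2] {
    unsafe { store_low(_mm_add_ps(load_low(a), load_low(b))) }
}
```

this version is fully initialized (no UB), at the cost of the upper-half zeroing that the chained-ops codegen complains about.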

another way to get the F32x2 into the __m128 is using MaybeUninit + transmute:

use core::arch::x86_64::{__m128, _mm_add_ps, _mm_castps_pd, _mm_cvtsd_f64};

pub fn f32_add(a: [f32; 2], b: [f32; 2]) -> [f32; 2] { unsafe {
    let a = [a, std::mem::MaybeUninit::uninit().assume_init()];
    let a: __m128 = std::mem::transmute(a);
    let b = [b, std::mem::MaybeUninit::uninit().assume_init()];
    let b: __m128 = std::mem::transmute(b);
    let r = _mm_add_ps(a, b);
    std::mem::transmute(_mm_cvtsd_f64(_mm_castps_pd(r)))
}}

this works.


pub fn f32_add3(a: [f32; 2], b: [f32; 2], c: [f32; 2]) -> [f32; 2] {
    f32_add(f32_add(a, b), c)
}

/* generates
example::f32_add3:
        movq    xmm0, rdi
        movq    xmm1, rsi
        addps   xmm1, xmm0
        movq    xmm0, rdx
        addps   xmm0, xmm1     // no `movq xmm1, xmm1` before this add.
        movq    rax, xmm0
        ret
*/

now, the question is, is this kind of usage of MaybeUninit valid?
since i'm working with intrinsics directly, which aren't really defined by the compiler, maybe some of the UB constraints don't hold?
although, to be fair, i'm constructing an [f32; 2] using that assume_init().

here's a godbolt for more examples: Compiler Explorer

Yes, MaybeUninit::uninit().assume_init() is always UB if the type needs to be initialized (i.e., in practice it always is UB, unless the wrapped type is itself a MaybeUninit). It doesn't matter what function you pass the resulting invalid value into; the mere creation of such an uninitialized value (one that's pretending to be initialized) is already UB.

If you run your own code under Miri, it also flags it as UB.
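for contrast, here is the sound MaybeUninit pattern (fully write the value before calling assume_init); it's not the zero-cost load being discussed, just the usage that Miri accepts (a sketch, `init_then_read` is an illustrative name):

```rust
use std::mem::MaybeUninit;

// sound usage: every byte is written before `assume_init` is called
fn init_then_read() -> [f32; 2] {
    let mut buf = MaybeUninit::<[f32; 2]>::uninit();
    buf.write([1.0, 2.0]); // fully initializes the array
    unsafe { buf.assume_init() } // ok: no uninitialized bytes remain
}
```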


i see..
well, then i guess i have two more questions :^)

  1. is there some way to resolve this issue without UB? (on stable)

  2. does the UB matter? the generated code is correct 🤷
    i guess it may not matter, as the rust compiler itself doesn't do many optimizations, and afaik this shouldn't be UB at the llvm level, as the poison values aren't ever observed - i guess?

There's no such thing as "correct" code when there's UB in your source. If you have UB, then you have zero guarantees that the generated code will be "correct" for any other case. It may break violently or silently for your users. It may wipe out their hard drives. Just don't.

Did you enable optimizations at all? This compiles to the expected code for me, and is accepted by Miri.


You could also just use MaybeUninit normally, since you're transmuting anyway. I don't know enough about SIMD to know if that would be sound in practice, but it avoids the instant UB of MaybeUninit::uninit().assume_init()

Compiler Explorer

use core::arch::x86_64::{__m128, _mm_add_ps, _mm_castps_pd, _mm_cvtsd_f64};
use std::mem::MaybeUninit;

pub fn f32_add(a: [f32; 2], b: [f32; 2]) -> [f32; 2] {
    unsafe {
        let a = [MaybeUninit::new(a), MaybeUninit::uninit()];
        let a: __m128 = std::mem::transmute(a);
        let b = [MaybeUninit::new(b), MaybeUninit::uninit()];
        let b: __m128 = std::mem::transmute(b);
        let r = _mm_add_ps(a, b);
        std::mem::transmute(_mm_cvtsd_f64(_mm_castps_pd(r)))
    }
}

yes, i had optimizations on.

i've found some interesting things: Compiler Explorer
the f32_add function always generates the correct code, for some reason...

it breaks once there's an _mm_cmp*_ps (and probably in more cases).

what about this though?

// hey, we're not creating a type that needs to
// be initialized, right? :P
fn load_maybe_uninit(a: [f32; 2]) -> __m128 { unsafe {
    let a = [MaybeUninit::new(a), MaybeUninit::uninit()];
    std::mem::transmute(a)
}}

hehe, i just found that one too :D
it should avoid the language-level ub, i think.
at the llvm level this should be fine.

i can't run x86 miri right now. would be interesting to see what it thinks of that transmute.

I would consider @H2CO3's answer much safer, and since it generates the same assembly there's not much reason to not go with it IMO.


sadly it doesn't work :L
you can check my previous response (Compiler Explorer)

for

pub fn f32_add_eq(a: [f32; 2], b: [f32; 2], c: [f32; 2]) -> [u32; 2] {
    f32_eq(f32_add(a, b), c)
}
it generates

example::f32_add_eq:
        movq    xmm0, rdx
        movq    xmm1, rdi
        movq    xmm2, rsi
        addps   xmm2, xmm1
        movq    xmm1, xmm2      ; bad
        movq    xmm0, xmm0      ; bad
        cmpeqps xmm0, xmm1
        movq    rax, xmm0
        ret

vs with that MaybeUninit trick

example::f32_add_eq:
        movq    xmm0, rdi
        movq    xmm1, rsi
        addps   xmm1, xmm0
        movq    xmm0, rdx
        cmpeqps xmm0, xmm1
        movq    rax, xmm0
        ret

Yes, SIMD vector types require initialized memory, so MaybeUninit::uninit().assume_init() is UB with them.

Intrinsics like _mm256_undefined_ps might say "undefined", but what they mean by that is "initialized but no promise on the exact value". If you look at the codegen for that intrinsic in clang (https://cpp.godbolt.org/z/WzzzeWvf6), you'll see it actually just zero-initializes the register. (Even though LLVM has undef, clang intentionally doesn't use that.)


This is 100% absolutely certainly instant UB. Primitive types and arrays of primitive types definitely cannot be uninitialized.

(LLVM might not be able to always take advantage of that right now, so sometimes it might appear to "work", but that'll soon stop being true.)


Intrinsics like _mm256_undefined_ps might say "undefined", but what they mean by that is "initialized but no promise on the exact value".

yup, this indeed doesn't work either:

fn load_undef(a: [f32; 2]) -> __m128 { unsafe {
    let a = [a, Default::default()];
    let a = std::mem::transmute(a);
    _mm_shuffle_ps::<{_MM_SHUFFLE(0, 1, 0, 1)}>(a, _mm_undefined_ps())
}}

in that GH issue you sent, it said

If the value representation contains any undefined or poison bits, the behavior is undefined.

but the way i'm using the vectors, those undef bits in the upper half of the vector are never returned.
the implementations look something like this (well, this would be the exact function):

pub fn f32_add(a: [f32; 2], b: [f32; 2]) -> [f32; 2] {
    unsafe {
        // written like this, *not* using an `#[inline]` function `load`.
        let a = [MaybeUninit::new(a), MaybeUninit::uninit()];
        let a = std::mem::transmute(a);
        let b = [MaybeUninit::new(b), MaybeUninit::uninit()];
        let b = std::mem::transmute(b);

        let r = _mm_add_ps(a, b);
        // assuming llvm keeps track of undef lane wise,
        // we only have undef in the top 2 lanes.
        // - well yes, that definition said "bits", so it should be bit-wise.
        //  which implies lane wise here, as it's doing f32 adds.

        // returns only the bottom two lanes.
        // - or maybe using the same `[[f32; 2]; 2]` transmute as above.
        //   wait no, that would be ub again :D
        //   cause it's reading those top 2 lanes & initializing f32s with them.
        return std::mem::transmute(_mm_cvtsd_f64(_mm_castps_pd(r)));
    }
}

shouldn't this mean, as long as i don't do something like horizontal add, there should be no UB?

The linked issue is an LLVM issue talking about a specific optimization it should apply, not documentation about the general rules.

Rust doesn't care whether the bits are returned or not. You made the instance of the type exist in a way that violated its validity invariant, which is instant UB. (Same as transmute::<u8, bool>(3) is instant UB.) See also "Producing invalid values" under https://doc.rust-lang.org/nomicon/what-unsafe-does.html.


This is what the reference says about uninitialized values:

  • Producing an invalid value, even in private fields and locals. "Producing" a value happens any time a value is assigned to or read from a place, passed to a function/primitive operation or returned from a function/primitive operation.

Your uninitialized f32 is 1) returned from assume_init, 2) assigned to a local, and 3) passed to another function. Any one of those would be UB. Basically, any time the value is treated as an f32, it's UB.

Calling transmute(MaybeUninit::uninit()) is the same as MaybeUninit::uninit().assume_init(), so there's no point in doing that.


You made the instance of the type exist in a way that violated its validity invariant, which is instant UB.

to note, that transmute is creating an instance of __m128.
but i suppose the relevant sentence is:

Note that unlike __m128i, the integer version of the 128-bit registers, this __m128 type has one interpretation. Each instance of __m128 always corresponds to f32x4, or four f32 types packed together.

from __m128 in core::arch::x86 - Rust

ie i'm creating two floats from undef.

now, this isn't really the place to discuss this. but i wonder, what's the point of making that instant UB, instead of "UB only if such a value is observed"? this seems needlessly restrictive 🤷

Yeah, it's literally pub struct __m128(f32, f32, f32, f32); with no apparent special behavior, so it's the same as transmuting it into [f32; 4].

dang.
alright, thanks everyone! (also @scottmcm, @semicoleon, and @H2CO3)

i think i'll just use a scalar impl on x86 and hope the vectorizer does its job.
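(a possible UB-free alternative, not discussed in the thread: broadcast the 8 bytes into both halves of the register with `_mm_set1_pd`, so every lane is initialized and the upper half merely computes a redundant copy of the lower half. whether the optimizer then drops the redundant work would need checking on godbolt. `load_dup`/`f32_add_dup` are illustrative names.)

```rust
use core::arch::x86_64::{
    __m128, _mm_add_ps, _mm_castpd_ps, _mm_castps_pd, _mm_cvtsd_f64, _mm_set1_pd,
};

// broadcast-based load: both halves hold the same 8 bytes, so the
// vector is fully initialized (no undef/poison lanes anywhere)
fn load_dup(a: [f32; 2]) -> __m128 {
    let bits: f64 = unsafe { std::mem::transmute(a) };
    unsafe { _mm_castpd_ps(_mm_set1_pd(bits)) }
}

pub fn f32_add_dup(a: [f32; 2], b: [f32; 2]) -> [f32; 2] {
    unsafe {
        // the upper lanes compute a duplicate of the lower lanes; harmless
        let r = _mm_add_ps(load_dup(a), load_dup(b));
        std::mem::transmute(_mm_cvtsd_f64(_mm_castps_pd(r)))
    }
}
```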

Why not write straightforward "scalar" code? The optimized result is effectively the assembly you want: Compiler Explorer

UPD: Ah, I see it was already discussed.


from my testing on aarch64, scalar code isn't reliable, and the compiler often does not choose the right instructions.

for example,

    #[inline(always)]
    fn b32_select_u32(mask: [B32; 2], on_true: [u32; 2], on_false: [u32; 2]) -> [u32; 2] {
        [mask[0].as_u32()&on_true[0] | !mask[0].as_u32()&on_false[0],
         mask[1].as_u32()&on_true[1] | !mask[1].as_u32()&on_false[1]]
    }

generates vector bit ops, but doesn't use vbsl; instead it uses vorr and vand.
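for reference, a self-contained version of the select above, assuming `B32` is a newtype over `u32` holding all-ones for true and all-zeros for false (that assumption is mine, not stated in the thread; vbsl on aarch64 implements exactly this bit-select):

```rust
// plain-u32 version of the select; `B32` is assumed to be an
// all-ones/all-zeros u32 mask (hypothetical stand-in for the OP's type)
fn select_u32(mask: [u32; 2], on_true: [u32; 2], on_false: [u32; 2]) -> [u32; 2] {
    [(mask[0] & on_true[0]) | (!mask[0] & on_false[0]),
     (mask[1] & on_true[1]) | (!mask[1] & on_false[1])]
}
```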

Remember that's in a macro. It adds #[repr(simd)] (https://doc.rust-lang.org/1.71.0/src/core/stdarch/crates/core_arch/src/macros.rs.html#60), which is pretty magical.

Well, you're passing it to a function, which is one place where it's absolutely essential to have validity met.

(And, in general, trying to define "observed" differently from "always" is a huge rabbit hole that comes out just not being useful. We never want to have to say "no, it's UB now because you extracted part of it into a function", for example.)
