Unnecessary performance penalty for mem::MaybeUninit?

I agree that you misunderstood the term used here. (For the other terms you mention, though, it seems to me that you understood them correctly.) I can give some more details: the term "duplication" here is meant in the sense that if a value is duplicated, you can end up running its destructor twice, because semantically you now have the object twice. For example, you can duplicate objects via the unsafe std::ptr::read function like this:

fn duplicate(s: String) -> (String, String) {
    let s2 = unsafe { std::ptr::read(&s) };
    (s, s2)
}

fn main() {
    let my_string = "foo".to_string();
    
    let (s1, s2) = duplicate(my_string);
    
    drop(s1);
    drop(s2); // OH NO! double free! BOOM
}

What the documentation was talking about is the fact that you can write an implementation of my duplicate function with ManuallyDrop::take, but you cannot write an implementation of it using only ManuallyDrop::into_inner.
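
For concreteness, here is a minimal sketch of that claim (the function name duplicate_with_take is mine, not from the thread). ManuallyDrop::take only needs a &mut ManuallyDrop<String>, so after taking the value out, the wrapper is still around and into_inner hands back the "same" value a second time:

use std::mem::ManuallyDrop;

// Just as unsound as the ptr::read version above: dropping both returned
// Strings is a double free.
fn duplicate_with_take(s: String) -> (String, String) {
    let mut slot = ManuallyDrop::new(s);
    let s2 = unsafe { ManuallyDrop::take(&mut slot) }; // value taken out, wrapper still usable
    let s1 = ManuallyDrop::into_inner(slot);           // "same" value handed out again
    (s1, s2)
}

With ManuallyDrop::into_inner alone this is impossible, because into_inner consumes the wrapper by value, so you can only ever get one String back out of it.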

Now, on the other hand, the program I listed might actually make three spaces for a String on the stack: one for my_string, one for s1, and one for s2. However, even though there are three spaces on the stack that might store a string, the string was only duplicated once, not twice. The my_string space will be considered "stale" or "unused" or "uninitialized" once you move the value out of that space. (Obviously, we would want the optimizer to not do it like this, but there's no rule that it can't.)

Put another way, Rust makes a distinction between a value (some instantiation of a type T) and its representation (the particular bytes used to store a T in memory). Even though some bytes may be copied during a move operation, the old bytes are invalidated and no longer considered to be a value of type T, thereby preventing duplication of the value: there is only one instance before the move, and still only one after it.
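
As a minimal illustration (this snippet is mine, not from the post above): after a move, the compiler statically rejects any use of the old binding, even though its bytes may still physically sit in the old stack slot.

fn main() {
    let a = String::from("hello");
    let b = a; // the value moves into `b`; `a`'s bytes are now just stale storage
    // println!("{a}"); // error[E0382]: borrow of moved value: `a`
    println!("{b}"); // still exactly one String value
}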

The number of stack-to-stack memcpys is arguably not observable behavior (in the compiler sense), so it's not strictly a bug. For example, there are even more of them in debug mode, and that's fine -- maybe even good, for debuggability -- so it's not a violation of Rust's semantics. And thus it's either a quality-of-implementation issue or a feature request for additional guaranteed optimizations (though those are hard).

As an analogy, it's also not incorrect for Rust to fail to inline a function marked #[inline(always)]. Certainly a compiler that never inlines such functions is one that you might not want to use, but it's not illegal for the compiler to behave like that. A language can't feasibly promise anything about the performance of code generated by implementations -- and probably doesn't want to anyway, as doing so would plausibly prevent useful things like alternative backends.

Unfortunately, if we guarantee that stack locals do not have overlapping addresses, the presence of stack memcpy is observable and guaranteed behavior.

It's for this reason I think we might need to end up with both move-deinit and overlapping-stack-alloca[1] semantics. Without a rule allowing two stack locals to share an address, the optimizer has to prove that you never observe at least one of the two addresses before it can merge them, so that you can't tell that the two stack allocations overlap.
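
For concreteness, here is a sketch (names mine; no claim about what current compilers actually emit) of a program whose result hinges on whether the two locals are allowed to share a slot:

fn can_observe_overlap() -> bool {
    let a = Box::new(0u32);
    let p = &a as *const Box<u32> as usize; // ptr2int: `a`'s address escapes provenance
    let b = a;                              // move: `a`'s slot is deinitialized here
    let q = &b as *const Box<u32> as usize; // ptr2int: `b`'s address escapes too
    // Under "every local gets its own address" this must be false; under
    // overlapping-stack-alloca semantics the compiler could legally make it true.
    p == q
}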

There's all sorts of thought around "allocation planes" and how we actually allow stack allocation to be formally removable anyway, if stack exhaustion is an observable effect. (The way current compilers justify it is either by saying memory exhaustion is a nonobservable condition external to the Virtual Machine, or just handwavy vibes along with the rest of the pointer provenance story.)

It's not (just) that the stack locals are (potentially) guaranteed to live at different addresses, though; an altered version of @afetisov's example proves that. With the inlining, LLVM should be able to show fairly trivially that the address of the first ManuallyDrop is never observed (there's no ptr2int involved that could leak the address outside of provenance, so the optimizer can be sure it sees all uses of the first alloca).

ManuallyDrop or any other wrapper isn't even involved in this missed optimization; even just copying an array through a temporary binding introduces the extra memcpy. Here's another example of the same.
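
(The linked example isn't reproduced above, but a minimal sketch of the pattern it describes -- an array copied through a temporary binding -- might look like the following; whether the copies collapse is exactly the question at hand.)

pub fn copy_through_temporary(input: [u8; 1024]) -> [u8; 1024] {
    let tmp = input; // temporary binding
    tmp // ideally `input`, `tmp`, and the return slot all collapse into one allocation
}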

I believe the missed optimization is due to the array initialization. Consider this example:

// Assumed declaration (not shown in the original snippet): an opaque external
// function whose body the optimizer cannot see, so passing `&b` to it is an
// observable use of `b`'s address.
extern "Rust" {
    fn print_slice(s: &[u16]);
}

pub unsafe fn no_memcpy() {
    let a = [0x0101_u16; 1000]; // both bytes of each element are 0x01
    let b = a;
    print_slice(&b);
}

pub unsafe fn yes_memcpy() {
    let a = [0x0102_u16; 1000]; // bytes differ, so this can't be lowered to a plain memset
    let b = a;
    print_slice(&b);
}

The former initializes the array with a memset and contains no memcpy. The latter initializes the array with a loop and then memcpys it to a new location. So IIUC, what's going on is that the array initialization causes LLVM to be unwilling (for whatever reason) to optimize out the initial stack allocation, instead keeping it at a separate address from the one we do observe (by passing it to the opaque, unknown output function).

Here's the exact same missed optimization in C/C++[2]. If someone wants to report this upstream to LLVM (or find an open issue covering it), here's the unoptimized LLIR clang produces from that C/C++ example:

define dso_local void @_Z10yes_memcpyv() #0 !dbg !188 {
  %1 = alloca [1000 x i16], align 16
  %2 = alloca i32, align 4
  %3 = alloca [1000 x i16], align 16
  call void @llvm.dbg.declare(metadata ptr %1, metadata !193, metadata !DIExpression()), !dbg !197
  call void @llvm.dbg.declare(metadata ptr %2, metadata !198, metadata !DIExpression()), !dbg !200
  store i32 0, ptr %2, align 4, !dbg !200
  br label %4, !dbg !201

4:                                                ; preds = %11, %0
  %5 = load i32, ptr %2, align 4, !dbg !202
  %6 = icmp slt i32 %5, 1000, !dbg !204
  br i1 %6, label %7, label %14, !dbg !205

7:                                                ; preds = %4
  %8 = load i32, ptr %2, align 4, !dbg !206
  %9 = sext i32 %8 to i64, !dbg !207
  %10 = getelementptr inbounds [1000 x i16], ptr %1, i64 0, i64 %9, !dbg !207
  store i16 258, ptr %10, align 2, !dbg !208
  br label %11, !dbg !207

11:                                               ; preds = %7
  %12 = load i32, ptr %2, align 4, !dbg !209
  %13 = add nsw i32 %12, 1, !dbg !209
  store i32 %13, ptr %2, align 4, !dbg !209
  br label %4, !dbg !210, !llvm.loop !211

14:                                               ; preds = %4
  call void @llvm.dbg.declare(metadata ptr %3, metadata !214, metadata !DIExpression()), !dbg !215
  %15 = getelementptr inbounds [1000 x i16], ptr %3, i64 0, i64 0, !dbg !216
  %16 = getelementptr inbounds [1000 x i16], ptr %1, i64 0, i64 0, !dbg !216
  call void @llvm.memcpy.p0.p0.i64(ptr align 16 %15, ptr align 16 %16, i64 1000, i1 false), !dbg !216
  %17 = getelementptr inbounds [1000 x i16], ptr %3, i64 0, i64 0, !dbg !217
  call void @_Z9print_ptrPKt(ptr noundef %17), !dbg !218
  ret void, !dbg !219
}

define dso_local void @_Z9no_memcpyv() #0 !dbg !220 {
  %1 = alloca [1000 x i16], align 16
  %2 = alloca i32, align 4
  %3 = alloca [1000 x i16], align 16
  call void @llvm.dbg.declare(metadata ptr %1, metadata !221, metadata !DIExpression()), !dbg !222
  call void @llvm.dbg.declare(metadata ptr %2, metadata !223, metadata !DIExpression()), !dbg !225
  store i32 0, ptr %2, align 4, !dbg !225
  br label %4, !dbg !226

4:                                                ; preds = %11, %0
  %5 = load i32, ptr %2, align 4, !dbg !227
  %6 = icmp slt i32 %5, 1000, !dbg !229
  br i1 %6, label %7, label %14, !dbg !230

7:                                                ; preds = %4
  %8 = load i32, ptr %2, align 4, !dbg !231
  %9 = sext i32 %8 to i64, !dbg !232
  %10 = getelementptr inbounds [1000 x i16], ptr %1, i64 0, i64 %9, !dbg !232
  store i16 257, ptr %10, align 2, !dbg !233
  br label %11, !dbg !232

11:                                               ; preds = %7
  %12 = load i32, ptr %2, align 4, !dbg !234
  %13 = add nsw i32 %12, 1, !dbg !234
  store i32 %13, ptr %2, align 4, !dbg !234
  br label %4, !dbg !235, !llvm.loop !236

14:                                               ; preds = %4
  call void @llvm.dbg.declare(metadata ptr %3, metadata !238, metadata !DIExpression()), !dbg !239
  %15 = getelementptr inbounds [1000 x i16], ptr %3, i64 0, i64 0, !dbg !240
  %16 = getelementptr inbounds [1000 x i16], ptr %1, i64 0, i64 0, !dbg !240
  call void @llvm.memcpy.p0.p0.i64(ptr align 16 %15, ptr align 16 %16, i64 1000, i1 false), !dbg !240
  %17 = getelementptr inbounds [1000 x i16], ptr %3, i64 0, i64 0, !dbg !241
  call void @_Z9print_ptrPKt(ptr noundef %17), !dbg !242
  ret void, !dbg !243
}

Ignoring metadata, the only difference between the two in the initial IR does seem to be store i16 257, ptr %10, align 2 versus store i16 258, ptr %10, align 2.


  1. More precisely, that a stack allocation may overlap other stack allocations in the same frame (and this overlap may be observed) if, at any given time, only one of the overlapping allocations is initialized. (It might be better to talk about "liveness" instead -- initialization as tracked by the borrow checker -- so that a MaybeUninit binding can be "live" while "initialized" to uninit, and thus guaranteed not to overlap other stack allocations.) But it's also useful to be able to kill the storage of early-droppable types, so it's complicated. ↩︎

  2. "There's no such language as C/C++!" Yes, but it's a convenient shorthand for "C-compatible code ran through a C++ compiler." ↩︎

@CAD97: Thank you very much for this detailed analysis! So the only possibility is to work with the solution of @alice (which in C++ terms is like a reinterpret_cast).

You may be interested in following @pcwalton's work at https://arewestackefficientyet.com/ -- there is ongoing work to teach the Rust compiler stack to avoid copying when it's not needed.

Speaking of which, @pcwalton, have any of the improvements landed yet? It might be time for an updated measurement.
