Performance of large uninitialized arrays on the stack

Hi all,

pub const MAX_LEN: usize = 20000000;

fn test_function() {
    ...
    let mut buffer_uninit: [MaybeUninit<u8>; MAX_LEN] = MaybeUninit::uninit_array();

    let mut buffer: &mut [u8] = unsafe {
        let ptr = buffer_uninit.as_mut_ptr() as *mut u8;
        std::slice::from_raw_parts_mut(ptr, buffer_uninit.len())
    };

    ...
}

fn main() {
    ...
    for _i in 0..1_000_000 {
        test_function();
    }
    ...
}

The code above runs extremely slowly when compiled in debug mode, but fast when compiled with optimizations.

What is the proper way to get a large uninitialized array on the stack?

(I know all about stack overflow and other issues.) The question is about the proper way of getting uninitialized arrays on the stack in general.

Can you please post a complete example?

Hi,

#![feature(maybe_uninit_uninit_array)]

use std::mem::MaybeUninit;
use std::time::Instant;

pub const MAX_LEN: usize = 1024 * 1024;

// pub const MAX_LEN: usize = 1024;

fn test_function() {
    let mut buffer_uninit: [MaybeUninit<u8>; MAX_LEN] = MaybeUninit::uninit_array();

    let mut buffer: &mut [u8] = unsafe {
        let ptr = buffer_uninit.as_mut_ptr() as *mut u8;
        std::slice::from_raw_parts_mut(ptr, buffer_uninit.len())
    };

    buffer[10] = 5;
}

fn main() {
    let start_timer = Instant::now();

    for _i in 0..100_000 {
        test_function();
    }

    let stop_timer = Instant::now();
    let duration = stop_timer.checked_duration_since(start_timer).unwrap();

    println!("\nDuration: '{}' sec(s), '{:0>9}' nanosecond(s)\n", duration.as_secs(), duration.subsec_nanos());
}

Compiled in DEBUG:

Duration: '6' sec(s), '725928060' nanosecond(s)

Compiled in RELEASE:

Duration: '0' sec(s), '000000070' nanosecond(s)

This is an example I put together to show the mysterious things that happen with uninitialized arrays.

What is the proper way to get an uninitialized array in Rust?

The release build likely optimizes out the whole thing because it doesn't do anything externally observable.

That's likely it.

By the way, your test function has UB. If your array elements are not all initialized, then you must not create a slice with the initialized type. References must always point to valid (initialized) values.
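For example, a sound version keeps the slice typed as &mut [MaybeUninit<u8>] until the elements are actually written. A minimal sketch on stable Rust (demo and LEN are just illustrative names; MaybeUninit::uninit().assume_init() on an array *of MaybeUninit* is the documented stable replacement for the unstable uninit_array()):

```rust
use std::mem::MaybeUninit;

const LEN: usize = 1024;

fn demo() -> u8 {
    // An array of MaybeUninit<u8> needs no initialization, so this is sound.
    let mut buffer_uninit: [MaybeUninit<u8>; LEN] =
        unsafe { MaybeUninit::uninit().assume_init() };

    // Work through &mut [MaybeUninit<u8>], not &mut [u8]: no element is
    // claimed to be initialized until it actually is.
    let buffer: &mut [MaybeUninit<u8>] = &mut buffer_uninit;
    buffer[10].write(5);

    // Only an element that has been written may be read back as u8.
    unsafe { buffer[10].assume_init() }
}

fn main() {
    println!("{}", demo()); // prints 5
}
```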


Thank you for answers. I have updated function to:

fn test_function() {
    let mut buffer_uninit: [MaybeUninit<u8>; MAX_LEN] = MaybeUninit::uninit_array();

    let mut buffer: &mut [u8] = unsafe {
        let ptr = buffer_uninit.as_mut_ptr() as *mut u8;
        std::slice::from_raw_parts_mut(ptr, buffer_uninit.len())
    };

    // rand() here is the C library function, declared via
    // `extern "C" { fn rand() -> i32; }`, hence the unsafe block.
    unsafe {
        for i in 0..buffer.len() {
            buffer[i] = rand() as u8;
        }
    }
}

This is a quick "load" to make the function do something observable, in addition to initializing the array.

But the question still stands: why does it take seconds (!) (or even minutes) in DEBUG mode, and nanoseconds in RELEASE mode? When the buffer length (MAX_LEN) is small enough it works quickly, but with large values it takes seconds just to get an uninitialized array?

Your function still doesn't do anything observable, since it doesn't return anything or access any memory outside the function (through a reference passed in, for example). It also still has undefined behaviour because it still creates a &mut [u8] pointing to uninitialised memory.

The correct way to work with uninitialised memory is to initialise it using write() before converting it to some initialised type.
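A sketch of what that could look like for the test function above, on stable Rust (fill_buffer and the checksum are illustrative; returning a value gives the optimizer something it cannot delete):

```rust
use std::mem::MaybeUninit;

const MAX_LEN: usize = 1024;

fn fill_buffer() -> u64 {
    // Stable equivalent of the unstable MaybeUninit::uninit_array().
    let mut buffer_uninit: [MaybeUninit<u8>; MAX_LEN] =
        unsafe { MaybeUninit::uninit().assume_init() };

    // Initialize every element first, through write()...
    for (i, slot) in buffer_uninit.iter_mut().enumerate() {
        slot.write(i as u8);
    }

    // ...and only then view the memory as initialized u8s.
    let buffer: &mut [u8] = unsafe {
        std::slice::from_raw_parts_mut(buffer_uninit.as_mut_ptr() as *mut u8, MAX_LEN)
    };

    // Return something observable so the whole function is not optimized away.
    buffer.iter().map(|&b| b as u64).sum()
}

fn main() {
    println!("checksum: {}", fill_buffer()); // prints "checksum: 130560"
}
```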


Thanks. The test function now actually fills the buffer with random data, and that is observable.

The general problem:

I have a large array on the stack. I need it to be uninitialized at first, because I see no sense in initializing it and wasting CPU time. For example, at the next stage of the program it will be filled with data from user input, so what is the point of initializing it when the data will be overwritten with new values later anyway? In C and C++ this can be done easily, yet in Rust having a large stack buffer makes the program slow for minutes...

What am I doing wrong?

Note: if you choose Show ASM, you can see what happened in release mode:

playground::main: # @playground::main
# %bb.0:
	push	rbp
	push	r15
	push	r14
	push	r12
	push	rbx
	sub	rsp, 224
	call	qword ptr [rip + std::time::Instant::now@GOTPCREL]
	mov	r14, rax
	mov	ebx, edx
	xor	ebp, ebp
	mov	r15, qword ptr [rip + rand@GOTPCREL]

.LBB5_1:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB5_2 Depth 2
	mov	r12d, 1024

.LBB5_2:                                #   Parent Loop BB5_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	call	r15 # Here!
	dec	r12
	jne	.LBB5_2
# %bb.3:                                #   in Loop: Header=BB5_1 Depth=1
	inc	ebp
	cmp	ebp, 100000
	jne	.LBB5_1

I think it's slow in debug mode because:

playground::test_function: # @playground::test_function
# %bb.0:
	sub	rsp, 3208
	lea	rdi, [rsp + 2168]
	lea	rsi, [rsp + 1064]
	mov	edx, 1024
	call	memcpy@PLT
	lea	rdi, [rsp + 40]
	lea	rsi, [rsp + 2168]
	mov	edx, 1024
	call	memcpy@PLT
	lea	rax, [rsp + 40]
	mov	qword ptr [rsp + 3192], rax
	mov	qword ptr [rsp + 3200], 1024
	lea	rax, [rsp + 40]
	mov	qword ptr [rsp + 2136], rax
	lea	rdi, [rsp + 40]
	mov	esi, 1024
	call	core::slice::raw::from_raw_parts_mut

Hi,

Aren't those too many calls to memcpy@PLT? Is that stack-efficient?

And why:

let mut buffer_uninit: [MaybeUninit<u8>; MAX_LEN] = MaybeUninit::uninit_array();

let mut buffer: &mut [u8] = unsafe {
    let ptr = buffer_uninit.as_mut_ptr() as *mut u8;
    std::slice::from_raw_parts_mut(ptr, buffer_uninit.len())
};

would require memcpy, when it is only a mechanism for "casting" a pointer and setting the length of the slice (even in DEBUG mode)?

Hmm... I'm not an expert on the compiler, but I don't think this can be optimized in the dev profile. In the LLVM IR generated by rustc (with -C opt-level=0, the same as cargo's dev profile):

; example::test_function
define internal void @example::test_function() unnamed_addr {
start:
  %slot.i = alloca %"core::mem::manually_drop::ManuallyDrop<[core::mem::maybe_uninit::MaybeUninit<u8>; 1024]>", align 1
  %self.i = alloca %"core::mem::maybe_uninit::MaybeUninit<[core::mem::maybe_uninit::MaybeUninit<u8>; 1024]>", align 1
  %buffer_uninit = alloca [1024 x i8], align 1
  call void @llvm.memcpy.p0.p0.i64(ptr align 1 %slot.i, ptr align 1 %self.i, i64 1024, i1 false)
  call void @llvm.memcpy.p0.p0.i64(ptr align 1 %buffer_uninit, ptr align 1 %slot.i, i64 1024, i1 false)
; call core::slice::raw::from_raw_parts_mut
  %0 = call { ptr, i64 } @core::slice::raw::from_raw_parts_mut(ptr %buffer_uninit, i64 1024)
  %_7.0 = extractvalue { ptr, i64 } %0, 0
  %_7.1 = extractvalue { ptr, i64 } %0, 1
  ret void
}

code on godbolt.org
I don't know if this is intentional design; it just allocates three values on the stack and then memcpys them around.

Maybe you can just raise the opt-level of the dev profile in your Cargo.toml.
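For instance, something along these lines in Cargo.toml (a sketch; opt-level = 1 is one reasonable choice, any level above 0 lets LLVM clean up the memcpys while keeping debug assertions and debug info):

```toml
# Cargo.toml: speed up debug builds without switching to a release build.
[profile.dev]
opt-level = 1
```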

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.