Fastest way to create a byte array

This was probably already asked before, but maybe I am too dumb to find it:
I want to create an array on the stack. This is how I would implement it in C:

uint8_t buffer[4000];

It will be filled later, so it doesn't matter that it is uninitialized.

It seems like in rust we have the option to initialize the array:

let x:[u8; 4000] = [0u8;4000];

But this should be slower than the C solution, right? So I found this:

let x:[u8; 4000] = unsafe{std::mem::uninitialized()};

But this is even slower, than the previous solution.

What is the fasted way to create an uninitialized array in rust?

1 Like

Have you actually measured that?

If not how can we assume "right".

mem::uninitialized is deprecated for a reason; don't use it.

The simple-and-fast way is https://doc.rust-lang.org/nightly/std/mem/union.MaybeUninit.html#method.uninit_array, though that's not stable yet, so you might need to look at what it does internally.

And if you're trying to measure stuff, be sure you're using a library like this:

Yes, I tried MaybeUninit as well. But the performance was the same. I measured it simply with Instant::now(). Will have a look at Criterion!

Thanks

A Rust integer is not allowed to be uninitialized, which is why you need to go through MaybeUninit.

1 Like

Did you compile in release mode?

1 Like

The performance will probably seem identical because the code is significantly faster than the resolution of your PC's clock, and in a lot of cases LLVM will see the buffer is never used so we can optimise it out. Proper benchmarking frameworks have ways to deal with these sorts of things.

By looking at the generated assembly I noticed that LLVM won't elide the zeroing even when compiling in release mode (cargo build --release), and had to reach for MaybeUninit to create a buffer of uninitialized data.

use std::mem::MaybeUninit;

pub fn squares_zeroed() -> [usize; 1024] {
    let mut buffer = [0; 1024];

    for i in 0..buffer.len() {
        buffer[i] = i*i;
    }

    buffer
}

pub fn squares_uninit() -> [usize; 1024] {
    unsafe {
        let mut buffer = MaybeUninit::uninit_array::<1024>();

        for i in 0..buffer.len() {
            buffer[i].write(i*i);
        }

        // SAFETY: The buffer is now initialized
        std::mem::transmute(buffer)
    }
}

The main difference between the two is that squares_array() calls memset() before the loop starts.

You can look at my code on godbolt if you want to learn more.

1 Like

squares_zeroed calls memset and squares_uninit calls memcpy. That's why I would expect squares_uninit to be slower. I will try to run some usable benchmarks today. Before I ran it in debug mode.

You may be interested in this (part of the) tutorial:

1 Like

In theory the uninit version should be faster, however there's currently a bug which makes LLVM not eliding that memcpy. See https://github.com/rust-lang/rust/issues/79754 and https://github.com/rust-lang/rust/issues/79914

5 Likes

Thank you all!
Short follow up question: For no_std I am stuck with array initialization, right?

No, MaybeUninit is also in core.

2 Likes

It seams like the uninit version is still faster, despite the memcpy bug.
Here is my criterion benchmark setup:

//lib.rs
#![feature(maybe_uninit_uninit_array, min_const_generics)]

pub fn array_init<const LEN: usize>() -> [u8; LEN] {
    let x = [0u8; LEN];
    x
}

pub fn array_uninit<const LEN: usize>() -> [MaybeUninit<u8>; LEN] {
    MaybeUninit::uninit_array()
}
//my_benchmark.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rust_bench::*;

pub fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("array init 256", |b| b.iter(|| array_init::<256>()));
    c.bench_function("array uninit 256", |b| b.iter(|| array_uninit::<256>()));
    c.bench_function("array init 512", |b| b.iter(|| array_init::<512>()));
    c.bench_function("array uninit 512", |b| b.iter(|| array_uninit::<512>()));
    c.bench_function("array init 1024", |b| b.iter(|| array_init::<1024>()));
    c.bench_function("array uninit 1024", |b| b.iter(|| array_uninit::<1024>()));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Results:
array init 256 time: [31.249 ns 31.352 ns 31.464 ns]
array uninit 256 time: [30.689 ns 30.780 ns 30.877 ns]
array init 512 time: [69.623 ns 69.774 ns 69.930 ns]
array uninit 512 time: [61.459 ns 61.599 ns 61.783 ns]
array init 1024 time: [134.12 ns 134.46 ns 134.82 ns]
array uninit 1024 time: [124.31 ns 125.42 ns 126.95 ns]

This is explicitly stated as UB, sorry. Not sure what is really measured here.

3 Likes

Yeah, as I said before, it is instant undefined behavior if an integer is ever uninitialized in Rust. You must use a type such as MaybeUninit<u8>.

pub fn array_uninit<const LEN: usize>() -> [MaybeUninit<u8>; LEN] {
    let x: [MaybeUninit<u8>; LEN] = unsafe{std::mem::MaybeUninit::uninit().assume_init()};
    x
}
1 Like

Thanks, I understand now. But this doesn't change the results from above:

pub fn array_uninit<const LEN: usize>() -> [MaybeUninit<u8>; LEN] {
    MaybeUninit::uninit_array()
}

Note that even with a good library like criterion, this is such a small thing that there's really nothing to measure. I would strongly suggest benchmarking whatever's using the 0-initialized or uninitialized array instead, since that's where the speed is relevant.

If you compare the assembly, I think it's clear that most of the time measurement you're seeing is overhead: https://rust.godbolt.org/z/P43TK3

pub fn array_init<const LEN: usize>() -> [u8; LEN] {
    let mut buffer = [0u8; LEN];
    for i in 0..buffer.len() {
        buffer[i] = (i*i) as u8;
    }
    buffer
}

pub fn array_uninit<const LEN: usize>() -> [MaybeUninit<u8>; LEN] {
    let mut buffer = MaybeUninit::uninit_array();
    for i in 0..buffer.len() {
        buffer[i].write((i*i) as u8);
    }
    buffer
}

pub fn array_custom_init() -> [u8; 1024] {
    let mut buffer:[MaybeUninit<u8>;1024] = MaybeUninit::uninit_array();
    for i in 0..buffer.len() {
        buffer[i].write((i*i) as u8);
    }
    unsafe{std::mem::transmute(buffer)}

I ran it for more or less your example and I get this:
array init 1024 time: [808.61 ns 809.44 ns 810.53 ns]
array uninit 1024 time: [607.87 ns 608.10 ns 608.35 ns]
array custom init 1024 time: [825.10 ns 825.39 ns 825.73 ns]

Is that more valuable?

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.