Fast vector initialization without default value

Hello everyone.

I am trying to optimize my code and I noticed that vector initialization takes a lot of time. (I haven't compared with other languages, though.)

For instance I am doing this to declare a new vector:

let mut buffer: Vec<T> = unsafe {
    vec![std::mem::zeroed(); size]
};

For 100_000_000 elements it takes 13298ns. It is super fast; I am super happy.

BUT:
I never read a value from this vector before writing to it, and I write a value to every possible location. So I don't need the default value.
I noticed that the first write to this vector takes twice as long as subsequent ones.

For example, when filling the whole vector with new values, the first pass takes 437883674ns and subsequent passes take 201314336ns.

After reading this blog post https://www.ralfj.de/blog/2019/07/14/uninit.html and trying everything that can be done with unsafe Rust (std::mem, std::ptr, MaybeUninit, etc.), I figured out that during the first write each location in the vector also gets properly initialized, hence the doubled time.

So the time saved during the vector declaration is actually spent somewhere else: during the first write.

So my question is: since I don't need the default value and I never read the vector before writing to it, is there a way in Rust to avoid the initialization cost? It would be a significant gain for my algorithm.

Thanks for reading.

Have you tried Vec::with_capacity(size)? That allocates the space required for the vec, but doesn't actually put any values in it.


Yes, I already tried that.

On its own, it doesn't let me write at an arbitrary position; I can only use the push method. And I need to be able to write at any position in this vector.

If I add a buffer.set_len(size), the behavior is the same as what I described above.

I smell UB.

std::mem::zeroed() for a generic T is undefined behavior in many cases.

You should be using MaybeUninit.
You could also write a wrapper around Vec and provide a custom insert method that grows the Vec if necessary.
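A minimal sketch of the MaybeUninit approach (the function name and element type are illustrative, not from this thread): allocate with Vec::with_capacity, write every slot through the spare capacity, and only set the length afterwards, so no uninitialized value of the element type is ever produced.

```rust
// Sketch: "write before read" initialization without a default value,
// using Vec::spare_capacity_mut() (stable since Rust 1.60).
fn init_without_default(size: usize) -> Vec<u64> {
    let mut buffer: Vec<u64> = Vec::with_capacity(size);

    // The spare capacity is exposed as &mut [MaybeUninit<u64>], so we can
    // write each slot without ever materializing an uninitialized u64.
    for (i, slot) in buffer.spare_capacity_mut()[..size].iter_mut().enumerate() {
        slot.write(i as u64); // write every position exactly once
    }

    // SAFETY: all `size` elements were initialized in the loop above.
    unsafe { buffer.set_len(size) };
    buffer
}
```

The key point is that set_len comes after the writes, not before, which keeps the code sound for any element type.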


I can use MaybeUninit; it actually has the same performance. (I already tried.)

The vector's size is known, so I don't need to resize it dynamically. That would hurt performance.

What are you trying to put in this vector, and how are you trying to use it afterwards? This makes a big difference, because initializing an array with something Copy gets to take a massive shortcut.

Using Vec::with_capacity() should take next to no time because it just allocates memory without even reading it. It's essentially the overhead of malloc() checking its bookkeeping and bumping a couple counters to reserve some space for you.

The vec![] macro will use the memset() function from libc (std::ptr::write_bytes() in Rust), which is one of the most highly optimised functions in existence for setting every byte in an array (it's used in literally everything).


It may help to have some numbers, so I ran some benchmarks on my laptop.

Benchmark Code
#![feature(test)]
extern crate test;
use test::Bencher;

const SIZE: usize = 1_000_000;

#[bench]
fn vec_with_capacity(b: &mut Bencher) {
    b.iter(|| Vec::<u8>::with_capacity(SIZE));
}

#[bench]
fn vec_with_capacity_and_pushing(b: &mut Bencher) {
    b.iter(|| {
        let mut vector = Vec::with_capacity(SIZE);

        for _ in 0..SIZE {
            vector.push(0);
        }

        vector
    });
}

#[bench]
fn vec_macro(b: &mut Bencher) {
    b.iter(|| vec![0_u8; SIZE]);
}

#[bench]
fn vec_with_capacity_and_memset(b: &mut Bencher) {
    b.iter(|| {
        let mut vector = Vec::<u8>::with_capacity(SIZE);

        unsafe { 
            vector.set_len(SIZE); 
            std::ptr::write_bytes(vector.as_mut_ptr(), 0, SIZE);
        }

        vector
    });
}

#[bench]
fn loop_and_push(b: &mut Bencher) {
    b.iter(|| {
        let mut vector = Vec::new();

        for _ in 0..SIZE {
            vector.push(0);
        }
        
        vector
    });
}

For SIZE = 100_000:

$ cargo bench
test loop_and_push                 ... bench:     694,413 ns/iter (+/- 10,511)
test vec_macro                     ... bench:       8,319 ns/iter (+/- 622)
test vec_with_capacity             ... bench:         130 ns/iter (+/- 12)
test vec_with_capacity_and_memset  ... bench:       8,501 ns/iter (+/- 987)
test vec_with_capacity_and_pushing ... bench:     696,194 ns/iter (+/- 41,419)

For SIZE = 1_000_000:

$ cargo bench
test loop_and_push                 ... bench:   6,918,695 ns/iter (+/- 524,425)
test vec_macro                     ... bench:      88,073 ns/iter (+/- 15,316)
test vec_with_capacity             ... bench:         132 ns/iter (+/- 9)
test vec_with_capacity_and_memset  ... bench:      87,081 ns/iter (+/- 14,708)
test vec_with_capacity_and_pushing ... bench:   7,008,704 ns/iter (+/- 730,025)

For SIZE = 50_000_000:

$ cargo bench
test loop_and_push                 ... bench: 634,253,227 ns/iter (+/- 24,916,167)
test vec_macro                     ... bench:      22,132 ns/iter (+/- 2,654)
test vec_with_capacity             ... bench:      21,489 ns/iter (+/- 682)
test vec_with_capacity_and_memset  ... bench:  81,422,411 ns/iter (+/- 8,883,889)
test vec_with_capacity_and_pushing ... bench: 644,344,068 ns/iter (+/- 43,302,454)

These numbers are useless at best, and quite possibly misleading without more information on your use case.

Something that may help is to change your algorithm to make it lazy (generate data on demand) instead of allocating upfront. Or try to identify why you need so much memory (100 million is a big number!) and see if it's possible to make do with less.


Thanks for this very complete response.
I will give more details then.

I am working on a radix sort. (I want to publish it as a crate; I need a very fast sort for my work.)

The sort should work for all Rust primitive types (except arrays), and users will be able to add custom types by implementing traits.
The implementation uses generics, and each supported type is handled through trait implementations.

100_000_000 was just an example.

A few runtimes so you can get an idea (all runs use uniformly distributed input):

running 1 test
=== Test u64 ===

Array size         STD RUST           RUST UNS          RADIX SORT
100                2us                1us               1us
1_000              38us               20us              9us
10_000             554us              218us             78us
50_000             2_847us            1_242us           413us
100_000            6_114us            2_481us           872us
500_000            35_619us           13_410us          5_442us
1_000_000          75_373us           27_892us          9_782us
5_000_000          422_847us          149_994us         60_738us
10_000_000         882_897us          306_678us         121_378us
50_000_000         4_854_907us        1_660_546us       638_241us
100_000_000        10_132_886us       3_418_168us       1_328_904us
500_000_000        54_704_344us       18_252_342us      6_961_791us
1_000_000_000      111_805_793us      37_726_152us      13_960_051us
test tests::speed_sort::speed_test_u64 ... ok

The buffer vector is needed for the LSD radix sort. An LSD radix sort needs a buffer array to store values, and this buffer is the same size as the input array.

I create the buffer.
Then in the first pass I move all the values from the input array into the buffer, but at different positions. (The positions are computed from the radix, but that's not the point here.)
Then in the second pass I move all the values from the buffer back into the input array.
(Like ping-pong.)
I can do that as many times as required.

So the buffer array does not need to hold meaningful values at the beginning.

For example, for an input array of 100_000_000 elements, on my computer, initializing the buffer with a non-zero value takes about 250_000_000ns, which is about 20% of the sort time.

That's why I try to gain time on this.
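For illustration, one LSD pass as described above (the ping-pong scatter from the input into the buffer) could be sketched like this, assuming u64 keys and an 8-bit digit; the names and digit width are mine, not from the thread. The scatter writes every slot of the buffer exactly once, which is exactly why the buffer needs no initial values.

```rust
// One counting pass of an LSD radix sort (8-bit digit, illustrative).
// `buffer` must be the same length as `input`; every slot of `buffer`
// is written exactly once, so its initial contents are never read.
fn radix_pass(input: &[u64], buffer: &mut [u64], shift: u32) {
    // 1. Count how many keys fall into each of the 256 buckets.
    let mut counts = [0usize; 256];
    for &v in input {
        counts[((v >> shift) & 0xFF) as usize] += 1;
    }

    // 2. Exclusive prefix sum: counts[d] becomes the start offset of bucket d.
    let mut sum = 0;
    for c in counts.iter_mut() {
        let bucket = *c;
        *c = sum;
        sum += bucket;
    }

    // 3. Stable scatter into the buffer at the computed positions.
    for &v in input {
        let d = ((v >> shift) & 0xFF) as usize;
        buffer[counts[d]] = v;
        counts[d] += 1;
    }
}
```

A full sort would alternate such passes between the input and the buffer, one pass per digit.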

I hope that's clear.


Oh, I forgot: all elements must implement the Copy trait.

vec![MaybeUninit::uninit(); n] shouldn't have overhead when used correctly. Can you show your code?
I'd guess cache utilization and memory access patterns make a big difference in such algorithms. Did you analyze it from that perspective?
We could give more specific comments if you showed us your code.
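To make that concrete, here is a hedged sketch of the vec![MaybeUninit::uninit(); n] pattern (function names are mine, not from the thread). It compiles because MaybeUninit<T> is Clone when T: Copy, and cloning an uninit slot copies no meaningful data, so the allocation does essentially no per-element work.

```rust
use std::mem::{ManuallyDrop, MaybeUninit};

// Allocate an uninitialized buffer; no bytes are written per element.
fn uninit_buffer(n: usize) -> Vec<MaybeUninit<u64>> {
    vec![MaybeUninit::uninit(); n]
}

// Once every slot has been written, reinterpret the buffer as Vec<u64>
// via raw parts (MaybeUninit<T> is guaranteed to have the same layout as T).
//
// SAFETY: the caller must have initialized all `len` elements.
unsafe fn into_init(buf: Vec<MaybeUninit<u64>>) -> Vec<u64> {
    let mut buf = ManuallyDrop::new(buf);
    let (ptr, len, cap) = (buf.as_mut_ptr(), buf.len(), buf.capacity());
    Vec::from_raw_parts(ptr as *mut u64, len, cap)
}
```

The unsafe conversion is only sound if the "write every slot before reading" discipline from the radix-sort passes is actually upheld.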