Filling vector with for in vs range with iterator. Extremely different performance

Hello. I've found out that if I fill a vector with a `for` loop versus an iterator, there is a huge performance difference. Why is that? These two examples should be almost equivalent. However, I'm still a beginner in Rust.

Range with iterator, which takes on the order of a few ms I guess; it's not really measurable:

let mut initial_nodes: Vec<Option<HashNode>> = [0..200_000_000].iter().map(|_| {
    None
}).collect();

A `for` loop that takes around 600 ms!

let mut initial_nodes: Vec<Option<HashNode>> = Vec::with_capacity(200_000_000);

for _ in 0..200_000_000 {
    initial_nodes.push(None);
}

Where's the difference?
Thanks :slight_smile:

PS: It's part of a competitive programming so every ms counts

1 Like

Your first list has length one. You are creating an array containing a single range, then replacing that range with a single None with map, and collecting that.

2 Likes

The first code snippet doesn't do what you expect. It doesn't iterate over the range; it creates an array with one element (the range itself) and iterates over that, therefore producing a one-element Vec.

After fixing this and reducing the element count (to avoid hitting the playground timeout), I get the following (playground):

Mapped 200000 nodes in 236169 ns
Pushed 200000 nodes in 355223 ns

Not sure what's the reason for this difference though.

2 Likes

It seems that the difference comes from loop-condition checks, vector-capacity checks (the possible need to reallocate), and SIMD optimization.


In the collect version:

In the end, the compiler optimizes this simple loop into quite efficient copying:

  • Bulk copy via SSE2.
  • Loop unrolling: the loop condition is not checked for every element, but once per 8 elements.
.LBB1_2:
        ; Copy 8 elements (for `Option<u32>` case).
        ; This may differ for other types.
        ; Note that `xmm0` register can store two 64-bit integers (i.e. two `Option<u32>`).
        movups  xmmword ptr [rax + 8*rcx], xmm0
        movups  xmmword ptr [rax + 8*rcx + 16], xmm0
        movups  xmmword ptr [rax + 8*rcx + 32], xmm0
        movups  xmmword ptr [rax + 8*rcx + 48], xmm0
        ; Add 8 to the counter.
        add     rcx, 8
        ; Check loop condition.
        cmp     rcx, 200000000
        jne     .LBB1_2

        ; Create the `Vec`: Set pointer, length, and capacity.
        mov     qword ptr [rbx], rax
        mov     qword ptr [rbx + 8], 200000000
        mov     qword ptr [rbx + 16], 200000000

On the other hand, if you push() manually, the program checks the loop condition and the vector capacity every time you push():

.LBB2_9:
        ; Copy single element.
        mov     qword ptr [rdi + 8*r13], r14
        ; Increment the counter.
        add     r13, 1
        ; Check loop condition.
        cmp     r13, 200000000
        je      .LBB2_10
        ; Check the necessity of reallocation?
        cmp     r13, rsi
        jne     .LBB2_9

See https://godbolt.org/z/j5nJo2 for compiled code.

I haven't measured the code, so I don't know whether this is really the cause of the difference.
However, the compiled code does look very efficient for the version that .collect()s a TrustedLen iterator.

Theoretically, the compiler might be able to optimize the .push() version to be just as efficient.
However, I don't know whether that is practically possible, and it might slow down compilation...


EDIT: Sorry, I misread the assembly. Length-overflow checks are done at reallocation time, not in the loop. However, the capacity does seem to be checked every time.

4 Likes