It seems to be because of loop-condition checks, vector-capacity checks (i.e. whether reallocation is necessary), and SIMD optimization.
In the `.collect()` version:
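For reference, here is a minimal sketch of the kind of code being compared; the element type (`Option<u32>`), value, and count are assumptions inferred from the assembly below, and the original benchmark may differ:

```rust
/// Sketch: build the `Vec` by collecting from an iterator whose
/// length is statically known (ranges implement `TrustedLen`).
fn collect_version() -> Vec<Option<u32>> {
    // The element and count are assumptions taken from the assembly below.
    (0..200_000_000).map(|_| Some(1)).collect()
}
```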
In the end, the compiler optimizes this simple loop into quite efficient copying:
- Bulk copy using SSE2.
- Loop unrolling: the loop condition is checked once per 8 elements instead of once per element.
```asm
.LBB1_2:
; Copy 8 elements (for the `Option<u32>` case;
; this may differ for other types).
; Note that the 128-bit `xmm0` register can hold two 64-bit
; values, i.e. two `Option<u32>` elements per store.
movups xmmword ptr [rax + 8*rcx], xmm0
movups xmmword ptr [rax + 8*rcx + 16], xmm0
movups xmmword ptr [rax + 8*rcx + 32], xmm0
movups xmmword ptr [rax + 8*rcx + 48], xmm0
; Add 8 to the counter.
add rcx, 8
; Check the loop condition.
cmp rcx, 200000000
jne .LBB1_2
; Create the `Vec`: set pointer, length, and capacity.
mov qword ptr [rbx], rax
mov qword ptr [rbx + 8], 200000000
mov qword ptr [rbx + 16], 200000000
```
On the other hand, if you `push()` each element manually, the program checks the loop condition and the vector capacity on every `push()`.
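A sketch of that version, under the same assumptions as above:

```rust
/// Sketch: build the `Vec` by pushing each element manually.
fn push_version() -> Vec<Option<u32>> {
    let mut v = Vec::new();
    for _ in 0..200_000_000 {
        // Each `push()` checks whether the capacity suffices
        // and reallocates when it does not.
        v.push(Some(1));
    }
    v
}
```

Its hot loop compiles to something like this: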
```asm
.LBB2_9:
; Copy a single element.
mov qword ptr [rdi + 8*r13], r14
; Increment the counter.
add r13, 1
; Check the loop condition.
cmp r13, 200000000
je .LBB2_10
; Check the necessity of reallocation?
cmp r13, rsi
jne .LBB2_9
```
See https://godbolt.org/z/j5nJo2 for the compiled code.
I haven't measured the code, so I don't know whether this is really the cause of the difference.
However, the compiled code for the `.collect()`ed-`TrustedLen`-iterator version does look very efficient.
Theoretically, the compiler may be able to optimize the `.push()` version to be just as efficient. However, I don't know whether that is practically possible, and it might slow down compilation...
EDIT: Sorry, I misread the assembly. Length overflow checks are done at reallocation time, not in the loop. However, the capacity does seem to be checked on every iteration.
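As an aside (my sketch, not part of the original comparison): pre-allocating with `Vec::with_capacity` avoids the reallocations in the `push()` version, though the per-`push()` capacity check itself remains:

```rust
/// Sketch: pre-allocate so that `push()` never needs to reallocate.
fn push_with_capacity() -> Vec<Option<u32>> {
    let mut v = Vec::with_capacity(200_000_000);
    for _ in 0..200_000_000 {
        // The capacity check still runs, but never triggers a reallocation.
        v.push(Some(1));
    }
    v
}
```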