No. This depends heavily on what the loop is doing, or more directly, on what the LLVM optimizer does to the loop. If all goes to plan (and, as you can see, without .step_by() it does for simple loops), the for-loop machinery can lead to better code.
In particular, look at the generated assembly for for_loop:
_ZN8for_loop20h1b8a955f18e00a21faaE:
cmpq %fs:112, %rsp
ja .LBB0_2
movabsq $8, %r10
movabsq $0, %r11
callq __morestack
retq
.LBB0_2:
pushq %rax
.Ltmp0:
movl $1, (%rsp)
leaq (%rsp), %rax
movl (%rsp), %eax
testl %eax, %eax
js .LBB0_9
movabsq $4294967296000000, %rcx
movabsq $-4294967296, %r8
leaq 4(%rsp), %rsi
jmp .LBB0_4
.LBB0_8:
movl $1, 4(%rsp)
.LBB0_4:
movq %rcx, %rdi
shrq $32, %rdi
cmpl %edi, %ecx
jge .LBB0_9
movl %ecx, %edx
addl %eax, %edx
jno .LBB0_7
andq %r8, %rcx
orq %rdi, %rcx
jmp .LBB0_8
.LBB0_7:
movl %edx, %edx
andq %r8, %rcx
orq %rdx, %rcx
jmp .LBB0_8
.LBB0_9:
popq %rax
retq
We can see that the for loop was partially unrolled, whereas the while loop was not:
_ZN10while_loop20h0ba3546ff3fcadc7XaaE:
cmpq %fs:112, %rsp
ja .LBB1_2
movabsq $4, %r10
movabsq $0, %r11
callq __morestack
retq
.LBB1_2:
subq $4, %rsp
.Ltmp2:
movl $1000000, %eax
leaq (%rsp), %rcx
.LBB1_3:
movl $1, (%rsp)
decl %eax
jne .LBB1_3
addq $4, %rsp
retq
Note also that this example could have been written more succinctly as (0..1_000_000).map(test::black_box).sum(), which incidentally has about the same performance as the for loop (as it should, since it generates the same assembly).
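For reference, here is a rough sketch of how the three variants compare side by side. The function bodies are my assumption, not the original benchmark code, and I use std::hint::black_box (available on stable Rust) in place of the nightly-only test::black_box. The iterator_chain name is made up for illustration, and the accumulator is a u64 so the sum does not overflow.

```rust
use std::hint::black_box;

// Assumed reconstruction: sums the range with a plain for loop.
fn for_loop() -> u64 {
    let mut sum = 0u64;
    for i in 0..1_000_000u64 {
        // black_box keeps the optimizer from folding the loop away.
        sum += black_box(i);
    }
    sum
}

// Same work, with the counter managed by hand in a while loop.
fn while_loop() -> u64 {
    let mut sum = 0u64;
    let mut i = 0u64;
    while i < 1_000_000 {
        sum += black_box(i);
        i += 1;
    }
    sum
}

// The succinct iterator form mentioned above (hypothetical name).
fn iterator_chain() -> u64 {
    (0..1_000_000u64).map(black_box).sum()
}

fn main() {
    // All three variants should compute the same total.
    let a = for_loop();
    let b = while_loop();
    let c = iterator_chain();
    assert_eq!(a, b);
    assert_eq!(b, c);
    println!("sum = {}", a);
}
```

Whether the three compile to the same machine code depends on the compiler version and optimization level, so it is worth checking the emitted assembly for your own toolchain rather than taking the listings above as universal.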