Benchmark: for loop with range slower than while?


#1

I have seen it being used a lot in the documentation and I just wanted to know if this is well-known.

Loops using ranges are only about half as performant as loops written with a while and a counting variable, you can see this here.

EDIT: as the first reply suggests the above statement is wrong

Now I have a few questions and suggestions:

  • is there anything even faster?
  • optimizing for loops
    • are there plans?
    • is it possible?
  • documentation
    • are there reasons why it is used in the docs, other than readability?
    • should it be replaced?
      • or at least mention it so people use while if their application needs performance

#2

If you look at the created LLVM IR, you can see that your .step_by(test::black_box(1)) has kept LLVM’s optimizer from inlining the whole iterator pipeline. Thus, be careful, your benchmark may not measure what you think it measures.

I’ve set up a gist with a benchmark that shows this. Note that .step_by(_) is still slower – it could well be that something trips up LLVM’s optimizer. The step_by feature is unstable and likely to receive some more optimization before it’s stabilized.


#3

I have also tried running the benchmark without the step_by and
instead blackboxing the upper value, which yielded the same result. I
thought it might optimize a bit more if I did not replace the ranges
construction value.

Also I am not comfortable with LLVM IR (yet, I will look into it), so
please excuse me if I am, at some point not correct.

I just ran your example, which shows that the for loop really is faster
than the while.

Is this correct in all cases?


#4

No. This depends much on what the loop is doing, or more directly stated, what the LLVM optimizer is doing to the loop. If all goes to plan (and as you see without .step_by() it does for simple loops), the for-loop-machinery can lead to better code.

In particular, looking at the generated assembly for for_loop:

_ZN8for_loop20h1b8a955f18e00a21faaE:
	cmpq	%fs:112, %rsp
	ja	.LBB0_2
	movabsq	$8, %r10
	movabsq	$0, %r11
	callq	__morestack
	retq
.LBB0_2:
	pushq	%rax
.Ltmp0:
	movl	$1, (%rsp)
	leaq	(%rsp), %rax
	movl	(%rsp), %eax
	testl	%eax, %eax
	js	.LBB0_9
	movabsq	$4294967296000000, %rcx
	movabsq	$-4294967296, %r8
	leaq	4(%rsp), %rsi
	jmp	.LBB0_4
.LBB0_8:
	movl	$1, 4(%rsp)
.LBB0_4:
	movq	%rcx, %rdi
	shrq	$32, %rdi
	cmpl	%edi, %ecx
	jge	.LBB0_9
	movl	%ecx, %edx
	addl	%eax, %edx
	jno	.LBB0_7
	andq	%r8, %rcx
	orq	%rdi, %rcx
	jmp	.LBB0_8
.LBB0_7:
	movl	%edx, %edx
	andq	%r8, %rcx
	orq	%rdx, %rcx
	jmp	.LBB0_8
.LBB0_9:
	popq	%rax
	retq

We can see that the loop was partially unrolled, whereas the while loop was not:

_ZN10while_loop20h0ba3546ff3fcadc7XaaE:
    cmpq    %fs:112, %rsp
    ja    .LBB1_2
    movabsq    $4, %r10
    movabsq    $0, %r11
    callq    __morestack
    retq
.LBB1_2:
    subq    $4, %rsp
.Ltmp2:
    movl    $1000000, %eax
    leaq    (%rsp), %rcx
.LBB1_3:
    movl    $1, (%rsp)
    decl    %eax
    jne    .LBB1_3
    addq    $4, %rsp
    retq

Note also that this example could have been more succinctly written as (0..1_000_000).map(test::black_box).sum(), which incidentially has about the same performance as the for loop (as it should – it generates the same assembly).