Benchmark: for loop with range slower than while?

benaryorg · June 16, 2015, 9:13am

I have seen it being used a lot in the documentation and I just wanted to know if this is well-known.

Loops using ranges are only about half as performant as loops written with a while and a counting variable, you can see this here.

EDIT: as the first reply suggests the above statement is wrong

Now I have a few questions and suggestions:

is there anything even faster?
optimizing for loops
- are there plans?
- is it possible?
documentation
- are there reasons why it is used in the docs, other than readability?
- should it be replaced?
  - or at least mention it so people use while if their application needs performance

llogiq · June 16, 2015, 9:32am

If you look at the created LLVM IR, you can see that your .step_by(test::black_box(1)) has kept LLVM's optimizer from inlining the whole iterator pipeline. Thus, be careful, your benchmark may not measure what you think it measures.

I've set up a gist with a benchmark that shows this. Note that .step_by(_) is still slower – it could well be that something trips up LLVM's optimizer. The step_by feature is unstable and likely to receive some more optimization before it's stabilized.

benaryorg · June 16, 2015, 9:53am

I have also tried running the benchmark without the step_by and
instead blackboxing the upper value, which yielded the same result. I
thought it might optimize a bit more if I did not replace the ranges
construction value.

Also I am not comfortable with LLVM IR (yet, I will look into it), so
please excuse me if I am, at some point not correct.

I just ran your example, which shows that the for loop really is faster
than the while.

Is this correct in all cases?

llogiq · June 16, 2015, 10:09am

No. This depends much on what the loop is doing, or more directly stated, what the LLVM optimizer is doing to the loop. If all goes to plan (and as you see without .step_by() it does for simple loops), the for-loop-machinery can lead to better code.

In particular, looking at the generated assembly for for_loop:

_ZN8for_loop20h1b8a955f18e00a21faaE:
	cmpq	%fs:112, %rsp
	ja	.LBB0_2
	movabsq	$8, %r10
	movabsq	$0, %r11
	callq	__morestack
	retq
.LBB0_2:
	pushq	%rax
.Ltmp0:
	movl	$1, (%rsp)
	leaq	(%rsp), %rax
	movl	(%rsp), %eax
	testl	%eax, %eax
	js	.LBB0_9
	movabsq	$4294967296000000, %rcx
	movabsq	$-4294967296, %r8
	leaq	4(%rsp), %rsi
	jmp	.LBB0_4
.LBB0_8:
	movl	$1, 4(%rsp)
.LBB0_4:
	movq	%rcx, %rdi
	shrq	$32, %rdi
	cmpl	%edi, %ecx
	jge	.LBB0_9
	movl	%ecx, %edx
	addl	%eax, %edx
	jno	.LBB0_7
	andq	%r8, %rcx
	orq	%rdi, %rcx
	jmp	.LBB0_8
.LBB0_7:
	movl	%edx, %edx
	andq	%r8, %rcx
	orq	%rdx, %rcx
	jmp	.LBB0_8
.LBB0_9:
	popq	%rax
	retq

We can see that the loop was partially unrolled, whereas the while loop was not:

_ZN10while_loop20h0ba3546ff3fcadc7XaaE:
    cmpq    %fs:112, %rsp
    ja    .LBB1_2
    movabsq    $4, %r10
    movabsq    $0, %r11
    callq    __morestack
    retq
.LBB1_2:
    subq    $4, %rsp
.Ltmp2:
    movl    $1000000, %eax
    leaq    (%rsp), %rcx
.LBB1_3:
    movl    $1, (%rsp)
    decl    %eax
    jne    .LBB1_3
    addq    $4, %rsp
    retq

Note also that this example could have been more succinctly written as (0..1_000_000).map(test::black_box).sum(), which incidentially has about the same performance as the for loop (as it should – it generates the same assembly).

Topic		Replies	Views
Simple conditional loop code review	12	821	March 7, 2021
Iterators vs loops, with skip help	5	935	February 21, 2021
Performance difference among Range, RangeInclusive and reversed help	19	3400	January 12, 2023
We all know `iter` is faster than `loop`, but why?	23	14287	February 16, 2021
For vs for_each	7	802	November 26, 2025

Benchmark: for loop with range slower than while?

Related topics