Performance difference among Range, RangeInclusive, and reversed ranges


#1

I ran this simple code:

  use std::time::Instant;

  #[test]
  fn play() {
    let now = Instant::now();
    let mut result: usize = 0;
    for j in (1..1_000_000 + 1).rev() {
      result += j;
    }
    println!("{}", result);
    println!("{:?}", now.elapsed());

    let now = Instant::now();
    let mut result: usize = 0;
    for j in (1..=1_000_000).rev() {
      result += j;
    }
    println!("{}", result);
    println!("{:?}", now.elapsed());

    let now = Instant::now();
    let mut result: usize = 0;
    for j in 1..1_000_001 {
      result += j;
    }
    println!("{}", result);
    println!("{:?}", now.elapsed());

    let now = Instant::now();
    let mut result: usize = 0;
    for j in 1..=1_000_000 {
      result += j;
    }
    println!("{}", result);
    println!("{:?}", now.elapsed());
  }

and got the output:

running 1 test
500000500000
23.868379ms
500000500000
53.537635ms
500000500000
54.150509ms
500000500000
51.44193ms

Why is the reversed exclusive Range version faster than the others?


#2

Depends on who you ask.

   Compiling playground v0.0.1 (file:///playground)
    Finished release [optimized] target(s) in 1.32s
     Running `target/release/playground`

Standard Output

500000500000
28.772µs
500000500000
102.534µs
500000500000
1.828µs
500000500000
1.008266ms

(That was on the playpen, by the way).


#3

Oh, I forgot to add --release when running the test! After running:

cargo test play --release -- --nocapture

the result is:

running 1 test
500000500000
8.832µs
500000500000
1.879µs
500000500000
1.777µs
500000500000
1.754µs

The first one is now the slowest!


#4

As usual with benchmarks, it’s useful to verify that you are measuring what you think you are measuring. LLVM has no trouble recognizing the arithmetic progression and using Gauss’s trick to evaluate it at compile time, so in the assembly output there are no loops at all, only the constant 500000500000.
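
For reference, the constant comes from Gauss’s closed form for the sum of the first n integers: 1 + 2 + … + n = n(n + 1)/2, so 1_000_000 × 1_000_001 / 2 = 500000500000. A quick sanity check in Rust:

fn main() {
    // Gauss's trick: 1 + 2 + ... + n == n * (n + 1) / 2.
    let n: usize = 1_000_000;
    assert_eq!((1..=n).sum::<usize>(), n * (n + 1) / 2);
    println!("{}", n * (n + 1) / 2); // prints 500000500000
}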

So the time here is the syscall time to output stuff to stdout, and a microsecond range for syscalls is reasonable. (Does anyone know of a more recent measurement than https://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html ?)

EDIT: apparently I am an idiot, and the assembly does contain loops for two of the four cycles?


#5

Hm, so stable Rust indeed fails to optimize the inclusive loops, but it’s all fine on nightly.

I’ve used the following program:

pub fn main() {
    let mut result: usize = 0;
    for j in (1..1_000_000 + 1).rev() {
        result += j;
    }
    bb(result);

    let mut result: usize = 0;
    for j in (1..=1_000_000).rev() {
        result += j;
    }
    bb(result);

    let mut result: usize = 0;
    for j in 1..1_000_001 {
        result += j;
    }
    bb(result);

    let mut result: usize = 0;
    for j in 1..=1_000_000 {
        result += j;
    }
    bb(result);
}

// Opaque sink: `inline(never)` keeps the calls (and their computed
// arguments) visible in the emitted IR.
#[inline(never)]
fn bb(r: usize) {
    println!("{}", r);
}

And the following command line to get the optimized IR:

 rustc +nightly -C opt-level=3 --emit=llvm-ir main.rs

We see everything evaluated at compile time in the result:

λ rg 'main::main' -A 13 main.ll
59:; main::main
60-; Function Attrs: uwtable
61-define internal void @_ZN4main4main17hd31e54a560abace9E() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
62-start:
63-; call main::bb
64-  tail call fastcc void @_ZN4main2bb17h7162672b47b28d02E(i64 500000500000)
65-; call main::bb
66-  tail call fastcc void @_ZN4main2bb17h7162672b47b28d02E(i64 500000500000)
67-; call main::bb
68-  tail call fastcc void @_ZN4main2bb17h7162672b47b28d02E(i64 500000500000)
69-; call main::bb
70-  tail call fastcc void @_ZN4main2bb17h7162672b47b28d02E(i64 500000500000)
71-  ret void
72-}
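
For intuition about why the inclusive case is harder on stable, here is a simplified model of inclusive-range iteration (not the actual std implementation, just an illustration): a RangeInclusive must be able to yield its end value, so it needs extra state to know when it is exhausted, and the optimizer has to see through that extra check on every iteration.

// Simplified model; the real std::ops::RangeInclusive is more involved,
// but has the same structural problem.
struct Inclusive {
    start: usize,
    end: usize,
    exhausted: bool, // needed because `end` itself must be yielded
}

impl Iterator for Inclusive {
    type Item = usize;
    fn next(&mut self) -> Option<usize> {
        if self.exhausted || self.start > self.end {
            return None;
        }
        let n = self.start;
        if n == self.end {
            self.exhausted = true; // just yielded the last element
        } else {
            self.start += 1;
        }
        Some(n)
    }
}

fn main() {
    let iter = Inclusive { start: 1, end: 1_000_000, exhausted: false };
    assert_eq!(iter.sum::<usize>(), 500_000_500_000);
}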

#6

To elaborate on this a bit: this does not mean that you should immediately start peering into the assembly output. A very simple but very powerful way to verify that the measurements themselves are correct is to vary the size of the input and check that the timings scale according to the predicted asymptotic complexity. In this case, increasing the bound 10 times should lead to a 10-times increase in execution time. This helps to catch cases where you don’t measure the relevant thing at all, or where the constant overhead of the measurement dwarfs the time of the algorithm itself.
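
A minimal sketch of that check (std::hint::black_box, stable since Rust 1.66, keeps the compiler from precomputing the loop; at the time of this thread you’d use test::black_box on nightly instead):

use std::hint::black_box;
use std::time::Instant;

fn main() {
    // If the loop is really O(n), each line should take roughly
    // 10x as long as the previous one.
    for &n in &[1_000_000usize, 10_000_000, 100_000_000] {
        let now = Instant::now();
        let mut result: usize = 0;
        for j in 1..=black_box(n) {
            result += j;
        }
        black_box(result);
        println!("n = {:>11}: {:?}", n, now.elapsed());
    }
}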


#7

https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/ was recently written (and Eli is a reputable source :slight_smile:)


#8

ohhh, I’ve wanted exactly this blog post for about two years I think, thank you so much for the link!


#9

I’d recommend using Bencher (#[bench]) for such tests, because it ensures tests are run long enough to be meaningful. A single run measured with Instant::now() may be too noisy.
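
For example, a minimal nightly-only sketch of the same measurement with #[bench] (the harness calls the closure repeatedly until the timing stabilizes and reports ns/iter):

#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

#[bench]
fn sum_rev_exclusive(b: &mut Bencher) {
    b.iter(|| {
        let mut result: usize = 0;
        for j in (1..1_000_001).rev() {
            result += j;
        }
        black_box(result)
    });
}

Run it with cargo +nightly bench.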


#10

This is a good opportunity to remind people that for loops over iterators are sometimes slower (and never faster) than using internal iteration methods on those iterators. (Usual link: https://medium.com/@veedrac/rust-is-slow-and-i-am-the-cure-32facc0fdcb)

A quick demo: https://play.rust-lang.org/?gist=925fdc4295261b073c391297e2d3cf66&version=nightly&mode=release

fn test_iter() -> impl Iterator<Item=u32> {
    (0..100).chain(200..300)
}

pub fn sum_iterator_method() -> u32 {
    test_iter().sum()
}

pub fn sum_for_loop() -> u32 {
    let mut sum = 0;
    for x in test_iter() {
        sum += x;
    }
    sum
}

LLVM can completely remove all the loops in the iterator method:

playground::sum_iterator_method:
	mov	eax, 29900
	ret

But with the for loop it cannot:

playground::sum_for_loop:
	mov	edi, 200
	xor	ecx, ecx
	xor	esi, esi
	xor	eax, eax
	mov	edx, esi
	and	dl, 3
	cmp	dl, 1
	jne	.LBB2_2
	jmp	.LBB2_6

.LBB2_1:
	add	eax, ecx
	lea	ecx, [rcx + 1]
	mov	edx, esi
	and	dl, 3
	cmp	dl, 1
	je	.LBB2_6

.LBB2_2:
	cmp	dl, 2
	jne	.LBB2_7
	cmp	edi, 299
	ja	.LBB2_10
	add	eax, edi
	lea	edx, [rdi + 1]
	jmp	.LBB2_5

.LBB2_7:
	cmp	ecx, 100
	jb	.LBB2_1
	cmp	edi, 299
	ja	.LBB2_10
	add	eax, edi
	lea	edx, [rdi + 1]
	mov	sil, 2

.LBB2_5:
	mov	edi, edx
	mov	edx, esi
	and	dl, 3
	cmp	dl, 1
	jne	.LBB2_2

.LBB2_6:
	cmp	ecx, 99
	jbe	.LBB2_1

.LBB2_10:
	ret

TL;DR: If your loop body is very simple, consider phrasing it as fold/try_fold/for_each/etc. instead. It won’t be slower, and it might be faster, depending on the iterator type.
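
For instance, sum_for_loop above could be phrased as a fold (a hypothetical sum_fold, reusing test_iter from the demo):

pub fn sum_fold() -> u32 {
    // Internal iteration: Chain implements fold by running each half as
    // its own tight loop, which LLVM can optimize the same way it
    // optimizes sum_iterator_method above.
    test_iter().fold(0, |sum, x| sum + x)
}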

