Performance difference among Range, RangeInclusive and reversed ranges

I ran this simple code:

  use std::time::Instant;

  #[test]
  fn play() {
    let now = Instant::now();
    let mut result: usize = 0;
    for j in (1..1_000_000 + 1).rev() {
      result += j;
    }
    println!("{}", result);
    println!("{:?}", now.elapsed());

    let now = Instant::now();
    let mut result: usize = 0;
    for j in (1..=1_000_000).rev() {
      result += j;
    }
    println!("{}", result);
    println!("{:?}", now.elapsed());

    let now = Instant::now();
    let mut result: usize = 0;
    for j in 1..1_000_001 {
      result += j;
    }
    println!("{}", result);
    println!("{:?}", now.elapsed());

    let now = Instant::now();
    let mut result: usize = 0;
    for j in 1..=1_000_000 {
      result += j;
    }
    println!("{}", result);
    println!("{:?}", now.elapsed());
  }

and got the output:

running 1 test
500000500000
23.868379ms
500000500000
53.537635ms
500000500000
54.150509ms
500000500000
51.44193ms

Why is the reversed Range (exclusive) version faster than the others?

Depends on who you ask.

   Compiling playground v0.0.1 (file:///playground)
    Finished release [optimized] target(s) in 1.32s
     Running `target/release/playground`

Standard Output

500000500000
28.772µs
500000500000
102.534µs
500000500000
1.828µs
500000500000
1.008266ms

(That was on the playpen, by the way).

Oh, I forgot to add --release when running the test! After running:

cargo test play --release -- --nocapture

the result is:

running 1 test
500000500000
8.832µs
500000500000
1.879µs
500000500000
1.777µs
500000500000
1.754µs

The first one is now the slowest!

As usual with benchmarks, it’s useful to verify that you are measuring what you think you are measuring. LLVM has no trouble recognizing the arithmetic progression and using Gauss’s trick to evaluate it at compile time, so in the assembly output there are no loops at all, only the constant 500000500000.
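
For reference, the closed form LLVM finds is Gauss’s formula, sum(1..=n) = n·(n+1)/2. A quick sanity check (my own snippet, not part of the measurement):

fn gauss(n: u64) -> u64 {
    // Sum of 1..=n in closed form; no loop needed.
    n * (n + 1) / 2
}

fn main() {
    // Matches the constant LLVM folds all four loops into.
    assert_eq!(gauss(1_000_000), 500_000_500_000);
}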

So the time here is the syscall time for writing to stdout, and the microsecond range is reasonable for syscalls. (Does anyone know of more modern measurements than https://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html ?)

EDIT: apparently I am an idiot, and the assembly does contain loops for the two inclusive cycles?

Hm, so stable Rust indeed fails to optimize the inclusive loops, but it’s all OK on nightly.

I’ve used the following program:

pub fn main() {
    let mut result: usize = 0;
    for j in (1..1_000_000 + 1).rev() {
        result += j;
    }
    bb(result);

    let mut result: usize = 0;
    for j in (1..=1_000_000).rev() {
        result += j;
    }
    bb(result);

    let mut result: usize = 0;
    for j in 1..1_000_001 {
        result += j;
    }
    bb(result);

    let mut result: usize = 0;
    for j in 1..=1_000_000 {
        result += j;
    }
    bb(result);
}

#[inline(never)]
fn bb(r: usize) {
    println!("{}", r);
}

And the following command-line to get the optimized ir:

 rustc +nightly -C opt-level=3 --emit=llvm-ir main.rs

We see everything evaluated at compile time in the result:

λ rg 'main::main' -A 13 main.ll
59:; main::main
60-; Function Attrs: uwtable
61-define internal void @_ZN4main4main17hd31e54a560abace9E() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
62-start:
63-; call main::bb
64-  tail call fastcc void @_ZN4main2bb17h7162672b47b28d02E(i64 500000500000)
65-; call main::bb
66-  tail call fastcc void @_ZN4main2bb17h7162672b47b28d02E(i64 500000500000)
67-; call main::bb
68-  tail call fastcc void @_ZN4main2bb17h7162672b47b28d02E(i64 500000500000)
69-; call main::bb
70-  tail call fastcc void @_ZN4main2bb17h7162672b47b28d02E(i64 500000500000)
71-  ret void
72-}

To elaborate on this a bit: this does not mean that you should immediately start peering into the assembly output. A very simple but very powerful tool for verifying that the measurements themselves are correct is to vary the size of the input and check that the timings vary according to the predicted asymptotic complexity. In this case, check that increasing the number 10 times leads to a 10-fold increase in execution time. This helps to catch cases where you don’t measure the relevant thing at all, or where the constant overhead of the measurement dwarfs the time of the algorithm itself.
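
A minimal sketch of that check (the sizes and layout are my own, not from the thread):

use std::time::Instant;

fn main() {
    // If we're really measuring the loop, each 10x larger input
    // should take roughly 10x longer. (In a real check you'd also
    // feed `result` to a black box so the optimizer can't fold the
    // loop away entirely.)
    for n in [100_000usize, 1_000_000, 10_000_000] {
        let now = Instant::now();
        let mut result: usize = 0;
        for j in 1..=n {
            result += j;
        }
        println!("n = {:>8}: sum = {}, took {:?}", n, result, now.elapsed());
    }
}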

https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/ was recently written (and Eli is a reputable source :))

ohhh, I’ve wanted exactly this blog post for about two years I think, thank you so much for the link!

I’d recommend using Bencher (#[bench]) for such tests, because it ensures the tests are run long enough to be meaningful. A single run measured with Instant::now() may be too noisy.
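
A sketch of what one of those loops looks like under Bencher (nightly only, since #[bench] is unstable; the function name is mine):

#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

#[bench]
fn bench_rev_exclusive(b: &mut Bencher) {
    // Bencher re-runs the closure until the timing stabilizes;
    // black_box keeps LLVM from const-folding the sum away.
    b.iter(|| {
        let mut result: usize = 0;
        for j in (1..1_000_001).rev() {
            result += j;
        }
        black_box(result)
    });
}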

This is a good opportunity to remind people that for loops over iterators are sometimes slower (and never faster) than using internal iteration methods on those iterators. (Usual link: https://medium.com/@veedrac/rust-is-slow-and-i-am-the-cure-32facc0fdcb)

A quick demo: https://play.rust-lang.org/?gist=925fdc4295261b073c391297e2d3cf66&version=nightly&mode=release

fn test_iter() -> impl Iterator<Item=u32> {
    (0..100).chain(200..300)
}

pub fn sum_iterator_method() -> u32 {
    test_iter().sum()
}

pub fn sum_for_loop() -> u32 {
    let mut sum = 0;
    for x in test_iter() {
        sum += x;
    }
    sum
}

LLVM can completely remove all the loops in the iterator method:

playground::sum_iterator_method:
	mov	eax, 29900
	ret

But with the for loop it cannot:

playground::sum_for_loop:
	mov	edi, 200
	xor	ecx, ecx
	xor	esi, esi
	xor	eax, eax
	mov	edx, esi
	and	dl, 3
	cmp	dl, 1
	jne	.LBB2_2
	jmp	.LBB2_6

.LBB2_1:
	add	eax, ecx
	lea	ecx, [rcx + 1]
	mov	edx, esi
	and	dl, 3
	cmp	dl, 1
	je	.LBB2_6

.LBB2_2:
	cmp	dl, 2
	jne	.LBB2_7
	cmp	edi, 299
	ja	.LBB2_10
	add	eax, edi
	lea	edx, [rdi + 1]
	jmp	.LBB2_5

.LBB2_7:
	cmp	ecx, 100
	jb	.LBB2_1
	cmp	edi, 299
	ja	.LBB2_10
	add	eax, edi
	lea	edx, [rdi + 1]
	mov	sil, 2

.LBB2_5:
	mov	edi, edx
	mov	edx, esi
	and	dl, 3
	cmp	dl, 1
	jne	.LBB2_2

.LBB2_6:
	cmp	ecx, 99
	jbe	.LBB2_1

.LBB2_10:
	ret

TL;DR: If your loop body is very simple, consider phrasing it as fold/try_fold/for_each/etc. instead. It won’t be slower, and it might be faster, depending on the iterator type.
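
For the demo above, a fold-based version of sum_for_loop (my sketch) would be something like:

pub fn sum_fold() -> u32 {
    // fold drives the iterator from the inside, so Chain can run its
    // two halves as two plain loops instead of re-checking which half
    // it's in on every iteration.
    test_iter().fold(0, |sum, x| sum + x)
}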

Sorry to revive this post as it is more than 3 years old, but is it still true that RangeInclusive is not correctly optimized compared to Range?

It is certainly correctly optimized, but the optimizations may be worse.

So is it still advised to avoid inclusive ranges, then?

In general, you should use half-open ranges (aka Range), because all else being equal they're the best, as has been known since 1982: E.W. Dijkstra Archive: Why numbering should start at zero (EWD 831)
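
One concrete advantage: half-open ranges split cleanly at any midpoint, with no overlap and no gap (my illustration, not from the original reply):

fn main() {
    // 0..10 is exactly 0..5 followed by 5..10.
    assert!((0..5).chain(5..10).eq(0..10));
}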

If you really need inclusive ranges, then use them, but remember that they're fundamentally more complicated -- they have to worry about both the endpoint of the range and the potential for wrapping. So because they're more complicated, you shouldn't expect them to be as fast in every situation.
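
The wrapping concern is easiest to see at a type's upper boundary (again my example, not the poster's):

fn main() {
    // A half-open range over every u8 can't even be written, since the
    // end bound 256 doesn't fit in u8. The inclusive form handles it,
    // but only because RangeInclusive tracks an extra "exhausted" flag
    // instead of comparing against a one-past-the-end value.
    assert_eq!((0u8..=255).count(), 256);
}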

And as my post above says, if you have a tiny loop body where a few instructions of overhead might matter, use internal iteration instead of for loops. It's not just RangeInclusive where that helps, but also a bunch of iterator adapters, Chain being the most common example.

But if you're doing material work in the loop body, it just doesn't matter. Any difference between .. and ..= will completely disappear compared to the cost of reading a file, for example.

I would advise writing code which is easy to maintain first, and worrying about micro-optimisations second.

Outside of super-high-performance applications, the performance difference is probably going to be negligible compared to things like disk access or the load on the rest of your system. An off-by-one error will give you a lot more headaches than some missed optimisations.

Thanks @scottmcm and @Michael-F-Bryan! I was asking because I'd seen some in a code review and remembered something about them :) In the end, they are in a match, so I don't think it's the same issue as here (iteration).

Yeah, ranges in match patterns use the same syntax, but they're entirely different from the RangeInclusive type used for iteration.
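
In other words, something like this compiles down to plain comparisons, and no RangeInclusive value is ever constructed (illustrative snippet):

fn classify(x: u32) -> &'static str {
    // `0..=9` here is a pattern, not a RangeInclusive value: it
    // becomes integer comparisons, the same as `x <= 9` would.
    match x {
        0..=9 => "single digit",
        _ => "larger",
    }
}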

The rule is: forget about micro-optimizations and short-lived data structures. Think long and hard about APIs and long-lived data structures.

Compilers are very good at micro-optimizations, and even when they aren't, it's easy to go back and fix these things later.

But if a range is embedded in your public API… if it affects not 100 lines of your own code but thousands (if not millions) of lines of code written by others… then it's time to fight for every nanosecond.

Both because the effect is most likely multiplied by those thousands or millions, and because it's much harder to change such things. And also because compilers can't fix anything for you in such cases either (and probably won't be able to for decades yet).

Correct. In a pattern, or when using the contains method, there's no material difference. It's only when iterating that perf differences can sometimes show up.
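
For example, contains just compares against the two bounds and never iterates (my snippet):

fn main() {
    // Two comparisons on the bounds; the range is never iterated, so
    // the RangeInclusive iteration overhead doesn't apply here.
    assert!((1..=10).contains(&5));
}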
