Performance difference among Range, RangeInclusive and reversed ranges

I ran this simple code:

  use std::time::Instant;

  #[test]
  fn play() {
    let now = Instant::now();
    let mut result: usize = 0;
    for j in (1..1_000_000 + 1).rev() {
      result += j;
    }
    println!("{}", result);
    println!("{:?}", now.elapsed());

    let now = Instant::now();
    let mut result: usize = 0;
    for j in (1..=1_000_000).rev() {
      result += j;
    }
    println!("{}", result);
    println!("{:?}", now.elapsed());

    let now = Instant::now();
    let mut result: usize = 0;
    for j in 1..1_000_001 {
      result += j;
    }
    println!("{}", result);
    println!("{:?}", now.elapsed());

    let now = Instant::now();
    let mut result: usize = 0;
    for j in 1..=1_000_000 {
      result += j;
    }
    println!("{}", result);
    println!("{:?}", now.elapsed());
  }

and got the output:

running 1 test
500000500000
23.868379ms
500000500000
53.537635ms
500000500000
54.150509ms
500000500000
51.44193ms

Why is the reversed Range (exclusive) version faster than the others?

Depends on who you ask.

   Compiling playground v0.0.1 (file:///playground)
    Finished release [optimized] target(s) in 1.32s
     Running `target/release/playground`

Standard Output

500000500000
28.772µs
500000500000
102.534µs
500000500000
1.828µs
500000500000
1.008266ms

(That was on the playpen, by the way).

Oh, I forgot to add --release when running the test! After running:

cargo test play --release -- --nocapture

the result is:

running 1 test
500000500000
8.832µs
500000500000
1.879µs
500000500000
1.777µs
500000500000
1.754µs

The first one is now the slowest!

As usual with benchmarks, it’s useful to verify that you are measuring what you think you are measuring. LLVM has no trouble recognizing the arithmetic progression and using Gauss’s trick to evaluate it at compile time, so in the assembly output there are no loops at all, only the constant 500000500000.
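
For reference, the closed form LLVM finds is Gauss’s formula, sum(1..=n) = n·(n+1)/2. A quick sanity check (my own snippet, not part of the measurement):

fn gauss(n: u64) -> u64 {
    // Sum of 1..=n in closed form; no loop needed.
    n * (n + 1) / 2
}

fn main() {
    // Matches the constant LLVM folds all four loops into.
    assert_eq!(gauss(1_000_000), 500_000_500_000);
}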

So the time here is the syscall time for writing to stdout, and the microsecond range is reasonable for syscalls. (Does anyone know of more modern measurements than https://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html ?)

EDIT: apparently I am an idiot, and the assembly does contain loops for the two inclusive cycles?

Hm, so stable Rust indeed fails to optimize the inclusive loops, but it’s all OK on nightly.

I’ve used the following program:

pub fn main() {
    let mut result: usize = 0;
    for j in (1..1_000_000 + 1).rev() {
        result += j;
    }
    bb(result);

    let mut result: usize = 0;
    for j in (1..=1_000_000).rev() {
        result += j;
    }
    bb(result);

    let mut result: usize = 0;
    for j in 1..1_000_001 {
        result += j;
    }
    bb(result);

    let mut result: usize = 0;
    for j in 1..=1_000_000 {
        result += j;
    }
    bb(result);
}

#[inline(never)]
fn bb(r: usize) {
    println!("{}", r);
}

And the following command-line to get the optimized ir:

 rustc +nightly -C opt-level=3 --emit=llvm-ir main.rs

We see everything evaluated at compile time in the result:

λ rg 'main::main' -A 13 main.ll
59:; main::main
60-; Function Attrs: uwtable
61-define internal void @_ZN4main4main17hd31e54a560abace9E() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
62-start:
63-; call main::bb
64-  tail call fastcc void @_ZN4main2bb17h7162672b47b28d02E(i64 500000500000)
65-; call main::bb
66-  tail call fastcc void @_ZN4main2bb17h7162672b47b28d02E(i64 500000500000)
67-; call main::bb
68-  tail call fastcc void @_ZN4main2bb17h7162672b47b28d02E(i64 500000500000)
69-; call main::bb
70-  tail call fastcc void @_ZN4main2bb17h7162672b47b28d02E(i64 500000500000)
71-  ret void
72-}

To elaborate on this a bit: this does not mean that you should immediately start peering into the assembly output. A very simple but very powerful tool for verifying that the measurements themselves are correct is to vary the size of the input and check that the timings vary according to the predicted asymptotic complexity. In this case, check that increasing the number 10 times leads to a 10-fold increase in execution time. This helps to catch cases where you don’t measure the relevant thing at all, or where the constant overhead of the measurement dwarfs the time of the algorithm itself.
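
A minimal sketch of that check (the sizes and layout are my own, not from the thread):

use std::time::Instant;

fn main() {
    // If we're really measuring the loop, each 10x larger input
    // should take roughly 10x longer. (In a real check you'd also
    // feed `result` to a black box so the optimizer can't fold the
    // loop away entirely.)
    for n in [100_000usize, 1_000_000, 10_000_000] {
        let now = Instant::now();
        let mut result: usize = 0;
        for j in 1..=n {
            result += j;
        }
        println!("n = {:>8}: sum = {}, took {:?}", n, result, now.elapsed());
    }
}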

https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/ was recently written (and Eli is a reputable source :))

ohhh, I’ve wanted exactly this blog post for about two years I think, thank you so much for the link!

I’d recommend using Bencher (#[bench]) for such tests, because it ensures the tests are run long enough to be meaningful. A single run measured with Instant::now() may be too noisy.
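
A sketch of what one of those loops looks like under Bencher (nightly only, since #[bench] is unstable; the function name is mine):

#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

#[bench]
fn bench_rev_exclusive(b: &mut Bencher) {
    // Bencher re-runs the closure until the timing stabilizes;
    // black_box keeps LLVM from const-folding the sum away.
    b.iter(|| {
        let mut result: usize = 0;
        for j in (1..1_000_001).rev() {
            result += j;
        }
        black_box(result)
    });
}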

This is a good opportunity to remind people that for loops over iterators are sometimes slower (and never faster) than using internal iteration methods on those iterators. (Usual link: https://medium.com/@veedrac/rust-is-slow-and-i-am-the-cure-32facc0fdcb)

A quick demo: https://play.rust-lang.org/?gist=925fdc4295261b073c391297e2d3cf66&version=nightly&mode=release

fn test_iter() -> impl Iterator<Item=u32> {
    (0..100).chain(200..300)
}

pub fn sum_iterator_method() -> u32 {
    test_iter().sum()
}

pub fn sum_for_loop() -> u32 {
    let mut sum = 0;
    for x in test_iter() {
        sum += x;
    }
    sum
}

LLVM can completely remove all the loops in the iterator method:

playground::sum_iterator_method:
	mov	eax, 29900
	ret

But with the for loop it cannot:

playground::sum_for_loop:
	mov	edi, 200
	xor	ecx, ecx
	xor	esi, esi
	xor	eax, eax
	mov	edx, esi
	and	dl, 3
	cmp	dl, 1
	jne	.LBB2_2
	jmp	.LBB2_6

.LBB2_1:
	add	eax, ecx
	lea	ecx, [rcx + 1]
	mov	edx, esi
	and	dl, 3
	cmp	dl, 1
	je	.LBB2_6

.LBB2_2:
	cmp	dl, 2
	jne	.LBB2_7
	cmp	edi, 299
	ja	.LBB2_10
	add	eax, edi
	lea	edx, [rdi + 1]
	jmp	.LBB2_5

.LBB2_7:
	cmp	ecx, 100
	jb	.LBB2_1
	cmp	edi, 299
	ja	.LBB2_10
	add	eax, edi
	lea	edx, [rdi + 1]
	mov	sil, 2

.LBB2_5:
	mov	edi, edx
	mov	edx, esi
	and	dl, 3
	cmp	dl, 1
	jne	.LBB2_2

.LBB2_6:
	cmp	ecx, 99
	jbe	.LBB2_1

.LBB2_10:
	ret

TL;DR: If your loop body is very simple, consider phrasing it as fold/try_fold/for_each/etc. instead. It won’t be slower, and it might be faster, depending on the iterator type.
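
For the demo above, a fold-based version of sum_for_loop (my sketch) would be something like:

pub fn sum_fold() -> u32 {
    // fold drives the iterator from the inside, so Chain can run its
    // two halves as two plain loops instead of re-checking which half
    // it's in on every iteration.
    test_iter().fold(0, |sum, x| sum + x)
}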

Sorry to revive this post as it is more than 3 years old, but is it still true that RangeInclusive is not correctly optimized compared to Range?

It is certainly correctly optimized, but the optimizations may be worse.

So is it still advised to avoid inclusive ranges, then?

In general, you should use half-open ranges (aka Range), because all else being equal they're the best, as has been known since 1982: E.W. Dijkstra Archive: Why numbering should start at zero (EWD 831)
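
One concrete advantage: half-open ranges split cleanly at any midpoint, with no overlap and no gap (my illustration, not from the original reply):

fn main() {
    // 0..10 is exactly 0..5 followed by 5..10.
    assert!((0..5).chain(5..10).eq(0..10));
}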

If you really need inclusive ranges, then use them, but remember that they're fundamentally more complicated -- they have to worry about both the endpoint of the range and the potential for wrapping. So because they're more complicated, you shouldn't expect them to be as fast in every situation.
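
The wrapping concern is easiest to see at a type's upper boundary (again my example, not the poster's):

fn main() {
    // A half-open range over every u8 can't even be written, since the
    // end bound 256 doesn't fit in u8. The inclusive form handles it,
    // but only because RangeInclusive tracks an extra "exhausted" flag
    // instead of comparing against a one-past-the-end value.
    assert_eq!((0u8..=255).count(), 256);
}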

And as my post above says, if you have a tiny loop body where a few instructions of overhead might matter, use internal iteration instead of for loops. It's not just RangeInclusive where that helps, but also a bunch of iterator adapters, Chain being the most common example.

But if you're doing material work in the loop body, it just doesn't matter. Any difference between .. and ..= will completely disappear compared to the cost of reading a file, for example.

I would advise writing code which is easy to maintain first, and worrying about micro-optimisations second.

Outside of super-high-performance applications, the performance difference is probably going to be negligible compared to things like disk access or the load on the rest of your system. An off-by-one error will give you a lot more headaches than some missed optimisations.

Thanks @scottmcm and @Michael-F-Bryan! I was asking because I'd seen some in a code review and remembered something about them :) In the end, they are in a match, so I don't think it's the same issue as here (iteration).

Yeah, ranges in match patterns use the same syntax, but they're entirely different from the RangeInclusive type used for iteration.
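
In other words, something like this compiles down to plain comparisons, and no RangeInclusive value is ever constructed (illustrative snippet):

fn classify(x: u32) -> &'static str {
    // `0..=9` here is a pattern, not a RangeInclusive value: it
    // becomes integer comparisons, the same as `x <= 9` would.
    match x {
        0..=9 => "single digit",
        _ => "larger",
    }
}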

The rule is: forget about micro-optimizations and short-lived data structures. Think long and hard about APIs and long-lived data structures.

Compilers are very good at micro-optimizations, and even when they aren't, it's easy to go back and fix these things later.

But if a range is embedded in your public API… if it affects not 100 lines of your own code but thousands (if not millions) of lines of code written by others… then it's time to fight for every nanosecond.

Both because the effect is most likely multiplied by those thousands or millions, and because it's much harder to change such things. And also because compilers can't fix anything for you in such cases either (and probably won't be able to for decades yet).

Correct. In a pattern, or when using the contains method, there's no material difference. It's only when iterating that perf differences can sometimes show up.
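
For example, contains just compares against the two bounds and never iterates (my snippet):

fn main() {
    // Two comparisons on the bounds; the range is never iterated, so
    // the RangeInclusive iteration overhead doesn't apply here.
    assert!((1..=10).contains(&5));
}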
