Why does Iterator::fold sometimes optimize to a closed-form formula, but a plain for-loop doesn't?

I am not good at English. Sorry if there are any funny expressions.

I've been experimenting with Rust optimizations and noticed an interesting behavior: The fold version is optimized into the closed-form formula n*(n+1)/2.

this is the for loop version:

pub fn sum(n: i32) -> i32 {
    let mut s = 0;
    for i in 1..=n{
	s += i;
	
    }
    s
}

and i use cargo build --release and cargo asm crate::sum to generate the asm

 test    edi, edi
 jle     .LBB0_1
 mov     ecx, 1
 xor     eax, eax
.LBB0_3:
 xor     edx, edx
 cmp     ecx, edi
 setl    sil
 add     eax, ecx
 cmp     ecx, edi
 jge     .LBB0_5
 mov     dl, sil
 add     ecx, edx
 cmp     ecx, edi
 jle     .LBB0_3
.LBB0_5:
 ret
.LBB0_1:
 xor     eax, eax
 ret

this is the fold version:

pub fn sum(n: i32) -> i32 {
    (1..=n).fold(0,|sum,item| sum + item )

}

this is the asm of fold version

 xor     eax, eax
 test    edi, edi
 jle     .LBB1_4
 cmp     edi, 1
 je      .LBB1_3
 lea     eax, [rdi, -, 2]
 lea     ecx, [rdi, -, 3]
 imul    rcx, rax
 shr     rcx
 lea     eax, [rcx, +, 2*rdi]
 add     eax, -3
.LBB1_3:
 add     eax, edi
.LBB1_4:
 ret

My questions are:

  1. Is there something in Rust's compiler (rustc) that specifically recognizes Iterator::fold or Iterator::sum and applies special-case optimizations for ranges?

  2. Or is this purely due to LLVM's backend optimizations (like induction variable analysis)?

  3. Where in the Rust compiler source code or standard library would such an optimization (if it exists) be implemented?

I’d love to understand why iterators sometimes get optimized to closed-form formulas, but plain loops often don’t.

Thanks in advance for any insights!

If instead of an inclusive range 1..=n you use an half-open range 1..(n+1), then also the for-loop version gets optimized with the closed-form formula. See example on Compiler Explorer.

In general, inclusive ranges tend to produce less optimized code.

See also this older but still relevant thread: Performance difference among Range, RangeInclusive and reversed

6 Likes

LLVM definitely have such optimizations and they don't work with closed ranges, because closed ranges are not popular in C.

1 Like

Inclusive ranges (..=) have a long-standing issue with poor optimization. See Big performance problem with closed intervals looping · Issue #45222 · rust-lang/rust · GitHub

The reason this happens is because inclusive ranges need to have a special check for the case where you have something like 1..=i32::MAX, which cannot be expressed with an exclusive range.

With a for loop, rust repeatedly calls .next() on the iterator. Each time, it has to check if the current iteration is the special case of the last iteration. With .fold() for .for_each(), the iterator can loop over 1..n, and then separately do the n iteration one last time, which is easier for LLVM to optimize. This is implemented at range.rs - source

17 Likes

thanks for you all !! :smiley:

The case of "I see you're doing a 1+2+…+N sum, so I'll emit the closed-form" is entirely LLVM. It's just a matter of whether the rust code loop gets translated to one of the forms that LLVM recognizes for it.

2 Likes

I realize that often optimizations work together in surprising ways, but this one specifically really seems like it's just trying to mess up benchmarks :grinning_face_with_smiling_eyes:

1 Like

Clang/LLVM happily optimizes

int res = 0;
for(int i = 0; i <= n; i++) {
    res += i;
}

just like the i < n version. However, the non-range Rust version

let mut res = 0;
let mut i = 0;
while i <= n {
    res += i;
    i += 1;
}

is not optimized to the closed form, even though the i < n version is (godbolt). Which is probably because for whatever reason rustc seems to emit very different LLVM IR in the two cases.

The while i <= n version still produces much better code than the for 0..=n variant, which is pessimized by the need to handle the n == T::MAX edge case.

2 Likes

Couldn't the for case in theory be optimised before LLVM sees it as a for 0..n plus one extra unrolled iteration afterwards? This could be done on MIR or some other high level in Rust where it has the required info. No idea if this would actually be better on average (code density goes down), but it seems worth exploring.

2 Likes

I guess the big question is how often these thing happen in practice. Going from 1 to N is unnatural, numbering should go from zero to N-1. Nothing to do with Rust or optimizations, Edsger Dijkstra wrote about that more than 40 years ago.

As such you would mostly see closed intervals in benchmark contests and not in real code. It's entirely unclear if Rust should optimize for benchmarks.

1 Like

It probably helps that the C version has undefined behavior when n == INT_MAX, so the C compiler can simply assume that case never happens. The closed form is incorrect in the Rust version, because of that specific case.

4 Likes

Yesterday I wrote a program to generate a randomised speaker order for a group of people. I used a 1..N range together with zip for labelling when formatting the list. Not standalone like here, but such ranges do happen when interacting with humans.

Sure. But the principle is the same as with UTF-8: if you deal with anything else you convert from some other, unnatural form (be it UTF-16 or ISO-2022) to UTF-8 ASAP and convert back to that form when you show the result to humans.

Humans are not fast, they couldn't generate or read enough numbers to overwhelm very slow CPU — and all intermediate calculations where you would worry about speed wouldn't use that form.

1 Like

Well yeah, because it's potentially an infinite loop.

3 Likes

The C version has undefined behavior in the case where i overflows. To make the rust code exactly equivalent, you need to do this:

let mut i = 0;
let mut res = 0;
while i <= n {
    res += i;
    i = unsafe { i.unchecked_add(1) };
}
res

This gets optimized into a closed form.

5 Likes

Note that even if you would use -fwrapv it would still include an undefined behavior, just a different one. C/C++ have a forward progress rule that says that any loop that have no side effects have to finish, at some point, which allows optimizer to reason that overflow never happens in C code even when unsigned is involved. But Rust doesn't have a forward progress rule because it would be an UB in safe code. Which means that Rust version can, potentially, loop forever.

4 Likes