Compiler reasoning around a modulo counter

I have a modulo counter -- basically this:

// It doesn't matter that first 0 is skipped
fn next_idx(ctx: &mut Ctx) -> usize {
  ctx.idx = (ctx.idx + 1) % MAX_IDX;
  ctx.idx
}

This works fine and it compiles down to what one expects; using a power-of-two MAX_IDX allows it to, on x86, reduce the core modulo counting down to an inc and an and.

But I stumbled on cycle() and played around with this:

struct Ctx {
  // initialized to (0..MAX_IDX).cycle()
  idxgen: iter::Cyclic<ops::Range<usize>>
}

fn next_idx(ctx: &mut Ctx) -> usize {
  ctx.idxgen.next().unwrap()
}

This compiles down to a fair bit more code. My naive hope, when I first saw cycle() was that the compiler would discover that it's a modulo counter, and deduce that the Option's None case (and thus its unwrap()) is dead code. I assume that the reason this isn't the case is that there's no compile-time information about the range (other than its type, of couse) -- meaning it can't assume that the start and end will not change during runtime.

Is there some way to constify the Range, such that the compiler could understand that it's an infinite iterator which will never yield None?

I took a few random stabs at it and the compiler very explicitly said that what I was trying to do wasn't allowed (and it didn't even say the usual "this is a nightly only feature"). What changes would be needed to allow the compiler to compile down the Cycle version down to the same compact inc/and code?

No XY apart from random curiosity.

An important general rule is: the optimizer does not get to redesign data structures, except by eliminating them.

Range<usize> and Cycle<Range<usize>> are more complex data structures than a single usize, so the code to operate on them will always be more complex — unless the compiler is able to optimize away the entire data structure because it exists inside of a single stack frame with only inlined functions acting on it, so it can be replaced with equivalent local variables (“scalar replacement of aggregates”) that can then be simplified further.

const is irrelevant here unless you start using compile-time calculations to actually pick the data structure layout (using const generic parameters and trait associated types) and achieve the optimization using explicit logic.

compiler doesn't "understand" the code, in the sense that it doesn't analyze the code's intention. it does pattern recognition instead. although modern optimization technology is amazing, if your code does not contain the common patterns the optimizer is based on, it can't magically peel off layers of high level abstraction and "simplify" the code like a human.

With "this" do you mean the code you just showed, or a loop that initialized the cycle with a range from 0 to a power of two and iterates over it?

In other words, next_idx is too generic to allow that optimization, as it must also work for other cases.

Hm. I realize I should have asked my question from the other end instead.

If the rust compiler had compiled down the cycle() version down to the same final machine code as the first one, what would have needed to be different in the standard library and/or language to make that possible?

I realize that the const MAX_IDX helps the compiler out a fair amount, which isn't possible in the Cycle<Range<usize>> case (because the compiler can't assume that the range can't change). But say there was a ConstRange<MIN, MAX> instead -- would that be sufficient to reach the same final code as in the first example?

There's some lore behind this question: Some of the dinosaurs on here have said that iterators and adapter chains often compile down to the same code as you'd get if you write the code using an explicit for loop. Over the years I've tested this theory from time to time using cargo asm and godbolt, and truth be told, I have been genuinely surprised at how often this holds true. When I encountered this modulo counter case, I got curious about what would need to be different (but still allow for (roughly) the same source code (specifically, using Iterator)) for the compiler to arrive at the same final code. Is there a possible future where using a cycle() (or some variant of it) on (0..MAX_IDX) could allow idxgen.next().unwrap() to end up generating the same code as the first example, possibly with other (minor) source code changes (like a hypothetical ConstRange).

I'm also curious where such analysis would occur. I know that optimizers run passes that try to reduce the code into forms where its easier to patter match against known optimizations. But these iterator/adapter chains that look somewhat complex but still end up yielding highly compact machine code -- is that basically all LLVM, or is it up to the rust compiler to reduce it down to something more manageable for LLVM first?

... I should probably take this to internals..

Technically I don't see why the compiler shouldn't be able to deduce that Range::end is always MAX_IDX when all the functions that have access to the private member of the struct set it to MAX_IDX. Even if it doesn't eliminate the memory for the field, it could still optimize code based on that knowledge.

well, it is already very close.

the difference you see in your original observation is mainly the difference in the data structure, as explained by @kpreid, the optimizer cannot modify user's data definition except eliminating the struct entirely.

thus if you look at the next() function in isolation, the optimizer can never transform the Cycle<Range>::next() to be like next_idx(), because the Self type is differently shaped.

however, if you use the iterator in a loop, then the whole data structure can be eliminated, so you get a very clean loop, only a single comparison to MAX to reset the counter to 0, no unwrap(), no panic.

it didn't turn into bit masking operation as the manually written code, but instead a pair of compare and conditional move instructions. practically speaking, I think it's good enough.

not for the next() method in isolation, as explained already. when the iterator is consumed in a for loop? theoretically, it's possible, but I think it's unlikely to happen. currently, the following two form will not generate the same code, neither in rustc nor in clang.

// cycle:
let mut i = 0;
loop {
    f(i);
    i += 1;
    if i >= MAX {
        i = 0;
    }
}

// modulo
let mut i = 0;
loop {
    f(i);
    i += 1;
    i %= MAX
}

I believe it's mostly llvm.

the iterator struct might look complex, but if they are strictly local in a function, not in arguments or return type, a conceptually very simple pass called "sroa" (scalar replacement of aggregates) is applied, which enables further passes including constant folding and dead code elimination to easily do their job.

the "magic" power of an optimizer is essentially an emergent phenominon: each individual steps (passes) are simple and straightfoward in principle, but they can do incredible things when combined together.