Function compiles differently when an extra function is present

I was trying to determine whether a simple for loop is faster than the equivalent map. I used AoC 2023 day 01 as the workload and criterion to measure speed.

I was flabbergasted by the results:
Test 1: for loop: runtime X, taken as the 100% baseline.
Test 2: map: runs 10% slower (maybe that is expected).
Test 3: the same for loop runs 10% slower when an extra function using map is present.
It seems as if the for loop is automatically rewritten to the map logic.

You can see this in the godbolt output for the function 'process' when you remove the function 'process_map': it compiles to very different assembly.
In my simple mind there should be no difference between Test 1 and Test 3, as the exact same code is used.

Looking at the godbolt assembly output with compiler arguments -C opt-level=3, the same function compiles to different code if another function is present. The assembly for the for loop now contains map-related code even though the source does not use map.

Questions:
a) How is this possible?
b) Can I prevent it?

// This uses a simple for loop instead of map. The assembly does not contain 'map' as long as the function process_map is not part of the code.
pub fn process(input: &str) -> String {
    let mut result_sum: u32 = 0;
    for line in input.lines() {
        // find first digit in line
        let mut value: u8 = 0;
        for byte in line.bytes() {
            // assume ASCII
            if byte <= b'9' && byte >= b'1' {
                value = (byte - b'0') * 10;
                break;
            }
        }

        // find last digit in line
        // since digit was found, no more validation
        for byte in line.bytes().rev() {
            // assume ASCII
            if byte <= b'9' && byte >= b'1' {
                value += byte - b'0';
                break;
            }
        }

        result_sum += value as u32;
    }
    // dbg!(result_sum);

    result_sum.to_string()
}

pub fn process_map(input: &str) -> String {
    let result_sum: u32 = input.lines()
        .map(|line| {
            let mut value: u8 = 0;
            for byte in line.bytes() {
                // assume ASCII
                if byte <= b'9' && byte >= b'1' {
                    value = (byte - b'0') * 10;
                    break;
                }
            }

            // find last digit in line
            // since digit was found, no more validation
            for byte in line.bytes().rev() {
                // assume ASCII
                if byte <= b'9' && byte >= b'1' {
                    value += byte - b'0';
                    break;
                }
            }

            value as u32
        })
        .sum();
    // dbg!(result_sum);

    result_sum.to_string()
}

I'm not a Rust compiler dev, but str::lines() internally uses an iterator that first splits and then maps to strip the '\n'. So the performance penalty in Test 3 might come from that lines code not being inlined, but shared between process and process_map.

But this is just my dumb guess after a quick skim through the source code and the generated asm.
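
To make the split-then-map shape concrete, here is a rough sketch based on my reading of recent standard library sources. The name lines_sketch is made up, and the real implementation uses internal types rather than a closure, so treat it as an approximation only:

```rust
/// Approximation of `str::lines()`: split after every '\n', then strip the
/// line terminator from each piece. `lines_sketch` is a hypothetical name,
/// not something that exists in std.
pub fn lines_sketch<'a>(s: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    s.split_inclusive('\n').map(|line| {
        match line.strip_suffix('\n') {
            // A "\r" directly before the "\n" is part of the terminator too.
            Some(rest) => rest.strip_suffix('\r').unwrap_or(rest),
            // The last piece may have no trailing "\n" at all.
            None => line,
        }
    })
}
```

Both steps are ordinary generic iterator adapters, so for each build the optimizer decides whether they get inlined into a caller or emitted once and called from several places.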

Just a nitpicking hint!
You should write this:

value = (byte - 48) * 10;

... as

value = (byte - b'0') * 10;

The issue is that the function 'process' compiles differently when the function 'process_map' is present. There is no code change, but 10% more runtime just because another function is in the code base.

Both functions call some of the same standard library functions like str::lines. The decision to inline a function or not is also influenced by the number of call locations. If a function is only called in one location it is very likely that it will be inlined.

It seems like some internal split_inclusive function is inlined if process_map is not present.

b) Can I prevent it?

The compiler optimizes based on heuristics, and many optimizations such as inlining are tradeoffs whose benefit depends on the call pattern and input data. The heuristics typically change with each compiler release, and your code might perform worse because of a change that benefited most other users.
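
What you can influence from source is the inlining hint on your own functions. Below is a minimal sketch, assuming the per-line work is moved into a helper; line_value, process_with_helper and process_map_with_helper are made-up names. #[inline(always)] asks for a copy at every call site and #[inline(never)] asks for one shared copy. Both are only hints, neither reaches into std's lines code, and whether this changes the 10% difference has to be measured.

```rust
/// Hypothetical helper: first digit * 10 + last digit of one line, 0 if the
/// line has no digit. Like the original code it only accepts b'1'..=b'9'.
#[inline(always)] // hint: inline into every caller, however many there are
fn line_value(line: &str) -> u32 {
    fn is_digit(b: &u8) -> bool {
        (b'1'..=b'9').contains(b)
    }
    // Scan from the front for the first digit, from the back for the last.
    let first = line.bytes().find(is_digit).map_or(0, |b| b - b'0');
    let last = line.bytes().rfind(is_digit).map_or(0, |b| b - b'0');
    u32::from(first) * 10 + u32::from(last)
}

/// For-loop variant built on the shared helper.
pub fn process_with_helper(input: &str) -> String {
    let mut sum: u32 = 0;
    for line in input.lines() {
        sum += line_value(line);
    }
    sum.to_string()
}

/// Iterator variant built on the same helper.
pub fn process_map_with_helper(input: &str) -> String {
    input.lines().map(line_value).sum::<u32>().to_string()
}
```

Swapping the attribute for #[inline(never)] and comparing the assembly of both variants is a cheap way to see how much the inlining decision alone contributes.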

Generally, you should look into profile-guided optimization (PGO) to aid the compiler in such decisions.
