Printing a variable somehow improving performance?

I've been doing some testing and stumbled upon something interesting.

The following function takes around 1600ms to sort a vector of 100 000 elements:

fn sort(values: &mut [u32]) {
    let mut swaps: u64 = 0;
    let mut i = 1;
    let len = values.len();
    while i < len {
        let mut j = i;
        while j > 0 && values[j-1] > values[j] {
            swaps += 1;
            values.swap(j, j-1);
            j -= 1
        }
        i += 1
    }
    //println!("{swaps}")
}

But after uncommenting the print statement, the time drops to 1400ms:

fn sort(values: &mut [u32]) {
    let mut swaps: u64 = 0;
    let mut i = 1;
    let len = values.len();
    while i < len {
        let mut j = i;
        while j > 0 && values[j-1] > values[j] {
            swaps += 1;
            values.swap(j, j-1);
            j -= 1
        }
        i += 1
    }
    println!("{swaps}")
}

Any thoughts on what could be causing this discrepancy?


Hi, the print version actually needs more time; there is something tricky here. When you run code, the timing will never be "precise": it varies between runs. The simplest way to get a usable metric is to repeat the operation and take an average. This method isn't the best way to get rigorous measurements, but it helps to get the main idea:

use std::time::SystemTime;

fn get_data() -> Vec<u32> {
        (0..100_000).collect()
}

fn main() {
        let rep = 1000;

        let mut time1 = 0;
        let start = SystemTime::now();
        for _ in 0..rep {
                let mut data = get_data();
                sort(&mut data);
        }
        let end = SystemTime::now();
        time1 += end.duration_since(start).unwrap().as_micros() / rep;

        let mut time2 = 0;
        let start = SystemTime::now();
        for _ in 0..rep {
                let mut data = get_data();
                sort2(&mut data);
        }
        let end = SystemTime::now();
        time2 += end.duration_since(start).unwrap().as_micros() / rep;
        println!("Sort: it took {} micro-seconds", time1);
        println!("Sort with print: it took {} micro-seconds", time2);
}

fn sort(values: &mut [u32]) {
        let mut swaps: u64 = 0;
        let mut i = 1;
        let len = values.len();
        while i < len {
                let mut j = i;
                while j > 0 && values[j - 1] > values[j] {
                        swaps += 1;
                        values.swap(j, j - 1);
                        j -= 1
                }
                i += 1
        }
        //println!("{swaps}")
}

fn sort2(values: &mut [u32]) {
        let mut swaps: u64 = 0;
        let mut i = 1;
        let len = values.len();
        while i < len {
                let mut j = i;
                while j > 0 && values[j - 1] > values[j] {
                        swaps += 1;
                        values.swap(j, j - 1);
                        j -= 1
                }
                i += 1
        }
        println!("{swaps}")
}
Sort: it took 1415 micro-seconds
Sort with print: it took 3129 micro-seconds

So the print line is actually consuming more than 2x the time!

Remember, there are always a lot of things that affect how much time a run takes: other apps running, RAM and CPU usage, it all counts!


Your benchmark likely doesn’t measure exactly what you want, either. First off, the algorithm is O(n^2) on average, O(n) best case, which is if the slice is already sorted, which yours is. So it only goes through the list once to check that everything is sorted, vastly faster than for random input. It might not change the end result (that with print it’s slower) but it can vastly change (or essentially eliminate) the relative difference.
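On the already-sorted-input point, a shuffled input exercises the O(n^2) path without needing the `rand` crate. Here is a minimal sketch using a Fisher–Yates shuffle driven by a xorshift64 PRNG; the function names, seed, and sizes are my own choices, not something from the thread:

```rust
// Minimal xorshift64 PRNG; the shift constants (13, 7, 17) are the standard ones.
fn xorshift64(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

// Fisher–Yates shuffle, so the benchmark input is genuinely unsorted.
fn shuffled(n: u32, seed: u64) -> Vec<u32> {
    let mut v: Vec<u32> = (0..n).collect();
    let mut s = seed; // must be nonzero for xorshift
    for i in (1..v.len()).rev() {
        let j = (xorshift64(&mut s) % (i as u64 + 1)) as usize;
        v.swap(i, j);
    }
    v
}

fn main() {
    let data = shuffled(100_000, 0xdead_beef);
    println!("first five: {:?}", &data[..5]);
}
```

With input like this, the insertion sort actually does its quadratic work, so the timing difference (if any) between the two variants is measured on a meaningful workload.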

Secondly, the repetition loop is too repetitive; if the optimizer is very smart, it realizes that nothing changes by doing it more than once (or even once!) and removes the entire loop [1]. One should always use std::hint::black_box appropriately to prevent undesired optimizations.

Thirdly, the allocation and initialization of the input happens over and over again while the timer is running, which can take as much time as what you actually want to measure. Definitely in the case where the sort actually has nothing to do.


  1. Not sure if it can be that smart in this case, but it happens more often than you’d think! ↩︎
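A minimal sketch combining the second and third points: build every input before the timer starts, and route the data through `std::hint::black_box` (stable since Rust 1.66) so the optimizer can't prove the work dead. The repetition count and input size here are arbitrary choices of mine:

```rust
use std::hint::black_box;
use std::time::Instant;

// The insertion sort from the thread, minus the swap counter.
fn sort(values: &mut [u32]) {
    for i in 1..values.len() {
        let mut j = i;
        while j > 0 && values[j - 1] > values[j] {
            values.swap(j, j - 1);
            j -= 1;
        }
    }
}

fn main() {
    const REPS: usize = 10;
    // Build every input before the timer starts, so allocation and
    // initialization are not part of the measurement.
    let template: Vec<u32> = (0..2_000u32).rev().collect(); // reverse-sorted: worst case
    let mut inputs: Vec<Vec<u32>> = (0..REPS).map(|_| template.clone()).collect();

    let start = Instant::now();
    for data in &mut inputs {
        // black_box keeps the optimizer from deleting the loop body as dead code.
        sort(black_box(data));
    }
    let elapsed = start.elapsed();
    black_box(&inputs); // make sure the sorted results count as "used"

    println!("avg: {} µs", elapsed.as_micros() / REPS as u128);
}
```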


@latot, a couple of things I want to add.
This is the full code I am using (sort_data.csv is 100 000 values between 1 and 100 000):

use std::time::Instant;

fn main() {
    let values = include_str!("../resources/sort_data.csv").trim();

    let values: Vec<u32> = values.split(",").map(|val| val.parse().unwrap()).collect();
    println!("dataset size: {}", values.len());

    let mut total_time_ms = 0;

    const REPETITIONS: usize = 10;
    for _ in 0..REPETITIONS {
        let mut values = values.clone();

        let start = Instant::now();

        sort(&mut values);

        let time = Instant::now().duration_since(start);
        total_time_ms += time.as_millis();

        println!("sorted in: {}ms", time.as_millis())
    }
    println!("average time: {}ms, total: {}ms", total_time_ms / REPETITIONS as u128, total_time_ms)
}


fn sort(values: &mut [u32]) {
    let mut swaps: u64 = 0;
    let mut i = 1;
    let len = values.len();
    while i < len {
        let mut j = i;
        while j > 0 && values[j-1] > values[j] {
            swaps += 1;
            values.swap(j, j-1);
            j -= 1
        }
        i += 1
    }
    //println!("{swaps}")
}

So you can see I am doing something similar to what you did.

When it comes to run-to-run variance, it was fairly low, around 20ms, nowhere near the 200ms difference with the println.

As @jdahlstrom observed, sorting an already-sorted array kind of defeats the purpose.

And lastly, the behavior I described only happens on my 12th-gen Intel machine, not on my Zen 3 desktop (on it, the sort with println is slower by around 150ms).


If the behavior holds in general, it should also show up in the scenario I wrote above; if it doesn't, but it still happens for you, that means the behavior is restricted to some circumstances, like hardware or other factors.

Before digging deeper, I would run that code with 1000 repetitions; 10 is usually not enough to get a statistically meaningful average. I know 1000 is overkill, but I don't want to dig into the math to work out exactly how many runs are needed, and overkill is the safe option :slight_smile:

Please post both results! If they are still that weird, we'll go deeper on this.
Thanks for posting more details; things like hardware are very important here.
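To judge whether a 200ms gap is real, it helps to report the spread of the timings as well as the average. A quick sketch (the helper name and the sample values are mine; this uses the population standard deviation for simplicity):

```rust
// Mean and (population) standard deviation of a set of timing samples.
fn mean_and_stddev(samples: &[f64]) -> (f64, f64) {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}

fn main() {
    // Hypothetical per-run timings in milliseconds.
    let timings_ms = [1400.0, 1420.0, 1390.0, 1610.0, 1405.0];
    let (mean, sd) = mean_and_stddev(&timings_ms);
    // If the gap between two variants is much larger than the standard
    // deviation, it is probably not just run-to-run noise.
    println!("mean = {mean:.1} ms, stddev = {sd:.1} ms");
}
```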

I have now put both functions into one executable, and can no longer replicate the behavior. My guess is that it was either the compiler optimizing something in one of them and not in the other, or the branch predictor liking one function's assembly more than the other's (the perf stat output looked different, but I don't have enough knowledge to interpret it). In any case, if I narrow down what might have been happening, I'll update this thread.

Okis! It's fine if it only happens across two different executables; it's good to have the list of conditions needed to replicate it and track down the source.

One thing: if you run this, make sure the system isn't otherwise busy, and don't run both at the same time, or the measurements won't be precise enough to compare.

Always minimize external effects, and keep circumstances similar on both runs.

It's interesting to compare the assembly of both functions. I made a small change: instead of printing out swaps or not, I either return its value or I return 1 (just to avoid all the code added for println!).

In the code that doesn't use swaps, the optimizer has doubled the code length to (1) add a specific case when len = 2, and (2) unroll the loop to process 2 comparisons per iteration, skipping the last one if the length is odd.

The loop code with swaps is almost the same, though, so there shouldn't be any noticeable difference except if the branch prediction is very unlucky (it shouldn't be a limitation on the CPU side) in one case and not the other, which I'd dismiss given the symmetry.

I see you don't observe the problem any more. Have your data changed, and is/was there any particular pattern to them? Other than that, it must indeed be a measurement error or a problem in the procedure. I doubt that the optimizer will merge both functions even if there's only a small difference at the end.

left: discards swaps, right: returns swaps (thus limiting the loop unrolling)

example::sort:                                    |         
        cmp     rsi, 2                            |          
        jb      .LBB0_13                          |          
        lea     rcx, [rsi - 1]                    |          
        mov     eax, 1                            |          
        cmp     rsi, 2                            |          
        jne     .LBB0_2                           |          
.LBB0_9:                                          |          
        test    cl, 1                             |          
        je      .LBB0_13                          |          
        mov     ecx, dword ptr [rdi + 4*rax]      |          
.LBB0_11:                                         |          
        mov     edx, dword ptr [rdi + 4*rax - 4]  |          
        cmp     edx, ecx                          |          
        jbe     .LBB0_13                          |          
        mov     dword ptr [rdi + 4*rax], edx      |          
        dec     rax                               |          
        mov     dword ptr [rdi + 4*rax], ecx      |          
        jne     .LBB0_11                          |          
.LBB0_13:                                         | example::sort:
        mov     eax, 1                            |         cmp     rsi, 2
        ret                                       |         jb      .LBB0_7
.LBB0_2:                                          |         mov     ecx, 1
        mov     rdx, rcx                          |         xor     eax, eax
        and     rdx, -2                           |         jmp     .LBB0_4
        xor     esi, esi                          | .LBB0_2:
        jmp     .LBB0_3                           |         mov     rax, rdx
.LBB0_8:                                          | .LBB0_3:
        add     rax, 2                            |         inc     rcx
        add     rsi, 2                            |         cmp     rcx, rsi
        cmp     rsi, rdx                          |         je      .LBB0_8
        je      .LBB0_9                           | .LBB0_4:
.LBB0_3:                                          |         lea     rdx, [rax + rcx]
        mov     r8d, dword ptr [rdi + 4*rax]      |         mov     r8d, dword ptr [rdi + 4*rcx]
        mov     r9, rax                           |         mov     r9, rcx
.LBB0_4:                                          | .LBB0_5:
        mov     r10d, dword ptr [rdi + 4*r9 - 4]  |         mov     r10d, dword ptr [rdi + 4*r9 - 4]
        cmp     r10d, r8d                         |         cmp     r10d, r8d
        jbe     .LBB0_5                           |         jbe     .LBB0_3
        mov     dword ptr [rdi + 4*r9], r10d      |         inc     rax
        dec     r9                                |         mov     dword ptr [rdi + 4*r9], r10d
        mov     dword ptr [rdi + 4*r9], r8d       |         dec     r9
        jne     .LBB0_4                           |         mov     dword ptr [rdi + 4*r9], r8d
.LBB0_5:                                          |         test    r9, r9
        mov     r8d, dword ptr [rdi + 4*rax + 4]  |         jne     .LBB0_5
        mov     r9, rax                           |         jmp     .LBB0_2
.LBB0_6:                                          | .LBB0_7:
        mov     r10d, dword ptr [rdi + 4*r9]      |         xor     eax, eax
        cmp     r10d, r8d                         | .LBB0_8:
        jbe     .LBB0_8                           |         ret     
        mov     dword ptr [rdi + 4*r9 + 4], r10d  |
        mov     dword ptr [rdi + 4*r9], r8d       |
        add     r9, -1                            |
        jb      .LBB0_6                           |
        jmp     .LBB0_8                           |


The data didn't change; the only thing that changed is how I was running the function. Earlier, I would compile and run, edit the file to add/remove the print statement, then compile and run again; later, I made two separate functions, with and without the print statement, and ran both from one executable. Tomorrow I will try to replicate what happened, and will try to get the assembly of both.

The 12th-gen Intel chips have two different flavors of processing core. I wonder if there's something about the difference that causes the OS to schedule the non-printing program on an efficiency core but the printing one on a performance core; that could easily account for the difference.


Which CPU do you have? Some Intel CPUs have two types of cores (IIRC, the E-cores and the normal ones); if the OS schedules the two programs on different core types, that could cause this.

For the tests, try running them pinned to a specific core: https://stackoverflow.com/questions/21153084/start-a-process-on-a-specific-core-cpu

Now you can isolate this aspect :slight_smile:

