Perfomance: by reference vs by value, it seems by value faster, why?

Hello!
I am sorting strings by quick sort algorithm with two slightly different realizations.
First one where I use type vector of strings &[String] and second where I use vector of references to string [&String].
First type vector of strings works much faster. So, I have a question: why using references works so slow?

Below the code. To see huge difference in running time you can increase number of items from 10_000_000 to 100_000_000, for example. It depends on your hardware.
Frist one

fn main() {
    let mut hash = std::collections::HashSet::new();

    let now = std::time::Instant::now();
    for i in 0..10_000_000 {
        hash.insert(i.to_string());
    }
    println!("{:?} insert to hashmap", now.elapsed());

    let now = std::time::Instant::now();
    let mut v: Vec<String> = hash.iter().cloned().collect();
    println!("{:?} collect keys to vector", now.elapsed());

    let now = std::time::Instant::now();
    quick_sort(&mut v);
    println!("{:?} sorting vector", now.elapsed());
}

pub fn quick_sort(arr: &mut [String]) {
    let len = arr.len();
    _quick_sort(arr, 0, (len - 1) as isize);
}

fn _quick_sort(arr: &mut [String], low: isize, high: isize) {
    if low < high {
        let p = partition(arr, low, high);
        _quick_sort(arr, low, p - 1);
        _quick_sort(arr, p + 1, high);
    }
}

fn partition(arr: &mut [String], low: isize, high: isize) -> isize {
    let pivot = high as usize;
    let mut store_index = low - 1;
    let mut last_index = high;

    loop {
        store_index += 1;
        while arr[store_index as usize] < arr[pivot] {
            store_index += 1;
        }
        last_index -= 1;
        while last_index >= 0 && arr[last_index as usize] > arr[pivot] {
            last_index -= 1;
        }
        if store_index >= last_index {
            break;
        } else {
            arr.swap(store_index as usize, last_index as usize);
        }
    }
    arr.swap(store_index as usize, pivot);
    store_index
}

and second one

fn main() {
    let mut hash = std::collections::HashSet::new();

    let now = std::time::Instant::now();
    for i in 0..10_000_000 {
        hash.insert(i.to_string());
    }
    println!("{:?} insert to hashmap", now.elapsed());

    let now = std::time::Instant::now();
    let mut v: Vec<&String> = hash.iter().collect();
    println!("{:?} collect keys to vector", now.elapsed());

    let now = std::time::Instant::now();
    quick_sort(&mut v);
    println!("{:?} sorting vector", now.elapsed());
}

pub fn quick_sort(arr: &mut [&String]) {
    let len = arr.len();
    _quick_sort(arr, 0, (len - 1) as isize);
}

fn _quick_sort(arr: &mut [&String], low: isize, high: isize) {
    if low < high {
        let p = partition(arr, low, high);
        _quick_sort(arr, low, p - 1);
        _quick_sort(arr, p + 1, high);
    }
}

fn partition(arr: &mut [&String], low: isize, high: isize) -> isize {
    let pivot = high as usize;
    let mut store_index = low - 1;
    let mut last_index = high;

    loop {
        store_index += 1;
        while arr[store_index as usize] < arr[pivot] {
            store_index += 1;
        }
        last_index -= 1;
        while last_index >= 0 && arr[last_index as usize] > arr[pivot] {
            last_index -= 1;
        }
        if store_index >= last_index {
            break;
        } else {
            arr.swap(store_index as usize, last_index as usize);
        }
    }
    arr.swap(store_index as usize, pivot);
    store_index
}

A few questions:

  • How are you measuring performance? Are you using a benchmarking tool such as Criterion?
  • Are you running your code in release mode?
4 Likes

&String is a double indirection, since String consists of a pointer to the string data along with a length and capacity, which is then pointed to by the reference. The type &str is generally better when a shared reference to a string is required, since it just stores a length and a pointer to the string data, and &String can be coerced to &str.

4 Likes

I measuring performance in the way above.
yes, I execute my program in release mode

it means if need to sort keys of hashmap then I need to clone them to vector?
It eats lots of memory and it takes some time also

No; if you are constructing a vector of keys so you can sort them, then you should use a Vec<&str>.

let mut v: Vec<&str> = hash.iter().map(|s| s.as_str()).collect();
quick_sort(&mut v);

The .as_str() conversion (which can also be written &*s or &s[..]) just returns the pointer that's inside the String, as the type &str. No cloning of the string characters is involved. This should be, overall, faster than either of your existing options.

3 Likes

it hits perfomance because of how CPU works?
is it related topic?

This type of measurement is quite noisy and unreliable. Energy and heat management of the CPU, OS multitasking, state of caches, branch prediction history can all significantly affect results, by margin more than differences in the code itself. There's also a risk of the optimizer deleting or rearranging the code.

You should use bencher or criterion for measuring small code differences.

1 Like

Well, it causes a performance hit because it is doing more work. A reference has to be dereferenced to obtain the pointed value.

Regardless of how expensive pointer hopping is (due to caching, prefetching, and other techniques), dereferencing then doing stuff is always going to be strictly more than doing the same stuff without having to dereference first.

In most cases, such a simple dereferencing operation can be elided by the compiler (eg. when simply passing a double reference to an inlineable function), but when it's
in the middle of a complex sorting algorithm, the compiler has a much harder job.

1 Like

Yes, I understand that is more work. But not all "more work" can affect perfomance so much.
For example simple incrementing i+=1 does not affect it. I added to my sorting algorithm a counter and then return and print it. And that does not affect perfomance almost.
It means dereferences expensive operation. So why it so expensive?

Memory is incredibly slow relative to the CPU. This is why CPUs have multiple levels of caches. If you structure you program in a way that causes cache misses, it will be a huge penalty. Every dereference accesses something elsewhere in RAM, and therefore is at risk of reading something that isn't already cached.

RAM is so slow, that hyperthreading exists. When the CPU is waiting for RAM to respond, it has so much idle time, that it has time to switch to executing entirely different code from a different program, and switch back before RAM responds.

5 Likes