Why are &mut parameters faster than moving values?

In general and in my opinion, a function with "value signature", separate input and return values, like fn f(old: T) -> T is more readable than a function with an in-out-argument like fn f(t: &mut T).

Deep down in the implementation of BTreeMap, in node.rs, there are a lot of functions with value signature (right_kv, left_edge...). They support public functions in iterators that work on mutable fields. At some point, we go from &mut fields to owned values, as in:

fn f(x: &mut T) {
    x.y = lower_f(x.y);
}

except that this doesn't compile ( "cannot move out of x.y which is behind a mutable reference", so we spice it up with ptr::read (like in the move_on_up_val_unsafe below).

But there's also some genuine logic in the iterator code, so we can vary the point at which we go from &mut fields to owned values. When I favour value semantics in this iterator code, performance drops 10% (¹). I kind of understand why: passing a &mut is closer to passing raw pointers and safe code can't beat that, once you managed to write it correctly.

Still, looking for a holy grail to improve readability and keep up performance, I tried to capture this situation in much simpler code:

#![cfg(test)]

const BITS: usize = 12;

struct WrapperWrapper {
    bitswrapper: BitsWrapper,
}

struct BitsWrapper {
    bits: [bool; BITS],
}
const ZERO: BitsWrapper = BitsWrapper {
    bits: [false; BITS],
};

fn increment(bits_in: &[bool; BITS]) -> [bool; BITS] {
    // break down integer into bits, because optimizers cancel out a simple integer increment
    let mut carry = true;
    let mut bits_out = [false; BITS];
    for n in 0..BITS {
        if bits_in[n] == carry {
            bits_out[n] = false;
        } else {
            bits_out[n] = true;
            carry = false;
        }
    }
    bits_out
}

fn increment_val(w: BitsWrapper) -> BitsWrapper {
    BitsWrapper {
        bits: increment(&w.bits),
    }
}

fn increment_mut(w: &mut BitsWrapper) {
    w.bits = increment(&w.bits);
}

fn move_on_up_val(w: &mut WrapperWrapper) {
    w.bitswrapper = increment_val(std::mem::replace(&mut w.bitswrapper, ZERO));
}

unsafe fn move_on_up_val_unsafe(w: &mut WrapperWrapper) {
    w.bitswrapper = increment_val(core::ptr::read(&w.bitswrapper));
}

fn move_on_up_mut(w: &mut WrapperWrapper) {
    increment_mut(&mut w.bitswrapper);
}

#[bench]
pub fn more_val(b: &mut test::Bencher) {
    b.iter(|| {
        let mut w = WrapperWrapper { bitswrapper: ZERO };
        while !w.bitswrapper.bits[BITS - 1] {
            move_on_up_val(&mut w);
        }
        w
    })
}

#[bench]
pub fn more_val_unsafe(b: &mut test::Bencher) {
    b.iter(|| unsafe {
        let mut w = WrapperWrapper { bitswrapper: ZERO };
        while !w.bitswrapper.bits[BITS - 1] {
            move_on_up_val_unsafe(&mut w);
        }
        w
    })
}

#[bench]
pub fn more_mut(b: &mut test::Bencher) {
    b.iter(|| {
        let mut w = WrapperWrapper { bitswrapper: ZERO };
        while !w.bitswrapper.bits[BITS - 1] {
            move_on_up_mut(&mut w);
        }
        w
    })
}

Output on my machine:

test val_over_mut::more_mut                          ... bench:       9,145 ns/iter (+/- 278)
test val_over_mut::more_val                          ... bench:      31,873 ns/iter (+/- 990)
test val_over_mut::more_val_unsafe                   ... bench:      30,224 ns/iter (+/- 513)

As expected, it's worth while to use ptr::read instead of mem::replace. But it's a small difference (though more pronounced for higher values of BITS) compared to the vast disadvantage of value semantics here (²). What can we do to bring move_on_up_val up to par with move_on_up_mut or why not?

(¹) The use of ptr::read and the 10% performance loss is for mutable and owned maps. For immutable maps, the conversion happens through implicit copies, thus probably safe code, although the functions doing so are marked unsafe.
(²) By the way, the difference holds for other values of BITS, it's more pronounced for BITS=11, and it mysteriously vanishes for BITS=10 only.

I think your 2nd comment is the key. Optimizers should, in theory, have no trouble bringing the by-value code on par with the by-ref-mut code, at least I don't see any fundamental reason, especially for such simple code. It may well be the case that some heuristics inside rustc or LLVM have a built-in arbitrary constant and for BITS=10, some (arbitrary) complexity measure of your code happens to fit into what is maximally allowed by said heuristic, but for larger values, the optimizer just throws up its hands and says "who has the time for this".

You might also want to play with different optimization settings. For example, when I compile with -C opt-level=z, all of a sudden increment_val and increment_mut compile to almost identical assembly.

2 Likes

Well, I first thought that's not quite right. I didn't make that clear but for lower values, the difference is again spectacular. Though the gap narrows, 6 is also a sweet spot, then the gap returns with a vengeance at 5. But now I learnt how to disassemble this, on closer inspection, for BITS=10 and with -O, move_on_up_val_unsafe gets inlined completely but move_on_up_mut doesn't. Down at BITS=9, the latter gets inlined too, so it wins again. Considering that we're only feeding the functions used to godbolt, not the benchmark code, that's enough to explain the BITS=10 and BITS=6 anomalies for me.

But for BITS 11 and higher, with -O, the optimizer throws up its hands for both. There are obvious differences with non-optimized code, but no inlining. The difference is that move_on_up_val_unsafe operates on an additional 12 byte copy of the bits.

That's a great observation and it sheds light on yet another important detail: the compiler has a quite good and complex heuristic-probabilistic model of the target architecture, but it's not 100% accurate. Nor does it have all the necessary information of all of the environment that affects the actual, real-life performance of a particular run of the program. Consequently, the optimizer sometimes needs to make a bad or simply underinformed decision, and performs a transformation that makes the code slower even though it was supposed to make it faster.

So, I'm not saying that "for sizes N < K, the code will run faster". I'm only saying that "for sizes N < K, the compiler will optimize differently (or at all)".

(Note that I might also be wrong about this in general, as optimizers nowadays are very complex beasts. But at least what you observed in this particular case seems to support this theory of mine.)

1 Like

I think that passing the bits as a slice and maybe the fact that this increment function is too simple, causes the benchmark to measure nothing but the quality of inlining, rather than the effect of passing data by value or reference. When I try passing the bits by value, or as a tuple, the difference disappears (except that the safe variant with mem::replace costs a little extra).

Must think of some other way to reproduce the BTreeMap experience..