What's the difference between these 2 XOR impls?

This is not something you should ever worry about. For trivial small no-Drop-needed types like u32 or Range<usize>, the optimizer will perfectly remove them every time.

The core thing you need to think about is what you're assuming as a human: that the indexes will be in-bounds. The compiler, though, needs to handle all the cases where they might not be in-bounds, most importantly that it can't do an out-of-bounds read on src.
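To make that concrete, here is a minimal sketch of the kind of loop in question. The name xor_naive is hypothetical and this is not necessarily the code from the original question; it just shows where the compiler has to be careful.

```rust
/// Hypothetical naive version: indexes `src` with a bound taken from `dst`.
pub fn xor_naive(dst: &mut [u8], src: &[u8]) {
    for i in 0..dst.len() {
        // `dst[i]` is provably in-bounds because `i < dst.len()`.
        // `src[i]` is not: nothing here says `src` is at least as long
        // as `dst`, so a bounds check on `src` has to stay in the loop.
        dst[i] ^= src[i];
    }
}
```

If src is shorter than dst, this panics partway through, after the earlier elements of dst have already been modified. Preserving that exact behaviour is part of what makes the loop awkward to vectorize.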

Two previous comments I've written about this, for more information:

  1. Rust auto-vectorisation difference? - #2 by scottmcm
  2. Understanding Rusts Auto-Vectorization and Methods for speed increase - #5 by scottmcm

So what you want is to say what you expect as part of the code.

The general strategy here is what I call re-slicing, which would look like this:

pub fn xor3(dst: &mut [u8], src: &[u8]) {
    let n = dst.len();

    // Check *before* looping that both are long enough,
    // in a way that makes it directly obvious to LLVM
    // that the indexing below will be in-bounds.
    let (dst, src) = (&mut dst[..n], &src[..n]);

    for i in 0..n {
        dst[i] ^= src[i];
    }
}

Which vectorizes as expected: https://rust.godbolt.org/z/sqhaoMbP7.

Note that this is subtly different from the zip approach. The zip approach is like using let n = std::cmp::min(src.len(), dst.len()); so it silently stops at the shorter slice. The re-slicing approach, by contrast, will still panic for dst.len() > src.len(): the panic will just happen before the loop instead of inside it.
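For comparison, a rough sketch of the zip version next to its index-based equivalent. The names xor_zip and xor_min are made up for illustration; neither function panics on mismatched lengths, they just process the common prefix.

```rust
/// Zip version: stops at the end of the shorter slice, never panics.
pub fn xor_zip(dst: &mut [u8], src: &[u8]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d ^= *s;
    }
}

/// Index-based equivalent: re-slice both to the common length first.
pub fn xor_min(dst: &mut [u8], src: &[u8]) {
    let n = std::cmp::min(dst.len(), src.len());
    // Both re-slices are always in-bounds because `n` is the minimum.
    let (dst, src) = (&mut dst[..n], &src[..n]);
    for i in 0..n {
        dst[i] ^= src[i];
    }
}
```

Which one you want depends on whether a too-short src is a bug you'd like to hear about (re-slice to dst.len() and panic up front) or something to tolerate quietly (zip or min).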

Because of optimizers, adding more checks can actually make code faster.
