What's the difference between these two XOR implementations?

pub fn xor1(dst: &mut [u8], src: &[u8]) {
    // assert!(dst.len() <= src.len());
    let len = dst.len();
    let mut dst_ptr = dst.as_mut_ptr();
    let mut src_ptr = src.as_ptr();
    for _ in 0..len {
        unsafe {
            *dst_ptr ^= *src_ptr;
            src_ptr = src_ptr.offset(1);
            dst_ptr = dst_ptr.offset(1);
        }
    }
}

pub fn xor2(dst: &mut [u8], src: &[u8]) {
    // assert!(dst.len() <= src.len());
    for i in 0..dst.len() {
        dst[i] ^= src[i];
    }
}

playground
The first version is the original code and the second is a straightforward safe implementation. The best practice I know of is to write code as straightforwardly as possible and leave optimization to the compiler, so I prefer the safe version. The assembly generated for the two versions is quite different, though both appear to make use of SIMD. Semantically they should be equivalent, and the comparison tests for the whole crate pass. I didn't run a benchmark, so I don't know the performance difference between them. Here are some thoughts from my attempts, though I don't know the extent of the compiler's optimizations either.
The unsafe version only tracks a current position: each iteration XORs at the current position, then advances it by one, but it wastes the range iterator variable. (If we went further down this path, a do-while might perform better.)
The safe version has an initial position (the start of the dst slice) and a current relative position. Each iteration checks that the relative position has not exceeded the length of dst (this check is wasted work, and I can still see it in the assembly even if I enable the commented-out assert), XORs at the initial position plus the relative position, then advances the relative position by one.

1 Like

With the asserts commented out, xor1 has UB if src isn't long enough, where xor2 will panic; this may be the source of the codegen differences. This safe version seems to output assembly roughly equivalent to xor1:

pub fn xor3(dst: &mut [u8], src: &[u8]) {
    for (d, s) in std::iter::zip(dst, src) {
        *d ^= s;
    }
}
5 Likes

Which one should we prefer (in the original code context)?

Unless you have a very good reason to, you shouldn't use unsafe code for something as trivial as this. Just write the safe, straightforward thing first. If you use the zip version presented above, it will probably optimize to the same assembly as the manual unsafe version, as demonstrated.

5 Likes

I certainly would not use 'unsafe' in any of my regular application level code. And not ever unless it is really necessary.

For once I prefer the functional programming style version shown by 2e71828 to the old C style loop of the original post.

1 Like

(Whisper: The original version came from the most downloaded Keccak/SHA3 library on crates.io, which I linked at the beginning; I was adapting a version of it with some customization. While scanning the unsafe code I found the XOR implementation in the first version and changed it to the second. Maybe the reason is that the code was translated from C and the original translator didn't know the idiomatic Rust way to do this.)

Sounds like time to benchmark all these versions. Let's see if there is a performance advantage anywhere.

It's also possible that it was automatically translated using a C-to-Rust transpiler. (Given that all of C is unsafe and has no array bounds checks, such a machine translation would likely produce exactly the raw pointer arithmetic that the C code expresses.)

2 Likes

BTW, an assert! before a loop using [i] indexing can improve the quality of the generated code. LLVM will notice that the bounds check has been moved outside the loop and won't repeat it inside. LLVM can't hoist bounds checks out of the loop by itself, because writes done in the loop before a panic are an observable side effect.
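As a small sketch of that pattern (the function name is mine, not from the thread; whether LLVM actually elides the per-iteration checks can vary by compiler version):

```rust
pub fn xor_with_assert(dst: &mut [u8], src: &[u8]) {
    // One check up front. After this assert, LLVM can prove that every
    // src[i] below is in-bounds (i < dst.len() <= src.len()), so the
    // per-iteration bounds checks can be elided and the loop vectorized.
    assert!(dst.len() <= src.len());
    for i in 0..dst.len() {
        dst[i] ^= src[i];
    }
}
```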

6 Likes

This is not something you should ever worry about. For trivial small no-Drop-needed types like u32 or Range<usize>, the optimizer will perfectly remove them every time.

The core thing you need to think about is what you're assuming as a human: that the indexes will be in-bounds. The compiler, though, needs to handle all the cases where they might not be in-bounds, most importantly that it can't do an out-of-bounds read on src.

Two previous comments I've written about this, for more information:

  1. Rust auto-vectorisation difference? - #2 by scottmcm
  2. Understanding Rusts Auto-Vectorization and Methods for speed increase - #5 by scottmcm

So what you want is to say what you expect as part of the code.

The general strategy here is what I call re-slicing, which would look like this:

pub fn xor3(dst: &mut [u8], src: &[u8]) {
    let n = dst.len();

    // Check *before* looping that both are long enough,
    // in a way that makes it directly obvious to LLVM
    // that the indexing below will be in-bounds.
    let (dst, src) = (&mut dst[..n], &src[..n]);

    for i in 0..n {
        dst[i] ^= src[i];
    }
}

Which vectorizes as expected: https://rust.godbolt.org/z/sqhaoMbP7.

Note that this is subtly different from the zip approach. The zip approach is like using let n = std::cmp::min(src.len(), dst.len());. Whereas the reslicing approach will still panic for dst.len() > src.len(): the panic will just happen before the loop instead of inside it.
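A quick sketch of that semantic difference (function names are mine): the zip version silently processes only the shorter length, while the resliced version panics up front when src is too short.

```rust
pub fn xor_zip(dst: &mut [u8], src: &[u8]) {
    // Iterates over the shorter of the two slices; never panics.
    std::iter::zip(dst, src).for_each(|(d, s)| *d ^= *s);
}

pub fn xor_resliced(dst: &mut [u8], src: &[u8]) {
    let n = dst.len();
    // Panics here, before the loop, if src.len() < dst.len().
    let (dst, src) = (&mut dst[..n], &src[..n]);
    for i in 0..n {
        dst[i] ^= src[i];
    }
}
```

With `dst = [0xFF; 4]` and `src = [0x0F; 2]`, xor_zip updates only the first two bytes, while xor_resliced panics on the slicing line.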

Because of optimizers, adding more checks can actually make code faster.

7 Likes

I found the original PR while doing more refactoring, and it claimed the unsafe version was 35%-50% faster in benchmarks???
But that PR is from 7 years ago; has the compiler improved anything here in the meantime?

Yes, zip codegen is much better now, thanks to work from @the8472 and others. Not to mention that 7 years ago was before we could turn on the noalias annotations, as LLVM had bugs there, which have since been fixed.

Also, that predates the internal iteration helpers.

The old code was

        fn xorin(dst: &mut [u8], src: &[u8]) {
            for (d, i) in dst.iter_mut().zip(src) {
                *d ^= *i;
            }
        }

And nowadays the simple way to make nicer codegen more likely is just to avoid the external iteration, i.e. to do

        fn xorin(dst: &mut [u8], src: &[u8]) {
            dst.iter_mut().zip(src).for_each(|(d, i)| {
                *d ^= *i;
            });
        }

Because the body is so tiny, both in tokens and in behaviour. (The bigger the body the less this matters -- don't just blindly do it for every loop!)

It doesn't actually matter today, though, as they both do the same thing exactly thanks to core library smarts: https://rust.godbolt.org/z/v9jbT6PhW.

Zip actually has no optimizations for internal iteration. The heavy lifting is all done by the horrendously unsafe specializations for next(), i.e. external iteration. We'd need different for-loop desugaring to get rid of that.
It's more things like Flatten and Chain that benefit from internal iteration.
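A small illustration of an adaptor where internal iteration helps (example mine, not from the thread): Chain's fold can exhaust the first half and then the second as two plain loops, whereas repeated external next() calls have to re-check which half they're in on every item.

```rust
// fold() (internal iteration) lets Chain run one simple loop per side;
// calling next() repeatedly (external iteration) would branch on the
// chain's state for every single element.
pub fn sum_chained(a: &[u32], b: &[u32]) -> u32 {
    a.iter().chain(b).fold(0, |acc, &x| acc + x)
}
```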

1 Like

So is best practice for today the explicit `for` loop over the zipped iterators, or the `for_each` version?
Currently I'm using the first one.

I think what I'd suggest right now, assuming you want the "shorter of the two" behaviour, is

        fn xorin(dst: &mut [u8], src: &[u8]) {
            std::iter::zip(dst, src).for_each(|(d, i)| *d ^= *i);
        }

The reslicing is great as a general approach for more complex scenarios, but when all you need is a zip, just use it.

(We can't necessarily address all the bigger combinations of stuff, but something simple like this over slice iterators is the kind of thing that we'll keep working well.)

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.