XORing 128-bit Blocks - Function with nearly identical asm takes 5x as long

Hey all,
I'm writing a little abstraction over 128-bit blocks. It should provide the best possible performance for tasks such as XORing two slices of these blocks in place, and it should offer convenience methods such as converting a &Block to a &GenericArray so the blocks can be used with the aes crate.
I wondered what the best option for storing the bits would be, and whether it makes sense to conditionally use SIMD instructions when available.
I came up with the following options, which I tested:

#[repr(align(16))]
pub struct BlockBytes {
    data: [u8; 16]
}

#[repr(align(16))]
pub struct BlockU128 {
    data: u128
}

#[repr(align(16))]
pub struct BlockSse {
    data: __m128i
}

and corresponding xor_blocks functions, which take a: &mut [Block] and b: &[Block] and XOR them in place.

xor_blocks functions
pub fn xor_blocks_bytes(a: &mut [BlockBytes], b: &[BlockBytes]) {
    a.iter_mut().zip(b).for_each(|(a, b)| {
        a.data.iter_mut().zip(b.data).for_each(|(a, b)| *a ^= b);
    })
}

pub fn xor_blocks_u128(a: &mut [BlockU128], b: &[BlockU128]) {
    a.iter_mut().zip(b).for_each(|(a, b)| {
        a.data ^= b.data;
    })
}

pub fn xor_blocks_sse(a: &mut [BlockSse], b: &[BlockSse]) {
    a.iter_mut().zip(b).for_each(|(a, b)| {
        a.data = unsafe { _mm_xor_si128(a.data, b.data) };
    })
}
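The top of the post also asks whether it makes sense to conditionally use SIMD when available. A minimal sketch of what runtime dispatch could look like (the wrapper and the names xor_blocks_scalar / xor_blocks_sse2 are my own, not part of the code above; note that SSE2 is part of the x86_64 baseline, so on that target the runtime check is mostly redundant):

```rust
#[repr(align(16))]
#[derive(Clone, Copy)]
pub struct Block([u8; 16]);

// Hypothetical dispatch wrapper: take the SSE2 path when the CPU
// supports it, otherwise fall back to the portable scalar loop.
pub fn xor_blocks(a: &mut [Block], b: &[Block]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            // SAFETY: sse2 was just confirmed at runtime, and Block is
            // 16-byte aligned, so the aligned loads below are valid.
            unsafe { xor_blocks_sse2(a, b) };
            return;
        }
    }
    xor_blocks_scalar(a, b)
}

fn xor_blocks_scalar(a: &mut [Block], b: &[Block]) {
    a.iter_mut()
        .zip(b)
        .for_each(|(a, b)| a.0.iter_mut().zip(&b.0).for_each(|(a, b)| *a ^= *b));
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn xor_blocks_sse2(a: &mut [Block], b: &[Block]) {
    use std::arch::x86_64::{_mm_load_si128, _mm_store_si128, _mm_xor_si128};
    for (a, b) in a.iter_mut().zip(b) {
        // Aligned 128-bit load/xor/store, one block per iteration.
        let av = _mm_load_si128(a.0.as_ptr().cast());
        let bv = _mm_load_si128(b.0.as_ptr().cast());
        _mm_store_si128(a.0.as_mut_ptr().cast(), _mm_xor_si128(av, bv));
    }
}

fn main() {
    let mut a = [Block([0xAA; 16]); 4];
    let b = [Block([0xFF; 16]); 4];
    xor_blocks(&mut a, &b);
    assert!(a.iter().all(|blk| blk.0 == [0x55; 16]));
}
```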

The functions are benchmarked with criterion.

Benchmark code
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::{Rng, thread_rng};
use rand::distributions::{Standard};
use mpc_block::{BlockBytes, BlockSse, BlockU128, xor_blocks_bytes, xor_blocks_sse, xor_blocks_u128};

pub fn blocks_xor(c: &mut Criterion) {
    let mut a: Vec<_> = thread_rng().sample_iter(Standard).map(BlockU128::new).take(1_000_000).collect();
    let b: Vec<_> = thread_rng().sample_iter(Standard).map(BlockU128::new).take(1_000_000).collect();
    c.bench_function("BlockU128 xor", |bencher| bencher.iter(||
        { xor_blocks_u128(black_box(&mut a), black_box(&b)) }
    ));

    let mut a: Vec<_> = thread_rng().sample_iter(Standard).map(BlockBytes::new).take(1_000_000).collect();
    let b: Vec<_> = thread_rng().sample_iter(Standard).map(BlockBytes::new).take(1_000_000).collect();
    c.bench_function("BlockBytes xor", |bencher| bencher.iter(||
        { xor_blocks_bytes(black_box(&mut a), black_box(&b)) }
    ));

    let mut a: Vec<_> = thread_rng().sample_iter(Standard).map(BlockSse::new).take(1_000_000).collect();
    let b: Vec<_> = thread_rng().sample_iter(Standard).map(BlockSse::new).take(1_000_000).collect();
    c.bench_function("BlockSse xor", |bencher| bencher.iter(||
        { xor_blocks_sse(black_box(&mut a), black_box(&b)) }
    ));
}

criterion_group!(benches, blocks_xor);
criterion_main!(benches);

With these results:

BlockU128 xor           time:   [2.6061 ms 2.6313 ms 2.6595 ms]
BlockBytes xor          time:   [12.973 ms 13.095 ms 13.250 ms]
BlockSse xor            time:   [2.4989 ms 2.5125 ms 2.5265 ms]

The u128 and sse implementations are usually very close together. The bytes implementation is consistently 5 times slower than the other two.

Now to the confusing part:
The bytes and sse implementations are both compiled to use SIMD instructions and have nearly identical asm. The u128 implementation operates on pairs of qword values to XOR one u128 value.
Godbolt link with asm

Diff of the bytes and sse asm
-example::xor_blocks_bytes:
+example::xor_blocks_sse:
         cmp     rsi, rcx
         cmova   rsi, rcx
         test    rsi, rsi
-        je      .LBB1_7
+        je      .LBB0_7
         cmp     rsi, 1
-        jne     .LBB1_3
+        jne     .LBB0_3
         xor     eax, eax
-        jmp     .LBB1_5
-.LBB1_3:
+        jmp     .LBB0_5
+.LBB0_3:
         mov     r8, rsi
         and     r8, -2
         mov     ecx, 16
         xor     eax, eax
-.LBB1_4:
-        movaps  xmm0, xmmword ptr [rdi + rcx - 16]
-        xorps   xmm0, xmmword ptr [rdx + rcx - 16]
-        movaps  xmm1, xmmword ptr [rdi + rcx]
+.LBB0_4:
+        movaps  xmm0, xmmword ptr [rdx + rcx - 16]
+        xorps   xmm0, xmmword ptr [rdi + rcx - 16]
         movaps  xmmword ptr [rdi + rcx - 16], xmm0
-        xorps   xmm1, xmmword ptr [rdx + rcx]
+        movaps  xmm0, xmmword ptr [rdx + rcx]
+        xorps   xmm0, xmmword ptr [rdi + rcx]
         add     rax, 2
-        movaps  xmmword ptr [rdi + rcx], xmm1
+        movaps  xmmword ptr [rdi + rcx], xmm0
         add     rcx, 32
         cmp     r8, rax
-        jne     .LBB1_4
-.LBB1_5:
+        jne     .LBB0_4
+.LBB0_5:
         test    sil, 1
-        je      .LBB1_7
+        je      .LBB0_7
         shl     rax, 4
-        movaps  xmm0, xmmword ptr [rdi + rax]
-        xorps   xmm0, xmmword ptr [rdx + rax]
+        movaps  xmm0, xmmword ptr [rdx + rax]
+        xorps   xmm0, xmmword ptr [rdi + rax]
         movaps  xmmword ptr [rdi + rax], xmm0
-.LBB1_7:
+.LBB0_7:
         ret

I'm not particularly well versed in reading assembly, but to me it seems like the only difference (aside from the label names) is that, in the unrolled part of the loop, the bytes implementation first reads a block from a and then one from b, and uses two registers, xmm0 and xmm1, instead of just one. I would be really surprised if this minor difference resulted in a 5x slowdown.

I also ran cargo asm locally to check that I get the same asm as shown by godbolt, and I ran RUSTFLAGS="--emit asm -C llvm-args=-x86-asm-syntax=intel" cargo bench --no-run and checked the asm of the actual benches binary (I think) to make sure no funky optimizations were messing up my benchmark.

So, I guess this post boils down to these questions:

  • Why is the xor_blocks_bytes function nearly 5 times slower than the sse version despite nearly identical assembly?
  • Why is the xor_blocks_u128 function on par with the sse version despite not using SIMD instructions? Are the smaller movs and xors twice as fast as the SIMD operations? But then there wouldn't really be a point in the 128-bit instructions...
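If anyone wants to poke at the second question in isolation, the u128 lowering is easy to reproduce with a single std-only function (the name is mine) and inspect on Godbolt:

```rust
// Minimal reproducer: without extra target features, rustc lowers this
// to two qword mov/xor pairs on x86_64 rather than a single xmm xor.
pub fn xor_u128(a: &mut u128, b: &u128) {
    *a ^= *b;
}

fn main() {
    let mut a = u128::MAX;
    let b: u128 = 0x0123_4567_89AB_CDEF_0123_4567_89AB_CDEF;
    xor_u128(&mut a, &b);
    // MAX ^ b is the bitwise complement of b.
    assert_eq!(a, !b);
}
```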

I'm also happy about other recommendations on how to track down where the difference is coming from :)

EDIT: Here is a repository with the code and benchmarks.

Of note: with -Ctarget-cpu=native, they optimize identically.
Both versions compile down to this:

example::xor_blocks_bytes:
        cmp     rsi, rcx
        cmova   rsi, rcx
        test    rsi, rsi
        je      .LBB2_3
        xor     eax, eax
.LBB2_2:
        vmovaps xmm0, xmmword ptr [rdi + rax]
        vxorps  xmm0, xmm0, xmmword ptr [rdx + rax]
        vmovaps xmmword ptr [rdi + rax], xmm0
        add     rax, 16
        dec     rsi
        jne     .LBB2_2
.LBB2_3:
        ret

This is much simpler than the versions without -Ctarget-cpu=native. It looks like LLVM attempted to auto-vectorize with 256-bit vectors and then emulate those with 128-bit instructions, whereas the native version doesn't try any of that funkiness. I have no explanation for why the similar-looking versions have such a huge time disparity.


Quite the puzzle. You've also, interestingly, found an example where the level-2 and level-3 optimizations do extremely different things: https://rust.godbolt.org/z/fPnz7ze84. (That's probably not a problem; I just thought it interesting.)

One thing I always suggest for looking at vectorization is to look at the LLVM IR, not the ASM. That way you can separate what it's trying to do semantically from the exact choices the LLVM codegen makes to implement it. For example, in SIMD for a FIR filter: unstable result and partitioning in as_simd - #8 by scottmcm, that's the difference between

example::dot.exit:         ; preds = %bb11.i
  %10 = tail call float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> %9), !dbg !85
  ret float %10, !dbg !94

and

        xorps   xmm1, xmm1
        addss   xmm1, xmm0
        movshdup        xmm2, xmm0
        addss   xmm2, xmm1
        movaps  xmm1, xmm0
        unpckhpd        xmm1, xmm0
        addss   xmm1, xmm2
        shufps  xmm0, xmm0, 255
        addss   xmm0, xmm1
        ret

(As a bonus, that means that if the ASM it produces for those is too weird you have a repro for an LLVM bug already, since they don't want Rust repros.)

I decided to start simpler, and look at just the single-block operations in isolation: https://rust.godbolt.org/z/dr774zEcf

impl std::ops::BitXorAssign for BlockBytes {
    fn bitxor_assign(&mut self, rhs: Self) {
        self.data.iter_mut().zip(rhs.data).for_each(|(a, b)| *a ^= b);
    }
}

impl std::ops::BitXorAssign for BlockU128  {
    fn bitxor_assign(&mut self, rhs: Self) {
        self.data ^= rhs.data;
    }
}

impl std::ops::BitXorAssign for BlockSse  {
    fn bitxor_assign(&mut self, rhs: Self) {
        self.data = unsafe { _mm_xor_si128(self.data, rhs.data) };
    }
}

And that actually turned up something really interesting! The code for BlockBytes turned out to be terrible.

But the LLVM IR shows something interesting up top:

%_4.i = alloca %"core::iter::adapters::zip::Zip<core::slice::iter::IterMut<'_, u8>, core::array::iter::IntoIter<u8, 16>>", align 8

It apparently isn't able to remove the copy to stack for the IntoIterator.

Changing it from .zip(rhs.data) to .zip(rhs.data.iter()) simplifies it down to the expected: https://rust.godbolt.org/z/81rrnMPTW

define void @"_ZN68_$LT$example..BlockBytes$u20$as$u20$core..ops..bit..BitXorAssign$GT$13bitxor_assign17hb9732acdc931c5eaE"(ptr noalias noundef align 16 dereferenceable(16) %self, ptr noalias nocapture noundef dereferenceable(16) %rhs) unnamed_addr #0 personality ptr @rust_eh_personality !dbg !6 {
  %0 = load <16 x i8>, ptr %rhs, align 1, !dbg !12, !noalias !25
  %1 = load <16 x i8>, ptr %self, align 16, !dbg !30, !alias.scope !51, !noalias !25
  %2 = xor <16 x i8> %1, %0, !dbg !30
  store <16 x i8> %2, ptr %self, align 16, !dbg !30, !alias.scope !51, !noalias !25
  ret void, !dbg !56
}

Although, interestingly, it's only align 1 for the RHS for some reason, resulting in movups instead of the movaps that the SSE version generates.

So they're all subtly different. The other two are

define void @"_ZN67_$LT$example..BlockU128$u20$as$u20$core..ops..bit..BitXorAssign$GT$13bitxor_assign17hd1f08d9240fdb4beE"(ptr noalias nocapture noundef align 16 dereferenceable(16) %self, ptr noalias nocapture noundef readonly dereferenceable(16) %rhs) unnamed_addr #1 !dbg !57 {
  %_3 = load i128, ptr %rhs, align 16, !dbg !59
  %0 = load i128, ptr %self, align 16, !dbg !60
  %1 = xor i128 %0, %_3, !dbg !60
  store i128 %1, ptr %self, align 16, !dbg !60
  ret void, !dbg !61
}

define void @"_ZN66_$LT$example..BlockSse$u20$as$u20$core..ops..bit..BitXorAssign$GT$13bitxor_assign17h189076cbb3423346E"(ptr noalias nocapture noundef align 16 dereferenceable(16) %self, ptr noalias nocapture noundef readonly dereferenceable(16) %rhs) unnamed_addr #2 !dbg !62 {
  %_4 = load <2 x i64>, ptr %self, align 16, !dbg !64
  %_5 = load <2 x i64>, ptr %rhs, align 16, !dbg !66
  %0 = xor <2 x i64> %_5, %_4, !dbg !67
  store <2 x i64> %0, ptr %self, align 16, !dbg !74
  ret void, !dbg !75
}

That u128 version is emitted as two u64 operations in the assembly, which is also interesting.

<example::BlockU128 as core::ops::bit::BitXorAssign>::bitxor_assign:
        mov     rax, qword ptr [rsi]
        xor     qword ptr [rdi], rax
        mov     rax, qword ptr [rsi + 8]
        xor     qword ptr [rdi + 8], rax
        ret

I don't know enough about cost models to know for sure, but, especially since the loads and stores are all align 16, it seems plausible that it should be emitting one xmm xor instead of two qword xors.

Somehow xor_blocks_bytes recovers the align 16 for both loads though, so that's also curious (but good!): https://rust.godbolt.org/z/3xon77T73
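For completeness, here's the fixed BlockBytes impl with the .zip(rhs.data.iter()) change described above, as a self-contained sketch:

```rust
#[repr(align(16))]
pub struct BlockBytes {
    data: [u8; 16],
}

impl std::ops::BitXorAssign for BlockBytes {
    fn bitxor_assign(&mut self, rhs: Self) {
        // Zip against a borrowed iterator instead of the array's
        // by-value IntoIter, so the Zip adapter no longer gets
        // materialized on the stack (the alloca seen in the LLVM IR).
        self.data
            .iter_mut()
            .zip(rhs.data.iter())
            .for_each(|(a, b)| *a ^= *b);
    }
}

fn main() {
    let mut a = BlockBytes { data: [0xFF; 16] };
    a ^= BlockBytes { data: [0x0F; 16] };
    assert_eq!(a.data, [0xF0; 16]);
}
```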
