Hey all,
I'm writing a little abstraction over 128-bit blocks which should provide the best possible performance for tasks such as XORing two slices of these blocks in place, and also provide some convenience methods such as converting a `&Block` to a `&GenericArray` so it can be used with the `aes` crate.
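To give an idea of the kind of conversion I mean, something along these lines (just a sketch with a made-up function name, not the final API):

```rust
use generic_array::{typenum::U16, GenericArray};

// Rough illustration of the convenience conversion mentioned above: view 16
// raw bytes as the GenericArray type that the aes crate's API works with.
fn bytes_as_block(data: &[u8; 16]) -> &GenericArray<u8, U16> {
    // from_slice panics on a length mismatch, which cannot happen for a [u8; 16].
    GenericArray::from_slice(data)
}
```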
I wondered what the best option for storing the bits would be, and whether it makes sense to conditionally use SIMD instructions when they are available.
I came up with the following options which I tested:
```rust
use core::arch::x86_64::__m128i;

#[repr(align(16))]
pub struct BlockBytes {
    data: [u8; 16],
}

#[repr(align(16))]
pub struct BlockU128 {
    data: u128,
}

#[repr(align(16))]
pub struct BlockSse {
    data: __m128i,
}
```
and corresponding `xor_blocks` functions which take `a: &mut [Block]` and `b: &[Block]` and XOR them in place:
```rust
use core::arch::x86_64::_mm_xor_si128;

pub fn xor_blocks_bytes(a: &mut [BlockBytes], b: &[BlockBytes]) {
    a.iter_mut().zip(b).for_each(|(a, b)| {
        a.data.iter_mut().zip(b.data).for_each(|(a, b)| *a ^= b);
    })
}

pub fn xor_blocks_u128(a: &mut [BlockU128], b: &[BlockU128]) {
    a.iter_mut().zip(b).for_each(|(a, b)| {
        a.data ^= b.data;
    })
}

pub fn xor_blocks_sse(a: &mut [BlockSse], b: &[BlockSse]) {
    a.iter_mut().zip(b).for_each(|(a, b)| {
        a.data = unsafe { _mm_xor_si128(a.data, b.data) };
    })
}
```
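Regarding the "conditionally use SIMD instructions" part of my question: what I have in mind is roughly the following (just a sketch with my own names; it uses the unaligned load/store intrinsics so it doesn't depend on field layout, and since SSE2 is part of the baseline x86_64 target anyway, the `cfg` mostly matters for other architectures):

```rust
#[repr(align(16))]
pub struct Block {
    data: [u8; 16],
}

// Sketch: use an SSE2 path when the target supports it, otherwise fall back
// to the plain byte-wise XOR.
pub fn xor_blocks(a: &mut [Block], b: &[Block]) {
    #[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
    {
        use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_storeu_si128, _mm_xor_si128};
        a.iter_mut().zip(b).for_each(|(a, b)| unsafe {
            // SAFETY: both pointers come from valid [u8; 16] arrays; the
            // unaligned load/store intrinsics have no alignment requirement.
            let x = _mm_loadu_si128(a.data.as_ptr() as *const __m128i);
            let y = _mm_loadu_si128(b.data.as_ptr() as *const __m128i);
            _mm_storeu_si128(a.data.as_mut_ptr() as *mut __m128i, _mm_xor_si128(x, y));
        });
    }
    #[cfg(not(all(target_arch = "x86_64", target_feature = "sse2")))]
    a.iter_mut().zip(b).for_each(|(a, b)| {
        a.data.iter_mut().zip(b.data).for_each(|(a, b)| *a ^= b);
    });
}
```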
The functions are benchmarked with criterion.
Benchmark code
```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::distributions::Standard;
use rand::{thread_rng, Rng};
use mpc_block::{xor_blocks_bytes, xor_blocks_sse, xor_blocks_u128, BlockBytes, BlockSse, BlockU128};

pub fn blocks_xor(c: &mut Criterion) {
    let mut a: Vec<_> = thread_rng().sample_iter(Standard).map(BlockU128::new).take(1_000_000).collect();
    let b: Vec<_> = thread_rng().sample_iter(Standard).map(BlockU128::new).take(1_000_000).collect();
    c.bench_function("BlockU128 xor", |bencher| {
        bencher.iter(|| xor_blocks_u128(black_box(&mut a), black_box(&b)))
    });

    let mut a: Vec<_> = thread_rng().sample_iter(Standard).map(BlockBytes::new).take(1_000_000).collect();
    let b: Vec<_> = thread_rng().sample_iter(Standard).map(BlockBytes::new).take(1_000_000).collect();
    c.bench_function("BlockBytes xor", |bencher| {
        bencher.iter(|| xor_blocks_bytes(black_box(&mut a), black_box(&b)))
    });

    let mut a: Vec<_> = thread_rng().sample_iter(Standard).map(BlockSse::new).take(1_000_000).collect();
    let b: Vec<_> = thread_rng().sample_iter(Standard).map(BlockSse::new).take(1_000_000).collect();
    c.bench_function("BlockSse xor", |bencher| {
        bencher.iter(|| xor_blocks_sse(black_box(&mut a), black_box(&b)))
    });
}

criterion_group!(benches, blocks_xor);
criterion_main!(benches);
```
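The `new` constructors are omitted above (the full code is in the repository linked at the end); hypothetical versions that would make this snippet compile could look like this:

```rust
use core::arch::x86_64::{__m128i, _mm_loadu_si128};

// Hypothetical constructors, only so the benchmark snippet compiles as shown;
// each one just wraps a randomly sampled value.
impl BlockBytes {
    pub fn new(data: [u8; 16]) -> Self {
        Self { data }
    }
}

impl BlockU128 {
    pub fn new(data: u128) -> Self {
        Self { data }
    }
}

impl BlockSse {
    pub fn new(data: [u8; 16]) -> Self {
        // SAFETY: reading 16 bytes from a valid [u8; 16]; the unaligned load
        // has no alignment requirement.
        Self {
            data: unsafe { _mm_loadu_si128(data.as_ptr() as *const __m128i) },
        }
    }
}
```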
With these results:
```
BlockU128 xor    time:   [2.6061 ms 2.6313 ms 2.6595 ms]
BlockBytes xor   time:   [12.973 ms 13.095 ms 13.250 ms]
BlockSse xor     time:   [2.4989 ms 2.5125 ms 2.5265 ms]
```
The `u128` and `sse` implementations are usually very close together. The `bytes` implementation is consistently 5 times slower than the other two.
Now to the confusing part: the `bytes` and `sse` implementations are both compiled to use SIMD instructions and have nearly identical asm code. The `u128` implementation operates on pairs of qword values to XOR one `u128` value.
Godbolt link with asm
Diff of the bytes and sse asm
```diff
-example::xor_blocks_bytes:
+example::xor_blocks_sse:
         cmp     rsi, rcx
         cmova   rsi, rcx
         test    rsi, rsi
-        je      .LBB1_7
+        je      .LBB0_7
         cmp     rsi, 1
-        jne     .LBB1_3
+        jne     .LBB0_3
         xor     eax, eax
-        jmp     .LBB1_5
-.LBB1_3:
+        jmp     .LBB0_5
+.LBB0_3:
         mov     r8, rsi
         and     r8, -2
         mov     ecx, 16
         xor     eax, eax
-.LBB1_4:
-        movaps  xmm0, xmmword ptr [rdi + rcx - 16]
-        xorps   xmm0, xmmword ptr [rdx + rcx - 16]
-        movaps  xmm1, xmmword ptr [rdi + rcx]
+.LBB0_4:
+        movaps  xmm0, xmmword ptr [rdx + rcx - 16]
+        xorps   xmm0, xmmword ptr [rdi + rcx - 16]
         movaps  xmmword ptr [rdi + rcx - 16], xmm0
-        xorps   xmm1, xmmword ptr [rdx + rcx]
+        movaps  xmm0, xmmword ptr [rdx + rcx]
+        xorps   xmm0, xmmword ptr [rdi + rcx]
         add     rax, 2
-        movaps  xmmword ptr [rdi + rcx], xmm1
+        movaps  xmmword ptr [rdi + rcx], xmm0
         add     rcx, 32
         cmp     r8, rax
-        jne     .LBB1_4
-.LBB1_5:
+        jne     .LBB0_4
+.LBB0_5:
         test    sil, 1
-        je      .LBB1_7
+        je      .LBB0_7
         shl     rax, 4
-        movaps  xmm0, xmmword ptr [rdi + rax]
-        xorps   xmm0, xmmword ptr [rdx + rax]
+        movaps  xmm0, xmmword ptr [rdx + rax]
+        xorps   xmm0, xmmword ptr [rdi + rax]
         movaps  xmmword ptr [rdi + rax], xmm0
-.LBB1_7:
+.LBB0_7:
         ret
```
I'm not particularly well versed in reading assembly, but to me it seems like the only difference (aside from the label names) is that, in the unrolled part of the loop, the `bytes` implementation first reads a block from `a` and then one from `b`, and uses two registers (`xmm0` and `xmm1`) instead of just one. I would be really surprised if this minor difference resulted in a 5x slowdown.

I also ran `cargo asm` locally to check that I get the same asm as shown by Godbolt, and additionally ran `RUSTFLAGS="--emit asm -C llvm-args=-x86-asm-syntax=intel" cargo bench --no-run` and checked the asm of the actual benches binary (I think) to ensure that no funky optimizations are messing up my benchmark.
So, I guess this post boils down to these questions:
- Why is the `xor_blocks_bytes` function nearly 5 times slower than the SSE version despite nearly identical assembly?
- Why is the `xor_blocks_u128` function on par with the SSE version, despite not using SIMD instructions? Are the smaller `mov`s and `xor`s twice as fast as the SIMD operations? But then there wouldn't really be a point in the 128-bit instructions...
I'd also be happy to hear other recommendations on how to track down where the difference is coming from :)
EDIT: Here is a repository with the code and benchmarks.