I'm trying to optimize an incremental hashing routine that is SIMD enabled. There's a problem, and it lies with the facade:
enum HighwayChoices {
    Portable(PortableHash),
    Avx(AvxHash),
}

pub struct HighwayHasher(HighwayChoices);

impl HighwayHasher {
    pub fn append(&mut self, data: &[u8]) {
        match &mut self.0 {
            HighwayChoices::Portable(x) => x.append(data),
            HighwayChoices::Avx(x) => x.append(data),
        }
    }

    // ... snip ...
}
The HighwayHasher facade will choose the best hashing implementation at runtime based on the underlying CPU capabilities.
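Roughly, that runtime selection looks something like the sketch below (assuming std's is_x86_feature_detected! macro and a PortableHash::new constructor alongside the AvxHash::force_new shown later):

impl HighwayHasher {
    pub fn new(key: Key) -> Self {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: the check above guarantees the CPU supports AVX2
            let h = unsafe { AvxHash::force_new(key) };
            HighwayHasher(HighwayChoices::Avx(h))
        } else {
            HighwayHasher(HighwayChoices::Portable(PortableHash::new(key)))
        }
    }
}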
I recently benchmarked the facade. At payloads under 1KB, the facade's throughput can be half that of the underlying AVX implementation. Of course, some overhead should be expected from the facade, but this seems excessive. And such a large caveat is undesirable for the main entrypoint of the library.
Profiling benchmarks revealed the main culprit to be a memcpy of 192 bytes: a copy of the underlying hashing implementation made while constructing the facade. This copy is present even if we assume an AVX-capable machine:
impl HighwayHasher {
    pub fn new(key: Key) -> Self {
        // assume AVX support for the sake of demonstration
        let h = unsafe { AvxHash::force_new(key) };
        HighwayHasher(HighwayChoices::Avx(h))
    }
}
The memcpy is not present if HighwayHasher is simply a newtype around AvxHash.
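In other words, a shape like the following sketch, which drops runtime dispatch entirely, avoids the copy:

// Newtype with no enum and no dispatch: the 192-byte copy disappears.
pub struct HighwayHasher(AvxHash);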
The question: are there reasonable techniques to eliminate the memcpy? Or is there a better way to have stateful SIMD-enabled routines? If incremental hashing weren't necessary, I could follow in the footsteps of memchr: creating a pseudo-ifunc where a pointer to the optimized function is kept.
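For reference, here is roughly what that memchr-style pseudo-ifunc looks like for a hypothetical one-shot hash64; Key, avx_hash64, and portable_hash64 are stand-ins for illustration, not the crate's actual API:

use core::sync::atomic::{AtomicPtr, Ordering};

// Signature shared by every one-shot implementation.
type RealFn = fn(&Key, &[u8]) -> u64;

// Starts out pointing at `detect`, which swaps in the best implementation
// on the first call and is never consulted again.
static HASH_FN: AtomicPtr<()> = AtomicPtr::new(detect as RealFn as *mut ());

fn detect(key: &Key, data: &[u8]) -> u64 {
    // avx_hash64 / portable_hash64 are hypothetical one-shot functions.
    let chosen: RealFn = if is_x86_feature_detected!("avx2") {
        avx_hash64
    } else {
        portable_hash64
    };
    HASH_FN.store(chosen as *mut (), Ordering::Relaxed);
    chosen(key, data)
}

pub fn hash64(key: &Key, data: &[u8]) -> u64 {
    let f = HASH_FN.load(Ordering::Relaxed);
    // SAFETY: HASH_FN only ever holds a pointer with RealFn's signature.
    let f: RealFn = unsafe { core::mem::transmute(f) };
    f(key, data)
}

The state problem goes away because there is no state: only a function pointer is cached.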
Or do I need to push this decision onto the caller: "only use HighwayHasher when you need incremental hashing; otherwise, use a one-shot function"?
Criminal attempt to circumvent memcpy
It is awfully tempting to try and exploit the fact that every hashing implementation has a size of 192 bytes.
In an attempt to see if the problem was specifically related to enums, I tried to force in-place instantiation with:
pub struct HighwayHasher([u8; 192]);

impl HighwayHasher {
    pub fn new(key: Key) -> Self {
        let h = unsafe { AvxHash::force_new(key) };
        let data: [u8; 192] = unsafe { core::mem::transmute(h) };
        HighwayHasher(data)
    }
}
Unfortunately, this doesn't eliminate the memcpy, as transmute copies bits from the source to the destination.
Profiling details
Cracking open valgrind and kcachegrind revealed that `HighwayHasher::new` was responsible for 40% of the profiled benchmark time, which is odd considering that `AvxHash::force_new` only accounts for 4%. A deeper dive determined __memcpy_avx_unaligned_erms to be the culprit. A memcpy! Let's look at the assembly with cargo-show-asm:
cargo asm --lib HighwayHasher::new
.section .text.highway::builder::HighwayHasher::new,"ax",@progbits
.globl highway::builder::HighwayHasher::new
.p2align 4, 0x90
.type highway::builder::HighwayHasher::new,@function
highway::builder::HighwayHasher::new:
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
.cfi_offset rbp, -16
mov rbp, rsp
.cfi_def_cfa_register rbp
push r15
push r14
push rbx
and rsp, -32
sub rsp, 448
.cfi_offset rbx, -40
.cfi_offset r14, -32
.cfi_offset r15, -24
mov rbx, rdi
movaps xmm0, xmmword ptr [rsi]
movaps xmm1, xmmword ptr [rsi + 16]
movaps xmmword ptr [rsp + 16], xmm1
movaps xmmword ptr [rsp], xmm0
lea r14, [rsp + 224]
mov rsi, rsp
mov rdi, r14
call qword ptr [rip + highway::x86::avx::AvxHash::force_new@GOTPCREL]
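# 192-byte memcpy: the AvxHash that force_new just wrote at [rsp + 224]
# is copied down to [rsp + 24]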
lea rdi, [rsp + 24]
mov r15, qword ptr [rip + memcpy@GOTPCREL]
mov edx, 192
mov rsi, r14
call r15
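# write the enum discriminant, then memcpy the remaining 216 bytes of the
# enum from the stack into the caller-provided return slot at [rbx + 8]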
mov qword ptr [rbx], 2
mov rdi, rbx
add rdi, 8
mov rsi, rsp
mov edx, 216
call r15
mov rax, rbx
lea rsp, [rbp - 24]
pop rbx
pop r14
pop r15
pop rbp
.cfi_def_cfa rsp, 8
ret