Stateful SIMD-enabled hashing routine foiled by memcpy for small inputs

I'm trying to optimize an incremental hashing routine that is SIMD-enabled. There's a problem, and it lies with the facade.

enum HighwayChoices {
    Portable(PortableHash),
    Avx(AvxHash),
}

pub struct HighwayHasher(HighwayChoices);

impl HighwayHasher {
    pub fn append(&mut self, data: &[u8]) {
        match &mut self.0 {
            HighwayChoices::Portable(x) => x.append(data),
            HighwayChoices::Avx(x) => x.append(data),
        }
    }

    // ... snip ...
}

The HighwayHasher facade will choose the best hashing implementation at runtime based on the underlying CPU capabilities.

I recently benchmarked the facade. At payloads under 1KB, the facade's throughput can be half that of the underlying AVX implementation. Of course, some overhead should be expected from the facade, but this seems excessive, and such a large caveat is undesirable for the main entrypoint of the library.

Profiling benchmarks revealed the main culprit to be a memcpy of 192 bytes, which is a copy of the underlying hashing implementation in the construction of the facade. This copy is present even if we assume an AVX capable machine:

impl HighwayHasher {
    pub fn new(key: Key) -> Self {
        let h = unsafe { AvxHash::force_new(key) };
        HighwayHasher(HighwayChoices::Avx(h))
    }
}

The memcpy is not present if HighwayHasher is simply a newtype around AvxHash.
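That is, a shape along these lines compiles without the extra copy (the constructor body here just mirrors the earlier snippet and assumes an AVX-capable machine):

pub struct HighwayHasher(AvxHash);

impl HighwayHasher {
    pub fn new(key: Key) -> Self {
        // No enum discriminant, no second storage location: the hasher is
        // written directly into the newtype.
        HighwayHasher(unsafe { AvxHash::force_new(key) })
    }
}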

The question: are there reasonable techniques to eliminate the memcpy? Or is there a better way to have stateful simd-enabled routines? If incremental hashing wasn't necessary, I could follow in the footsteps of memchr: creating a pseudo-ifunc where a pointer to the optimized function is kept.
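For reference, the memchr-style pattern would look roughly like this for a one-shot API. This is only a sketch: the function names, the key type, and the "avx2" feature string are assumptions for illustration, and OnceLock stands in for memchr's raw-atomic machinery.

use std::sync::OnceLock;

// Stand-ins for hypothetical one-shot entry points; not the library's real API.
fn portable_hash64(_key: &[u64; 4], _data: &[u8]) -> u64 { 0 }
fn avx_hash64(_key: &[u64; 4], _data: &[u8]) -> u64 { 0 }

type HashFn = fn(&[u64; 4], &[u8]) -> u64;

pub fn hash64(key: &[u64; 4], data: &[u8]) -> u64 {
    // Resolve the best backend once; every later call is a plain indirect
    // call through the cached function pointer, and no hasher state ever
    // needs to be moved or copied.
    static CHOSEN: OnceLock<HashFn> = OnceLock::new();
    let f = CHOSEN.get_or_init(|| {
        if is_x86_feature_detected!("avx2") {
            avx_hash64
        } else {
            portable_hash64
        }
    });
    f(key, data)
}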

Or do I need to push this decision onto the caller: "Only use HighwayHasher when you need incremental hashing; otherwise, use a one-shot function"?

Criminal attempt to circumvent memcpy

It is awfully tempting to try and exploit the fact that every hashing implementation has a size of 192 bytes.

In an attempt to see if the problem was specifically related to enums, I tried to force in-place instantiation with:

pub struct HighwayHasher([u8; 192]);
impl HighwayHasher {
    pub fn new(key: Key) -> Self {
        let h = unsafe { AvxHash::force_new(key) };
        let data: [u8; 192] = unsafe { core::mem::transmute(h) };
        HighwayHasher(data)
    }
}

Unfortunately, this doesn't eliminate the memcpy as transmute copies bits from the source to destination.

Profiling details

Cracking open valgrind and kcachegrind revealed that `HighwayHasher::new` was responsible for 40% of the profiled benchmark time, which is odd considering that `AvxHash::force_new` only accounts for 4%.

A deeper dive determined `__memcpy_avx_unaligned_erms` to be the culprit. A memcpy! Let's look at the assembly with cargo-show-asm:

cargo asm --lib HighwayHasher::new
.section .text.highway::builder::HighwayHasher::new,"ax",@progbits
    .globl  highway::builder::HighwayHasher::new
    .p2align    4, 0x90
    .type   highway::builder::HighwayHasher::new,@function
highway::builder::HighwayHasher::new:

    .cfi_startproc
    push rbp
    .cfi_def_cfa_offset 16
    .cfi_offset rbp, -16
    mov rbp, rsp
    .cfi_def_cfa_register rbp
    push r15
    push r14
    push rbx
    and rsp, -32
    sub rsp, 448
    .cfi_offset rbx, -40
    .cfi_offset r14, -32
    .cfi_offset r15, -24
    mov rbx, rdi

    movaps xmm0, xmmword ptr [rsi]
    movaps xmm1, xmmword ptr [rsi + 16]
    movaps xmmword ptr [rsp + 16], xmm1
    movaps xmmword ptr [rsp], xmm0
    lea r14, [rsp + 224]
    mov rsi, rsp

    mov rdi, r14
    call qword ptr [rip + highway::x86::avx::AvxHash::force_new@GOTPCREL]

    lea rdi, [rsp + 24]
    mov r15, qword ptr [rip + memcpy@GOTPCREL]
    mov edx, 192
    mov rsi, r14
    call r15
    mov qword ptr [rbx], 2
    mov rdi, rbx
    add rdi, 8
    mov rsi, rsp
    mov edx, 216
    call r15

    mov rax, rbx
    lea rsp, [rbp - 24]
    pop rbx
    pop r14
    pop r15
    pop rbp
    .cfi_def_cfa rsp, 8
    ret

HighwayHasher takes ownership of data / HighwayChoices::Avx(h) on initialization.

I believe a move of ownership always causes rustc to emit a memcpy(), although sometimes the underlying LLVM optimizer may decide after its analysis to remove it. In this case, it obviously hasn't.

According to The Rustonomicon:

Every type must be ready for it to be blindly memcopied to somewhere else in memory.

I'm inclined to think this is an optimization that won't be too straightforward.

I think this is a reasonable solution.

According to The Rust Performance Book:

Rust types that are larger than 128 bytes are copied with memcpy rather than inline code. If memcpy shows up in non-trivial amounts in profiles, DHAT’s “copy profiling” mode will tell you exactly where the hot memcpy calls are and the types involved. Shrinking these types to 128 bytes or less can make the code faster by avoiding memcpy calls and reducing memory traffic.

One way to "shrink" the data would be to place it on the heap and use a Box pointer, but given the extra heap allocation, it may not end up being any faster. Maybe worth thinking about though.
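A minimal sketch of that direction, reusing the types from the snippets above and the same assumption of an AVX-capable machine:

enum HighwayChoices {
    Portable(Box<PortableHash>),
    Avx(Box<AvxHash>),
}

pub struct HighwayHasher(HighwayChoices);

impl HighwayHasher {
    pub fn new(key: Key) -> Self {
        let h = unsafe { AvxHash::force_new(key) };
        // Box::new still builds the hasher on the stack and copies it into
        // the allocation once, but every later move of HighwayHasher is only
        // a pointer plus discriminant rather than ~200 bytes.
        HighwayHasher(HighwayChoices::Avx(Box::new(h)))
    }
}

Whether the allocation and extra indirection actually beat the 192-byte copies would have to come out of the same benchmark.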


Thanks for checking that I wasn't missing something obvious. I'll pursue the one-shot functions as a workaround.

I guess the strong alignment requirements inhibit LLVM from doing the optimizations. For me, creating a custom tagged union with the tag at the end removes the memcpy.

use core::mem::ManuallyDrop;

union HighwayChoicesInner {
    avx: ManuallyDrop<AvxHash>,
    portable: ManuallyDrop<PortableHash>,
    // ...
}

struct HighwayChoices {
    inner: HighwayChoicesInner,
    tag: u8,
}
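For illustration, construction with this layout could look roughly like the following; `new_avx` is just an illustrative constructor name, and tag value 1 for the AVX variant is an arbitrary convention (the same one as the `self.tag == 1` check later in the thread):

impl HighwayChoices {
    fn new_avx(key: Key) -> Self {
        // Assumes an AVX-capable machine, as in the earlier snippets.
        let h = unsafe { AvxHash::force_new(key) };
        // The hasher still moves into the struct literal, but with the tag
        // stored after the 192-byte payload the copy reportedly gets elided.
        HighwayChoices {
            inner: HighwayChoicesInner {
                avx: ManuallyDrop::new(h),
            },
            tag: 1,
        }
    }
}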

It may be worth using the cpufeatures crate and a union instead of an enum. Your code could look roughly like this:


cpufeatures::new!(avx_cap, "avx");

union Backend {
    portable: PortableHash,
    avx: AvxHash,
}

pub struct HighwayHasher {
    backend: Backend,
    token: avx_cap::InitToken,
}

impl HighwayHasher {
    pub fn new(key: Key) -> Self {
        let (token, val) = avx_cap::init_get();
        let backend = if val {
            Backend {
                avx: AvxHash::new(key),
            }
        } else {
            Backend {
                portable: PortableHash::new(key),
            }
        };
        Self { backend, token }
    }

    pub fn append(&mut self, data: &[u8]) {
        unsafe {
            if self.token.get() {
                self.backend.avx.append(data);
            } else {
                self.backend.portable.append(data);
            }
        }
    }
}

A union could be a bit friendlier to compiler optimizations.

Wow, the union does get rid of memcpy on construction! Though now the memcpy has been pushed to hashing finalization (which consumes self).

impl HighwayHasher {
    fn finalize64(self) -> u64 {
        if self.tag == 1 {
            unsafe { ManuallyDrop::into_inner(self.inner.avx).finalize64() }
            // todo ... 
        } else {
            // todo ...
        }
    }
}

Any idea on how to consume self.inner.avx without memcpy?

Fwiw, operating over a mutable reference in append works great.

My dummy `AvxHash` implementation `self.0.into_iter().map(|x| x as u64).sum()` generates a memcpy by itself. However, `self.0.iter().map(|x| *x as u64).sum()` does not. I suggest defining an internal `AvxHash::finalize64_ref(&mut self) -> u64` method and calling it from the abstraction.
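In other words, something along these lines, where `finalize64_ref` is the internal helper suggested above (its existence on each backend is an assumption of this sketch):

impl HighwayHasher {
    fn finalize64(mut self) -> u64 {
        unsafe {
            if self.tag == 1 {
                // Finalize through &mut so the 192-byte hasher is never
                // moved out of the union; ManuallyDrop derefs to the backend.
                self.inner.avx.finalize64_ref()
            } else {
                self.inner.portable.finalize64_ref()
            }
        }
    }
}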


Yup, worked perfectly! Thank you for all the help.