ptr::copy_nonoverlapping slower than manual per-byte copy

I did some benchmarks recently and discovered that
ptr::copy_nonoverlapping is considerably slower than a dumb manual copy_bytes:

#[inline]
unsafe fn copy_bytes(src: *const u8, dst: *mut u8, count: usize) {
    // Safety: caller must ensure `src` and `dst` are each valid for
    // `count` bytes and that the two regions do not overlap.
    for i in 0..count {
        *dst.add(i) = *src.add(i);
    }
}

It wins on anything under 32 bytes, and is roughly on par with ptr::copy_nonoverlapping for bigger types.

N.B. Having the count parameter known at compile time has a significant performance impact (2-3×) on small types, for both copy functions.
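For reference, one way to make the count compile-time known is a const generic parameter; this is a hypothetical sketch (not the benchmark code from the post), with `copy_bytes_const` being a made-up name:

```rust
// Sketch: the same per-byte copy, but with the count as a const generic,
// so it is known at compile time and the loop can be fully unrolled.
#[inline]
unsafe fn copy_bytes_const<const N: usize>(src: *const u8, dst: *mut u8) {
    // Safety: caller must ensure both pointers are valid for `N` bytes
    // and that the regions do not overlap.
    for i in 0..N {
        *dst.add(i) = *src.add(i);
    }
}

fn main() {
    let src = [1u8, 2, 3, 4];
    let mut dst = [0u8; 4];
    // Safety: both arrays are valid for 4 bytes and do not overlap.
    unsafe { copy_bytes_const::<4>(src.as_ptr(), dst.as_mut_ptr()) };
    assert_eq!(dst, [1, 2, 3, 4]);
}
```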

I looked at the ptr::copy_nonoverlapping implementation, and it looks like it just calls the underlying intrinsic in release builds.

So... why is that? Does the intrinsic have some internal checks, or is this the price of the intrinsic call?

Rust: 1.60 msvc
OS: Windows 10 (latest)
CPU: i4771


This is likely highly platform-dependent, but on Godbolt (which I reckon is likely x64 Linux), the intrinsic gets compiled to a call to libc memcpy(), whereas the naïve byte-by-byte function is inlined and auto-vectorized (Link). The control-flow instruction might be the source of the overhead.

I don't know why the memcpy() call isn't being inlined, though, even with aggressive optimizations and full LTO. Perhaps libc is not LTO-enabled (it's pure machine code, i.e. it doesn't contain LLVM bitcode versions for its functions) and thus the compiler can't perform any inlining on it after the fact. Or maybe it's intentional to avoid code bloat, which would result in increased icache pressure (since memcpy is one of the most used functions).


I've been frustrated by a problem with similar symptoms with memcmp. The call isn't inlined and ends up going out to libc. It's enough to be measurable in benchmarks that approximate real work loads, so I wound up writing my own memcmp: memchr/util.rs at a13bb071ecee33f351655310217e327d52b9680e · BurntSushi/memchr · GitHub

I never did get to the bottom of it though.
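To illustrate the general idea (this is only a sketch of chunked comparison, not the actual code from the memchr crate): comparing a word at a time in Rust lets the whole thing inline, instead of paying for an out-of-line call to libc memcmp. The name `eq_bytes` is made up for this example.

```rust
// Sketch: equality check done 8 bytes at a time via integer loads,
// falling back to a plain slice compare for the tail.
fn eq_bytes(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut chunks_a = a.chunks_exact(8);
    let mut chunks_b = b.chunks_exact(8);
    for (ca, cb) in (&mut chunks_a).zip(&mut chunks_b) {
        // Reinterpret each 8-byte chunk as a u64 and compare in one go.
        let xa = u64::from_ne_bytes(ca.try_into().unwrap());
        let xb = u64::from_ne_bytes(cb.try_into().unwrap());
        if xa != xb {
            return false;
        }
    }
    // Compare the remaining <8 bytes, if any.
    chunks_a.remainder() == chunks_b.remainder()
}

fn main() {
    assert!(eq_bytes(b"hello, memcmp!", b"hello, memcmp!"));
    assert!(!eq_bytes(b"hello, memcmp!", b"hello, memcpy!"));
}
```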


For a precisely known length, LLVM will just replace it with an integer load/store of appropriate length. That's obviously faster than anything that needs to run some logic to figure out what strategy to use to do the copy, like a call to memcpy needs to.

Demo: https://rust.godbolt.org/z/746s5rseE
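As a minimal runnable sketch of that point (the wrapper name `copy8` is made up): wrapping copy_nonoverlapping with a constant length gives LLVM everything it needs to lower the copy to a single 8-byte load/store rather than a memcpy call.

```rust
use std::ptr;

// With the length fixed at compile time, LLVM can replace this call
// with a single 8-byte load and store instead of calling memcpy.
#[inline]
unsafe fn copy8(src: *const u8, dst: *mut u8) {
    // Safety: caller guarantees both pointers are valid for 8 bytes
    // and that the regions do not overlap.
    ptr::copy_nonoverlapping(src, dst, 8);
}

fn main() {
    let src = [0u8, 1, 2, 3, 4, 5, 6, 7];
    let mut dst = [0u8; 8];
    // Safety: both arrays are 8 bytes and do not overlap.
    unsafe { copy8(src.as_ptr(), dst.as_mut_ptr()) };
    assert_eq!(dst, src);
}
```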


Thanks everyone! That was informative.