struct Operation {
    src: *const u8,
    dst: *mut u8,
    amount: usize,
}

pub fn process(data: &[Operation; 5]) {
    for op in data {
        unsafe {
            core::ptr::copy_nonoverlapping(op.src, op.dst, op.amount);
        }
    }
}
The same code in C++:
#include <cstring>
#include <cstdint>
#include <cstddef>

using u8 = uint8_t;

struct Operation {
    u8* src;
    u8* dst;
    size_t len;
};

void process(Operation* data) {
    for (size_t i = 0; i < 5; ++i) {
        Operation op = data[i];
        memcpy(op.dst, op.src, op.len);
    }
}
Rust calls memcpy this way:
; Move the address of memcpy into register r14
mov r14, qword ptr [rip + memcpy@GOTPCREL]
; Indirectly call the function through the pointer in r14.
; All later invocations reuse the value kept in r14.
call r14
while C++ does it directly:
; Every call to memcpy uses symbol directly in call instruction.
call memcpy@PLT
Why is that? Is there any performance benefits of one way or another?
Clang emits a reference to a PLT stub for memcpy. A PLT stub is a small function which loads the actual address of the target function from the GOT and then jumps to it. The PLT exists to allow lazy resolution of symbols.

On modern systems lazy symbol resolution no longer provides much benefit, and it requires the GOT to remain writable. There is a security mitigation called RELRO which makes the GOT read-only as early as possible. When RELRO is used, lazy symbol resolution is no longer possible, so routing calls through a PLT stub provides no benefit; many compilers therefore emit direct GOT loads rather than jumps to a PLT stub that performs the same GOT load. Rustc has RELRO enabled by default, and as such it too accesses the GOT directly rather than going through the PLT.
If it's helpful to anyone else curious about this discussion, these acronyms refer to tables in the Position Independent Code (PIC) system in Linux ELF shared libraries.
If you write memcpy calls like that in Rust, you need a really, really, really good reason, and profiling to back it up. You probably won't get a performance gain over copying Rust arrays.
> You probably won't get a performance gain over copying Rust arrays.
Fair comment. As an aside, I've had some recent experience with this on an aarch64 target, where memcpy to device memory (e.g. UIO-backed FPGA AXI bus regions, which have stricter access rules) would sometimes cause bus errors. Implementing the "copy as a Rust array" via ptr::read_volatile/ptr::write_volatile::<u64> made the code obey the memory access alignment rules, whereas copy_nonoverlapping ended up calling memcpy from libc. Since memcpy is intended for ordinary system memory, not device memory, it does not follow the device memory access rules and is not appropriate for such operations.
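The volatile variant can be sketched roughly like this (demonstrated against ordinary heap memory for illustration; on the real target the pointers would point into the mapped device region, and both sides are assumed to be u64-aligned):

```rust
use core::ptr;

/// Copy `len` u64 words using one aligned volatile access per word.
/// Unlike memcpy, the compiler may not widen, narrow, merge, or
/// reorder these accesses, which is what device memory typically
/// requires.
///
/// Safety: `src` and `dst` must be valid, u64-aligned, and
/// non-overlapping for `len` words.
unsafe fn volatile_copy_u64(dst: *mut u64, src: *const u64, len: usize) {
    for i in 0..len {
        let word = ptr::read_volatile(src.add(i));
        ptr::write_volatile(dst.add(i), word);
    }
}

fn main() {
    let src: Vec<u64> = (0..8).collect();
    let mut dst = vec![0u64; 8];
    unsafe { volatile_copy_u64(dst.as_mut_ptr(), src.as_ptr(), src.len()) };
    assert_eq!(src, dst);
}
```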
For example, a recent change to libc modified memcpy to copy in 128-bit wide chunks instead of 64-bit ones. That change alone was enough to cause a bus error on my platform.
Copying u64s via volatile pointers is significantly slower than memcpy, though. I don't have my measurements handy, but if I remember correctly it was thousands of times slower.
Spitballing here: this could be the lack of aliasing information causing pessimistic code generation (e.g. the compiler doesn't know that the output range doesn't overlap with the input range, so every operation must be serialized). It might be worth experimenting with first copying to a local buffer the compiler can see does not alias the output. On the other hand, part of being volatile is precisely that the compiler can't optimize the accesses, so maybe there's just not much it can do.
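That experiment could look something like the sketch below, for the "device memory as destination" direction from upthread. The function name and chunk size are made up, and whether it actually helps would need profiling on the real hardware:

```rust
use core::ptr;

/// Bounce the copy through a small stack buffer: a plain copy from
/// ordinary memory into a buffer the compiler can prove is unaliased
/// (free to optimize), then one aligned volatile write per word to
/// the device side (may not be widened or reordered).
///
/// Safety: both pointers must be valid, u64-aligned, and
/// non-overlapping for `len` words.
unsafe fn copy_via_local_to_device(dst: *mut u64, src: *const u64, len: usize) {
    const CHUNK: usize = 32; // hypothetical chunk size
    let mut buf = [0u64; CHUNK];
    let mut done = 0;
    while done < len {
        let n = CHUNK.min(len - done);
        // Non-volatile side: the optimizer sees `buf` aliases nothing.
        ptr::copy_nonoverlapping(src.add(done), buf.as_mut_ptr(), n);
        // Volatile side: one u64 access per word, in order.
        for i in 0..n {
            ptr::write_volatile(dst.add(done + i), buf[i]);
        }
        done += n;
    }
}

fn main() {
    let src: Vec<u64> = (0..100).collect();
    let mut dst = vec![0u64; 100];
    unsafe { copy_via_local_to_device(dst.as_mut_ptr(), src.as_ptr(), src.len()) };
    assert_eq!(src, dst);
}
```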