When I call such a function, Rust generates this code (compiled with -Ctarget-cpu=native):
vzeroupper
call *(%rax)
This single vzeroupper is currently taking up 13% of the execution time according to perf. Is there any way to mark the function pointer as AVX-enabled, so that the vzeroupper can be omitted?
I'm interested in the solution to this problem, but are you sure it's that instruction? perf has some skew and sometimes attributes cost to a nearby instruction, and the indirect call has to do a load.
You're right. If I benchmark inline assembly with and without vzeroupper, the overhead is closer to 4% of execution time.
That's still 4% I'd like to optimize away, but the impact is not as much as I initially thought.
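For anyone who wants to repeat that comparison, here is a rough sketch of how such a micro-benchmark could be set up. It is not the code I actually measured: `callee`, `run`, and the iteration count are made-up names and values for illustration, and it assumes an x86_64 target with AVX available at runtime. Both variants use the same `clobber_abi("C")` so the compiler sees the same register constraints in each loop, and the only difference is the `vzeroupper` itself.

```rust
use std::arch::asm;
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the real function behind the pointer.
#[inline(never)]
fn callee(x: u64) -> u64 {
    x.wrapping_mul(31).wrapping_add(7)
}

/// Calls `f` through a function pointer `iters` times and returns the
/// accumulated result plus the elapsed time. When `VZEROUPPER` is true,
/// a `vzeroupper` is executed before every call.
fn run<const VZEROUPPER: bool>(f: fn(u64) -> u64, iters: u64) -> (u64, Duration) {
    // Hide the pointer from the optimizer so the call stays indirect.
    let f = black_box(f);
    let mut acc = 0u64;
    let start = Instant::now();
    for i in 0..iters {
        // SAFETY: `vzeroupper` only zeroes the upper halves of the vector
        // registers; `clobber_abi("C")` tells the compiler those registers
        // are overwritten. The caller must ensure AVX is available.
        unsafe {
            if VZEROUPPER {
                asm!("vzeroupper", clobber_abi("C"),
                     options(nomem, nostack, preserves_flags));
            } else {
                // Empty block with identical clobbers, as the baseline.
                asm!("", clobber_abi("C"),
                     options(nomem, nostack, preserves_flags));
            }
        }
        acc = acc.wrapping_add(f(black_box(i)));
    }
    (acc, start.elapsed())
}

fn main() {
    if !is_x86_feature_detected!("avx") {
        println!("AVX not available, skipping");
        return;
    }
    const ITERS: u64 = 10_000_000; // arbitrary choice
    let (sum_plain, t_plain) = run::<false>(callee, ITERS);
    let (sum_vz, t_vz) = run::<true>(callee, ITERS);
    assert_eq!(sum_plain, sum_vz);
    println!("without vzeroupper: {t_plain:?}");
    println!("with vzeroupper:    {t_vz:?}");
}
```

The numbers from a loop like this depend heavily on the CPU and on what the callee does, so treat them as a rough indication rather than a precise cost.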
I've tried putting the call inside inline assembly in my actual code as well, but that unfortunately causes LLVM to start doing unnecessary copies to the stack, which makes the entire code much slower.
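To illustrate what "putting the call inside inline assembly" can look like, here is a minimal sketch. It is an assumption about the general shape, not my actual code: `scalar_callee` and `call_via_asm` are hypothetical names, the callee takes and returns a single `u64`, and the `sysv64` calling convention is fixed explicitly so the register assignments are known. Because the compiler no longer sees a call, it does not insert `vzeroupper` in front of it, but the `clobber_abi` forces everything live in caller-saved registers to be saved around the block.

```rust
use std::arch::asm;

// Hypothetical callee; `extern "sysv64"` pins the ABI so the argument
// arrives in rdi and the result comes back in rax.
extern "sysv64" fn scalar_callee(x: u64) -> u64 {
    x.wrapping_mul(31).wrapping_add(7)
}

/// Calls `f` through a register-indirect `call` emitted by inline assembly.
fn call_via_asm(f: extern "sysv64" fn(u64) -> u64, x: u64) -> u64 {
    let ret: u64;
    // SAFETY: `f` follows the sysv64 ABI, the argument and return registers
    // match that ABI, and `clobber_abi` marks every caller-saved register as
    // overwritten. `nostack` is deliberately not set, so the stack is
    // suitably aligned for a call.
    unsafe {
        asm!(
            "call {f}",
            f = in(reg) f,
            inout("rdi") x => _,
            out("rax") ret,
            clobber_abi("sysv64"),
        );
    }
    ret
}

fn main() {
    let r = call_via_asm(scalar_callee, 5);
    assert_eq!(r, scalar_callee(5));
    println!("{r}");
}
```

The clobber list is the likely source of the extra stack traffic: any vector value that is live across the block has to be saved and reloaded, since the compiler must assume the callee overwrites it.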