I see on Godbolt that the C++ version compiles to 2 instructions, while the Rust version compiles to 5, including extra mov instructions that slow the code down. Can anyone help me understand why this is happening, and whether there are any tricks I can use to make Rust generate optimal code like C++?
In practice, I suspect you'll find that you don't need to care about this - Rust's inlining will fix it for you most of the time.
However, it might be worthwhile raising a bug report, since the compiler should be using a better ABI for Rust-internal stuff, and thus generating the same code.
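To illustrate the inlining point, here is a minimal sketch. The names `add4` and `add_arrays` are hypothetical stand-ins, since the function from the question isn't shown here. The idea is that the extra moves belong to the call boundary, so once the helper is inlined into its caller there is no boundary left to pay for. Whether the optimizer actually inlines it is its own decision, so it is worth checking the output on Godbolt with optimizations enabled.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128, _mm_add_ps, _mm_loadu_ps, _mm_storeu_ps};

// Hypothetical helper standing in for the function from the question.
// `#[inline]` is only a hint; small functions in the same crate are usually
// inlined at higher optimization levels even without it.
#[cfg(target_arch = "x86_64")]
#[inline]
fn add4(a: __m128, b: __m128) -> __m128 {
    // SSE is part of the x86_64 baseline, so this intrinsic is always usable here.
    unsafe { _mm_add_ps(a, b) }
}

// Caller that loads two arrays, adds them with the helper, and stores the result.
#[cfg(target_arch = "x86_64")]
pub fn add_arrays(a: &[f32; 4], b: &[f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    // SAFETY: both inputs point to 4 valid f32 values, `out` has room for 4,
    // and the unaligned load/store variants have no alignment requirement.
    unsafe {
        let va = _mm_loadu_ps(a.as_ptr());
        let vb = _mm_loadu_ps(b.as_ptr());
        _mm_storeu_ps(out.as_mut_ptr(), add4(va, vb));
    }
    out
}

// Plain fallback so the sketch also builds on non-x86_64 targets.
#[cfg(not(target_arch = "x86_64"))]
pub fn add_arrays(a: &[f32; 4], b: &[f32; 4]) -> [f32; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}
```

Calling `add_arrays(&[1.0, 2.0, 3.0, 4.0], &[10.0, 20.0, 30.0, 40.0])` adds the two arrays lane by lane.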
This is intentional. The C calling convention for passing vectors depends on the enabled target features, so using it would make it easy to end up with an ABI incompatibility. For SSE vectors on most x86_64 targets it doesn't really matter, since SSE2 is required by x86_64 and is therefore unconditionally enabled on most x86_64 targets. For AVX vectors, however, it must be possible to call a function that has AVX enabled from one that doesn't, and vice versa.
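The AVX case can be sketched roughly as follows. The function names are hypothetical, and the sketch assumes the usual runtime-dispatch pattern: `add8` is compiled without AVX, yet it calls `add8_avx`, which is compiled with AVX enabled. In this sketch the data crosses the call boundary by reference, so no 256-bit vector type appears in a function signature at all.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm256_add_ps, _mm256_loadu_ps, _mm256_storeu_ps};

// Compiled with AVX enabled for this one function only (hypothetical name).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add8_avx(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    // SAFETY: each pointer refers to 8 valid f32 values, and the unaligned
    // load/store variants have no alignment requirement.
    unsafe {
        let va = _mm256_loadu_ps(a.as_ptr());
        let vb = _mm256_loadu_ps(b.as_ptr());
        _mm256_storeu_ps(out.as_mut_ptr(), _mm256_add_ps(va, vb));
    }
    out
}

// Portable fallback used when AVX is unavailable.
fn add8_scalar(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}

// Compiled WITHOUT AVX; picks an implementation at runtime.
pub fn add8(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: the CPU was just checked for AVX support.
            return unsafe { add8_avx(a, b) };
        }
    }
    add8_scalar(a, b)
}
```

Because caller and callee are built with different target features, they cannot both rely on a vector register convention that only exists when AVX is enabled, which is the incompatibility described above.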