What can cause pessimization like std::hint::black_box for pass-by-value?

I was doing some benchmarks on when it is faster to pass by value vs by reference for Copy types.

Basically when the compiler is able to optimize, it doesn't seem to make a difference even for fairly large types (a function that takes four Mat4 benchmarks the same by-value and by-reference).

However, when wrapping the inputs with std::hint::black_box and calling the function, the by-value version of the above function is about 70% slower than the by-reference version.
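A minimal sketch of the kind of comparison described here, not the actual benchmark code. The Mat4 type and the sum_by_value / sum_by_ref functions are hypothetical stand-ins (a 64-byte Copy struct and a trivial body), and timing is left out so the snippet only shows where black_box sits in each call.

```rust
use std::hint::black_box;

// Hypothetical stand-in for a 4x4 matrix: 64 bytes, Copy.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Mat4(pub [f32; 16]);

impl Mat4 {
    pub const IDENTITY: Mat4 = Mat4([
        1.0, 0.0, 0.0, 0.0,
        0.0, 1.0, 0.0, 0.0,
        0.0, 0.0, 1.0, 0.0,
        0.0, 0.0, 0.0, 1.0,
    ]);
}

// Takes four matrices by value and sums every element.
pub fn sum_by_value(a: Mat4, b: Mat4, c: Mat4, d: Mat4) -> f32 {
    [a, b, c, d].iter().map(|m| m.0.iter().sum::<f32>()).sum()
}

// Same computation, but the matrices are borrowed.
pub fn sum_by_ref(a: &Mat4, b: &Mat4, c: &Mat4, d: &Mat4) -> f32 {
    [a, b, c, d].iter().map(|m| m.0.iter().sum::<f32>()).sum()
}

fn main() {
    let m = Mat4::IDENTITY;

    // By value: each black_box call receives and returns a whole Mat4.
    let v = sum_by_value(black_box(m), black_box(m), black_box(m), black_box(m));

    // By reference: each black_box call only passes a reference through.
    let r = sum_by_ref(black_box(&m), black_box(&m), black_box(&m), black_box(&m));

    assert_eq!(v, r);
    println!("by value: {v}, by reference: {r}");
}
```

Note that in the by-value case black_box is applied to four 64-byte values, while in the by-reference case it is applied to four references, so the two calls may not be penalized equally by black_box itself. That is an assumption worth checking in the generated assembly rather than a measured result.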

Are there situations that could trigger this kind of pessimization like black_box?

It's a little bit more ergonomic to write my functions as pass-by-value, but then I'm worried that in some cases they will generate slower code.

unless I have a strict performance budget requirement, I always prioritize code maintainability over performance.

for Copy types, I would say passing by value should be preferred, especially for smaller scalar types that can be passed in a register (or a couple of registers). this is usually better optimized, because the optimizer has fewer levels of indirection to look through.

these types are usually said to have "value semantics", and there's no benefit to passing them by reference. one exception is a generic context where the algorithm works the same for Copy and non-Copy types, in which case using references is perfectly valid, and after monomorphization, the optimizer should do a decent job for Copy types.
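a small sketch of both cases, with made-up names (lerp and largest are just illustrations, not from any particular library): a scalar function that takes its arguments by value, and a generic function that borrows so it works for Copy and non-Copy element types alike.

```rust
// Small Copy scalars: pass by value.
pub fn lerp(a: f64, b: f64, t: f64) -> f64 {
    a + (b - a) * t
}

// Generic over T: borrowing works whether or not T is Copy.
// Returns None for an empty slice.
pub fn largest<T: PartialOrd>(items: &[T]) -> Option<&T> {
    let mut best = items.first()?;
    for item in items {
        if item > best {
            best = item;
        }
    }
    Some(best)
}

fn main() {
    let nums = [1.5_f64, 4.0, 2.5]; // Copy element type
    let names = [String::from("a"), String::from("c"), String::from("b")]; // non-Copy

    assert_eq!(lerp(0.0, 10.0, 0.25), 2.5);
    assert_eq!(largest(&nums), Some(&4.0));
    assert_eq!(largest(&names).map(String::as_str), Some("c"));
}
```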

that said, for performance optimizations, decisions should be made based on measurement, not "feels" or "worries".

be very careful about what you are actually measuring. if it's difficult to measure the real code, at least make sure the "benchmarks" resemble what really happens in the real code.

from my personal experience, it's very likely the huge performance difference you saw is not caused by the difference between by-value and by-reference, but by dead code being eliminated in one case, which is prevented when you use black_box.

the single most important optimization a compiler does is inlining, because it enables many more optimization opportunities.

in many contrived "micro"-benchmarks I have seen, the entire body of the hot loop could be eliminated as dead code because the result was never used, which is a common trap people fall into.
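to make the trap concrete, here is a hedged sketch with a hypothetical dot function. in the first loop the result is thrown away, so in an optimized build the compiler is allowed to remove the work entirely. in the second loop the inputs and the result go through std::hint::black_box, which asks the compiler (on a best-effort basis) to treat them as opaque. I'm not printing expected timings because they depend on the compiler version and flags.

```rust
use std::hint::black_box;
use std::time::Instant;

// Hypothetical function under test.
pub fn dot(a: [f32; 3], b: [f32; 3]) -> f32 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

fn main() {
    let a = [1.0, 2.0, 3.0];
    let b = [4.0, 5.0, 6.0];

    // Result is discarded and inputs are constants: with optimizations on,
    // this loop may be removed completely, so the timing can be meaningless.
    let start = Instant::now();
    for _ in 0..1_000_000 {
        let _ = dot(a, b);
    }
    let discarded = start.elapsed();

    // Inputs and result pass through black_box, so the call is much less
    // likely to be const-folded or dropped as dead code.
    let start = Instant::now();
    for _ in 0..1_000_000 {
        black_box(dot(black_box(a), black_box(b)));
    }
    let kept = start.elapsed();

    println!("result discarded: {discarded:?}, result kept: {kept:?}");
}
```

compile with --release when trying this, otherwise neither loop is optimized and the comparison says nothing.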

I would definitely always pass scalar types by value, and the benchmarks I ran showed no performance difference for scalar Copy types (f64), as expected. With Vec2, pass-by-value was the same or slightly faster with one or two Vec2 arguments. But at Vec3 and larger, pass-by-value started getting slower.

This code gets used in the physics and collision for a game, so the performance budget is about 20ms per frame.

I did some things to try to make sure the code wasn't eliminated, like collecting the output of the functions and using it after the loop, and also generating random inputs to make sure the compiler wasn't able to const-evaluate or memoize the result.
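Roughly, the approach looks like the sketch below. This is a simplified illustration, not the real benchmark: Vec3, length_sq, make_inputs and the tiny XorShift generator are all hypothetical stand-ins (the generator is only there so the snippet needs nothing beyond std).

```rust
use std::time::Instant;

// Hypothetical 3-component vector, Copy.
#[derive(Clone, Copy, Debug)]
pub struct Vec3 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
}

// Minimal xorshift generator, used here only to avoid an external crate.
// The seed must be non-zero.
pub struct XorShift(pub u64);

impl XorShift {
    // Returns a value in [0, 1) built from the top 24 bits of the state.
    pub fn next_f32(&mut self) -> f32 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 40) as f32 / (1u64 << 24) as f32
    }
}

// Hypothetical function under test, taking its argument by value.
pub fn length_sq(v: Vec3) -> f32 {
    v.x * v.x + v.y * v.y + v.z * v.z
}

// Inputs are generated at run time so they cannot be const-evaluated.
pub fn make_inputs(n: usize, seed: u64) -> Vec<Vec3> {
    let mut rng = XorShift(seed);
    (0..n)
        .map(|_| Vec3 { x: rng.next_f32(), y: rng.next_f32(), z: rng.next_f32() })
        .collect()
}

fn main() {
    let inputs = make_inputs(100_000, 0x9E37_79B9_7F4A_7C15);
    let mut outputs = Vec::with_capacity(inputs.len());

    let start = Instant::now();
    for &v in &inputs {
        outputs.push(length_sq(v));
    }
    let elapsed = start.elapsed();

    // Using the outputs after the loop keeps the work from being dead code.
    let total: f32 = outputs.iter().sum();
    println!("{} calls in {elapsed:?}, checksum {total}", outputs.len());
}
```

One caveat with this shape: the timed loop also includes the cost of pushing into the output Vec, which can dominate when the function itself is tiny.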