unless I have a strict performance budget, I always prioritize code maintainability over performance.
for Copy types, I would say passing by value should be preferred, especially for smaller scalar types that can be passed in a register (or a couple of registers). this is usually better optimized too, because the optimizer has fewer levels of indirection to look through.
these types are usually said to have "value semantics", and there's no benefit to passing them by reference. one exception is generic code where the algorithm works the same for Copy and non-Copy types, in which case using references is perfectly valid, and after monomorphization, the optimizer should do a decent job for Copy types.
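to make the two cases concrete, here is a minimal sketch. the `Point` type and the `translate` and `largest` functions are made up for illustration, they are not from any particular library. the first function takes a small Copy type by value, the second is generic and takes references so it works for Copy and non-Copy element types alike.

```rust
// A small `Copy` type (hypothetical): two `i32`s fit comfortably in
// registers, so passing it by value is the natural choice.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Point {
    x: i32,
    y: i32,
}

// By-value parameter and by-value return: plain value semantics.
fn translate(p: Point, dx: i32, dy: i32) -> Point {
    Point { x: p.x + dx, y: p.y + dy }
}

// Generic code: working through references lets the same algorithm
// serve `Copy` types (like `i32`) and non-`Copy` types (like `String`).
fn largest<T: PartialOrd>(items: &[T]) -> Option<&T> {
    let mut iter = items.iter();
    // An empty slice has no largest element.
    let mut best = iter.next()?;
    for item in iter {
        if item > best {
            best = item;
        }
    }
    Some(best)
}

fn main() {
    let moved = translate(Point { x: 1, y: 2 }, 3, 4);
    println!("{:?}", moved);

    // Same generic function, instantiated for a `Copy` element type...
    println!("{:?}", largest(&[3, 7, 5]));
    // ...and for a non-`Copy` element type.
    println!("{:?}", largest(&[String::from("a"), String::from("b")]));
}
```

whether the `i32` instantiation of `largest` ends up as good as a hand-written by-value version depends on the optimizer, so treat that as something to measure rather than assume.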
that said, for performance optimizations, decisions should be made based on measurement, not "feels" or "worries".
be very careful about what you are actually measuring. if it's difficult to measure the real code, at least make sure the "benchmarks" resemble what really happens in the real code.
from my personal experience, it's very likely that the huge performance difference you saw is not caused by the difference between by-value and by-reference, but by dead code elimination, which is prevented if you use std::hint::black_box.
the single most important optimization a compiler does is inlining, because it enables many more optimization opportunities.
in many contrived "micro"-benchmarks I have seen, the entire body of the hot loop could be eliminated as dead code because the result was never used. this is a common trap people fall into.
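a rough sketch of that trap, using only the standard library. the `sum_of_squares` workload and the iteration counts are arbitrary placeholders, not taken from anyone's real benchmark. the first loop throws the result away, so in a release build the optimizer is free to delete the work. the second loop routes the input and the result through `std::hint::black_box`, which asks the compiler to treat them as used.

```rust
use std::hint::black_box;
use std::time::Instant;

// Hypothetical workload: wrapping arithmetic avoids overflow panics
// in debug builds.
fn sum_of_squares(n: u64) -> u64 {
    (0..n).fold(0u64, |acc, i| acc.wrapping_add(i.wrapping_mul(i)))
}

fn main() {
    const ITERS: u32 = 1_000;

    // Result is discarded: after inlining, the optimizer may remove
    // the whole loop body, so this can end up timing nothing at all.
    let start = Instant::now();
    for _ in 0..ITERS {
        let _ = sum_of_squares(10_000);
    }
    let unguarded = start.elapsed();

    // `black_box` on the input hides the constant from the optimizer,
    // and `black_box` on the output makes the result count as used.
    let start = Instant::now();
    for _ in 0..ITERS {
        black_box(sum_of_squares(black_box(10_000)));
    }
    let guarded = start.elapsed();

    println!("unguarded: {:?}", unguarded);
    println!("guarded:   {:?}", guarded);
}
```

note that `black_box` is documented as best-effort, so it is a hint rather than a guarantee. if the unguarded loop looks suspiciously fast in a release build, that is a sign the work was eliminated, not that the code is that fast.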