Yesterday, after talking about -C target-cpu vs. CPU feature detection, it occurred to me that there is a simple way to systematically check whether my actual application might benefit from CPU feature detection: run the benchmarks under -C target-cpu=native and look for the ones which get faster. Those are the ones that could benefit from run-time detection (if one is trying to produce a portable, distributable binary and not merely compile with -C target-cpu=native everywhere). Of course, this only finds opportunities that exist on the one CPU the benchmarks ran on.
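For anyone who hasn't seen it, run-time detection in a portable binary might look roughly like the sketch below. It is only an illustration, not code from my application: the function names (sum, sum_generic, sum_avx2) and the choice of AVX2 are assumptions made up for the example. The idea is to compile the same hot loop twice, once with the extra feature enabled, and pick a copy at run time.

```rust
// Hypothetical hot loop. #[inline(always)] lets it be inlined into, and
// recompiled inside, callers that enable extra target features.
#[inline(always)]
fn sum_generic(data: &[u32]) -> u32 {
    data.iter().fold(0u32, |acc, &x| acc.wrapping_add(x))
}

// A second copy of the same loop, compiled with AVX2 enabled regardless of
// the -C target-cpu the rest of the binary was built with.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(data: &[u32]) -> u32 {
    sum_generic(data)
}

// Portable entry point: checks the CPU at run time and dispatches.
pub fn sum(data: &[u32]) -> u32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: we just confirmed that this CPU supports AVX2.
            return unsafe { sum_avx2(data) };
        }
    }
    // Fallback for CPUs (or architectures) without AVX2.
    sum_generic(data)
}
```

Both paths return the same result, so sum(&[1, 2, 3, 4]) is 10 either way. Whether the AVX2 copy is actually faster is exactly what the benchmarks have to show.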
It's interesting that one particular benchmark takes 72% less time with -C target-cpu=native. That’s far more speedup than I’d expect for the kind of code it is (much more about accessing memory than number-crunching). I wouldn’t be surprised if there is something else to improve there and the previous codegen was suboptimal for reasons unrelated to the availability of particular instructions. Or perhaps I'll learn something new.
P.S. And of course, sometimes benchmark numbers are complete flukes.