Understanding target-cpu

Hello to all,

I have a question about target-cpu. On my server, the native target-cpu resolves to "cannonlake". I ran a sweep over all the possible target-cpu values, and to my surprise target-cpu "knl" gave the best performance. This doesn't make much sense to me, so I want to ask: what could have happened?

The program is an n-body simulation; it only does fairly common mathematical calculations.

Obviously I always compile with --release and opt-level=3.

Can anyone help me understand this? If I change servers, I will have to run with every target-cpu again and keep the best one. Does it make sense to do so?

Thank you.

Unless you've benchmarked it very very carefully in a very controlled way (not just running the programs one after another with different settings), it's likely to be a random result.

Performance of modern CPUs is very noisy and uneven. They perform a very dynamic balancing act of thermal and power distribution across cores, vary their frequency with temporary turbo boosts, and have multiple levels of caches and branch predictors that are sensitive to bit patterns in memory addresses.
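As a rough sketch of what a more controlled measurement could look like: warm up first, time many repetitions, and compare the minimum and median rather than a single run. The names `time_runs` and `toy_kernel` are made up for illustration, and `toy_kernel` is only a stand-in for the real n-body step. A dedicated benchmarking harness would do this more rigorously.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `work` `warmup` times untimed, then `runs` times timed.
/// Returns the timed samples sorted from fastest to slowest.
fn time_runs<F: FnMut()>(mut work: F, warmup: usize, runs: usize) -> Vec<Duration> {
    for _ in 0..warmup {
        work();
    }
    let mut samples: Vec<Duration> = (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect();
    samples.sort();
    samples
}

/// Hypothetical stand-in for the real workload (NOT an n-body step):
/// sums 1/x for x = 1..=n using a square root, just to burn floating-point cycles.
fn toy_kernel(n: usize) -> f64 {
    let mut acc = 0.0_f64;
    for i in 0..n {
        let x = i as f64 + 1.0;
        acc += 1.0 / (x * x).sqrt();
    }
    acc
}

fn main() {
    // black_box keeps the optimizer from deleting the work or folding the input.
    let samples = time_runs(
        || {
            black_box(toy_kernel(black_box(1_000_000)));
        },
        3,
        15,
    );
    let min = samples[0];
    let median = samples[samples.len() / 2];
    let max = samples[samples.len() - 1];
    println!("min {:?}  median {:?}  max {:?}", min, median, max);
}
```

If the spread between min and max for one target-cpu is as large as the gap between two target-cpus, the ranking between them is probably noise.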

Yes, I checked and I get the same results. My program is very regular, so I always get the same times too.

Is it bad practice to use a target-cpu other than the native one? Am I missing something? I understand that target-cpu=knl is for Knights Landing, and I am not on that platform (but the instructions are similar).

You would have to look at the generated assembly to verify. But it's possible the auto-vectorizer is weighted toward AVX-512 instructions for a knl target, since that was the main goal of that platform. Cannon Lake, being newer, may support a superset of the AVX-512 instructions (there are a ton of variations), but if it lacks any instruction that the compiler chooses to emit, the program could crash with an illegal instruction.

The reason the cannonlake target may avoid AVX-512 is that when the CPU executes enough AVX-512 instructions, it has to reduce its clock speed to stay stable. The 512-bit instructions do twice the work and produce more heat, so you take something like a 30% clock reduction to "double" the FLOPS (AVX2 can cause this to a lesser extent as well). So for the average program there is a threshold where enough AVX-512 slows the clock without gaining enough to make up for it (the average program doesn't run all cores at 100%). But when most of the cores are doing heavy AVX-512 work, like your simulation, the extra throughput can more than make up for the frequency reduction.
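To get a rough idea of whether a binary built for another target-cpu risks an illegal instruction, one option is to ask the running CPU which extensions it reports. This is a minimal sketch using the standard `is_x86_feature_detected!` macro. The function name `detected_features` is made up, and the four features listed are only an illustrative subset, not the full set that `-C target-cpu=knl` enables. Build this check without any target-cpu override so the check itself cannot contain unsupported instructions.

```rust
/// Reports whether the running CPU supports a few vector extensions.
/// The list is an illustrative subset chosen for this example.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn detected_features() -> Vec<(&'static str, bool)> {
    vec![
        ("avx2", is_x86_feature_detected!("avx2")),
        ("fma", is_x86_feature_detected!("fma")),
        ("avx512f", is_x86_feature_detected!("avx512f")),
        ("avx512cd", is_x86_feature_detected!("avx512cd")),
    ]
}

/// On non-x86 targets there is nothing to report.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn detected_features() -> Vec<(&'static str, bool)> {
    Vec::new()
}

fn main() {
    for (name, present) in detected_features() {
        let status = if present { "yes" } else { "NO" };
        println!("{name:<10} {status}");
    }
}
```

A "yes" for every listed feature does not prove a knl build is safe, since the compiler may use extensions that are not in this list. Looking at the generated assembly is still the reliable check.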


Excellent! Thank you for clarifying this for me!