I'm encountering a very strange issue that leads to an order-of-magnitude drop in the performance of a Rust program, and I'm wondering if anyone here can help explain it.
The program in question is CamillaDSP (GitHub: HEnquist/camilladsp, "A flexible cross-platform IIR and FIR engine for crossovers, room correction etc."). The performance issue is discussed on diyAudio, in the thread "CamillaDSP - Cross-platform IIR and FIR engine for crossovers, room correction etc.", page 212.
CamillaDSP is an audio DSP that applies a chain of filters to an audio stream. Specific filter configurations cause the performance of the main processing thread to drop by a factor of about ten for no apparent reason.
Features of this issue that make it very puzzling are:
- It occurs only with Intel CPUs. AMD seems to be immune.
- It occurs across multiple Intel CPUs with different architectures and generations. I've seen it on an N3700 (Atom), i3-3217u, i5-5250u and i5-9600k.
- It occurs only under Linux. CamillaDSP is cross-platform and when run on an Intel Windows box the problem doesn't occur.
- "perf stat" shows that when the performance issue occurs the number of instructions executed by the CPU remains the same (within 1%) but the instructions per cycle drops dramatically (one test on an Atom N3700 shows a drop in IPC from 0.70 to 0.16 - that's an average for the whole process so impact on the affected thread is probably greater).
- Setting the rustc option "target-cpu=x86-64-v2" avoids the issue. I can understand this giving a slight performance increase but not a factor of ten.
- ARM architectures are also not affected (tested on rpi3b+ and MacOS ARM).
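For reference, here is the arithmetic behind the "perf stat" numbers above as a small Rust sketch. The function names are made up for illustration and are not part of perf or CamillaDSP. If the instruction count stays the same, cycles scale as instructions divided by IPC, so the process-wide slowdown is simply the ratio of the two IPC values:

```rust
/// Instructions per cycle from two raw counter values such as those
/// printed by `perf stat` (hypothetical helper, for illustration only).
fn ipc(instructions: u64, cycles: u64) -> f64 {
    instructions as f64 / cycles as f64
}

/// Slowdown implied by an IPC drop when the instruction count is unchanged:
/// cycles = instructions / IPC, so cycles grow by ipc_before / ipc_after.
fn slowdown_from_ipc(ipc_before: f64, ipc_after: f64) -> f64 {
    ipc_before / ipc_after
}
```

With the N3700 figures, `slowdown_from_ipc(0.70, 0.16)` comes out at about 4.4 for the process as a whole, which fits with the affected thread itself slowing down by more than that.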
The fact this occurs only on Intel CPUs, only on Linux, and is related to instructions per cycle executed by the CPU makes me think this must be some sort of adverse interaction with the Intel architecture - rather than just badly performing code - but I'm stumped as to what it might be. Can anyone help explain what might be causing this?
Note I'm not the author of CamillaDSP but I came across this issue and am curious.
If it is so specific to CPU model and OS, I suspect this could be caused by spectre/meltdown mitigations or other microcode updates that your Linux has, but not Windows.
Perhaps the SSE2 code uses a lot of branches, which SSE4 code can avoid?
It would be useful to profile which functions exactly take such a hit, and compare their code when compiled for different SSE levels.
Thank you for your suggestions. I too wondered if it's associated with a spectre/meltdown mitigation. However, I don't think it's caused by different microcode updates on Windows vs Linux. The problem doesn't occur when running natively on Windows but does occur when running the Linux version inside WSL2 on Windows. This is essentially a VM running a Linux kernel, so presumably it uses the same microcode as the host Windows.
I've used perf to get a better idea of where the slowdown is happening. Here are the results compiled for target-cpu=x86-64 showing the performance slowdown:
(It looks like the forum will only let me include a single image per post so I'll have to continue these results across multiple posts).
Here are the results with an identical configuration but compiled for target-cpu=x86-64-v2 showing much better performance:
So it seems the function affected is Biquad::process_waveform. What makes this very confusing is that Biquad::process_waveform spends almost all its time in a tight inner loop that iterates over a chunk of audio data (2k samples represented as an array of floating point numbers) and performs some adds and multiplies on each sample. I've compared the code emitted for that inner loop with and without "target-cpu=x86-64-v2", and the instructions do change, but not in any way that should result in such a major performance change. The loop contains a linear sequence of instructions with no branches or complex memory accesses.
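To make the shape of that inner loop concrete, here is a minimal sketch of a biquad processing a chunk in place. This is an assumption for illustration, not CamillaDSP's actual source: the struct layout, field names and the choice of the transposed direct form II are mine.

```rust
/// Hypothetical biquad state (not CamillaDSP's actual type).
struct Biquad {
    // Feed-forward coefficients.
    b0: f64,
    b1: f64,
    b2: f64,
    // Feedback coefficients (a0 assumed normalised to 1).
    a1: f64,
    a2: f64,
    // The two state variables carried from sample to sample.
    s1: f64,
    s2: f64,
}

impl Biquad {
    /// Filter a chunk of samples in place (transposed direct form II).
    fn process_waveform(&mut self, waveform: &mut [f64]) {
        for sample in waveform.iter_mut() {
            let input = *sample;
            let out = self.s1 + self.b0 * input;
            // Each update depends on `out`, so consecutive samples form
            // one long chain of dependent adds and multiplies.
            self.s1 = self.s2 + self.b1 * input - self.a1 * out;
            self.s2 = self.b2 * input - self.a2 * out;
            *sample = out;
        }
    }
}
```

A loop like this has no branches apart from the loop counter and touches memory sequentially, which matches the description above. The only unusual property is that every sample depends on the state left by the previous one.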
To make this more confusing it's possible to tweak the program's configuration slightly so the work done by Biquad::process_waveform is still the same but it performs well even when compiled for "target-cpu=x86-64". Here's an example of that:
In practice compiling with "target-cpu=x86-64-v2" seems to avoid the problem, so getting to the bottom of the problem isn't critical. However the behaviour is very odd, so before just adopting that solution I wanted to check if anyone might be able to help explain the root problem.
Finally here are the results of "perf stat" for the "good" and "bad" performance cases. This seems to show the issue is caused by a drop in IPC, but there's nothing that might account for that - as I said the bulk of the time is spent in a very tight linear loop with no branches.
What's the modification?
BTW, is the function multi-threaded? False sharing could be a major invisible slowdown, and depend on memory access patterns and timing of the access.
The modification is to add a Gain filter to the start of the filter chain. That accounts for why you see the function Gain::process_waveform in the third screenshot but not the others. Gain::process_waveform is also very simple: it iterates through the same chunk of audio data that Biquad::process_waveform uses and multiplies each sample by a gain value. There's a bit more complexity to give a smooth volume transition when a parameter changes, but none of that logic is triggered by these tests.
Both Biquad::process_waveform and Gain::process_waveform are single threaded and run in the same thread.
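For comparison, the gain step as described boils down to something like the sketch below. The function name is hypothetical, and the volume-ramping logic mentioned above is deliberately left out, so this is not the real Gain::process_waveform.

```rust
/// Multiply every sample in the chunk by a constant linear gain.
/// Hypothetical stand-in for the steady-state path of a gain filter.
fn apply_gain(waveform: &mut [f64], gain: f64) {
    for sample in waveform.iter_mut() {
        *sample *= gain;
    }
}
```

Unlike the biquad, each multiplication here is independent of the others, so the loop carries no state from one sample to the next.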
What does perf annotate show for the two process_waveform functions?
Here is the result when compiled for target-cpu=x86-64 with slow performance
Here is the same binary but with the Gain filter added, so giving fast performance
This is compiled with target-cpu=x86-64-v2 (fast performance)
Those all showed Biquad::process_waveform.
This is Gain::process_waveform, compiled for target-cpu=x86-64
Am I reading this right? The slow and fast performance loops have the same instructions!?
This is cursed.
IIUC the slow result was measured on an Intel CPU, while the fast one was measured on an AMD one.
It's possible to get something like this if branch targets are not aligned. I'd be curious to know what address this code is loaded at.
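One rough way to check the load address from inside the program is to print a function's address and its offset within a 64-byte block. This is only a sketch: `process` is a placeholder, the 64-byte boundary is an assumption about the block size that matters, and the number you get is the function's entry point, not the top of the hot loop (that address would have to be read from the perf annotate output instead).

```rust
/// Placeholder for the function whose placement we are curious about.
fn process(waveform: &mut [f64]) {
    for sample in waveform.iter_mut() {
        *sample *= 2.0;
    }
}

/// Address of the function's entry point as an integer.
fn address_of(f: fn(&mut [f64])) -> usize {
    f as usize
}

/// Offset of an address within an aligned block of `boundary` bytes;
/// 0 means the address sits exactly on the boundary.
fn offset_within(addr: usize, boundary: usize) -> usize {
    addr % boundary
}
```

Calling `offset_within(address_of(process), 64)` in both the fast and the slow build would show whether the code landed at a different offset in the two cases.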
No, these two images are the same CPU, same hot instructions, but some extra no-op work was performed before calling this function.
There's almost certainly some kind of microarchitectural stall happening. Intel VTune is the proprietary app that can show you all the weird non-generic per-CPU performance counters that let you narrow it down, I think.
perf can do it too, if you tell it what the counters' IDs are?
The difference in performance between Intel and AMD may be caused by the different latency of subpd instructions. Intel CPUs usually have a latency of 4 cycles for these instructions, while AMD Zen has a latency of 3 cycles. Both have a throughput of 0.5 cycles per instruction (there is a data dependency between the instructions, so the code cannot achieve it), and instructions per cycle should not fall below 0.25. The difference in latency by itself should not cause an order-of-magnitude difference, and it does not explain why the issue does not get triggered on Windows. Maybe Windows and Linux use different versions of microcode?
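The 0.25 figure follows from a simple bound, sketched below in Rust with a made-up function name. It assumes the worst case, where every instruction has to wait for the result of the one before it, so one instruction completes every `latency` cycles.

```rust
/// Lower bound on IPC for a chain of fully dependent instructions:
/// one instruction retires every `latency_cycles` cycles.
/// Hypothetical helper, for illustration only.
fn ipc_floor_for_dependent_chain(latency_cycles: f64) -> f64 {
    1.0 / latency_cycles
}
```

This gives 0.25 for a 4-cycle latency and about 0.33 for a 3-cycle latency. The 0.16 reported by "perf stat" for the slow case is below both, so instruction latency alone cannot account for it.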
I would try reporting cache misses, to see if there is a major difference.
Google links shows me perf events (