Weird SIMD behavior

I have a simple Rust function that parses varint-encoded integers.

struct Reader {
    pub pos: usize,
    pub corrupt: bool,
}

impl Reader {
    fn read_var(&mut self, b: &[u8]) -> u64 {
        let mut i = 0u64;
        let mut j = 0;
        loop {
            if j > 9 {
                self.corrupt = true;
                return 0;
            }
            let v = self.read_u8(b);
            i |= u64::from(v & 0x7F) << (j * 7);
            if (v >> 7) == 0 {
                return i;
            } else {
                j += 1;
            }
        }
    }

    fn read_u8(&mut self, b: &[u8]) -> u8 {
        if self.pos < b.len() {
            let v = b[self.pos];
            self.pos += 1;
            v
        } else {
            self.corrupt = true;
            0
        }
    }
}

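For concreteness, here is a self-contained, runnable version of the reader with a tiny demo. The input bytes `[0xAC, 0x02]` are my own example; they are the standard LEB128/varint encoding of 300.

```rust
struct Reader {
    pos: usize,
    corrupt: bool,
}

impl Reader {
    fn read_var(&mut self, b: &[u8]) -> u64 {
        let mut i = 0u64;
        let mut j = 0;
        loop {
            // A u64 needs at most 10 varint bytes; anything longer is corrupt.
            if j > 9 {
                self.corrupt = true;
                return 0;
            }
            let v = self.read_u8(b);
            // Low 7 bits are payload, placed at the next 7-bit slot.
            i |= u64::from(v & 0x7F) << (j * 7);
            // High bit clear means this was the last byte.
            if (v >> 7) == 0 {
                return i;
            }
            j += 1;
        }
    }

    fn read_u8(&mut self, b: &[u8]) -> u8 {
        if self.pos < b.len() {
            let v = b[self.pos];
            self.pos += 1;
            v
        } else {
            self.corrupt = true;
            0
        }
    }
}

fn main() {
    let mut r = Reader { pos: 0, corrupt: false };
    // 0xAC = 0b1010_1100: low 7 bits = 44, continuation bit set.
    // 0x02 contributes 2 << 7 = 256, so the decoded value is 300.
    let v = r.read_var(&[0xAC, 0x02]);
    println!("{} {}", v, r.corrupt); // prints "300 false"
}
```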
I have two versions of code generated by different compilers:

The non-SIMD version is relatively easy to understand. It inlines read_u8 and unrolls the loop.
I am not familiar with SIMD instructions, but the SIMD version seems to have a similar structure.

One weird thing: when I run the SIMD version concurrently on multiple threads on a multi-core machine (each thread with its own Reader object), processing throughput drops significantly, yet CPU utilization is higher than in the single-threaded version.
The non-SIMD version throughput scales linearly with concurrency level.

Does anyone know how could this happen?

Some related questions:

  • The code does not look like it could benefit from SIMD. Why is SIMD generated?
  • Is it possible to disable SIMD generation for a single function?

Please link cross-posts in your question to avoid duplicated effort:


This is good advice in general, and in any case thanks for linking the post!

What you can’t know: it looks like in this case (unless I’m mis-reading the timestamps), the Stack Overflow post was posted after this one, and possibly[1] in response to our spam filter temporarily hiding the topic until I could approve it, about 40 min later :slight_smile:

  1. though if that’s the case, that happened remarkably quickly; but even so, the thread being hidden means that even adding a link between the posts after the fact was technically impossible ↩︎


Trying my best to earn the honorary title of cross-post detective #1 (maybe you as moderator can do something about it?) :smile:

My bad, I saw this post after the one on SO. Indeed the timestamps suggest this post was created earlier.



The target feature RFC calls this "unconditional code generation" and you can disable target features by adding a "-" in front of the feature name. For example,

#[target_feature(enable = "-sse4")]
unsafe fn no_sse4() {
    // ...
}

Compilers often apply various optimizations, including SIMD, based on heuristics and cost models. It's possible that the compiler has decided to generate SIMD code for some reason, even though it might not be the most efficient choice for this specific workload.

I'm not sure. Maybe the data you pass in tends to hit performance cliffs for the SIMD implementation? It might also be that SIMD isn't as helpful when there is a dependency between consecutive bytes in the stream.

I've only ever dabbled in SIMD, so I thought I'd ask ChatGPT. This is what it said:

While SIMD can provide significant performance improvements for some workloads, there are cases where it might not be as performant as expected, or even result in decreased performance. Let's explore some of the reasons why this might happen:

  1. Data Dependency: SIMD works best when data can be processed independently and in parallel. If there are data dependencies between iterations or parts of the data that need to be processed serially, SIMD may not provide the expected performance benefits.

  2. Instruction Overhead: While SIMD can accelerate data processing, it can also introduce overhead in terms of instruction decoding, scheduling, and execution. If the SIMD version of the code has higher instruction-level overhead than the non-SIMD version, it could potentially negate the performance gains from parallel processing.

  3. Memory Bottlenecks: SIMD processing can place increased pressure on the memory subsystem (e.g., cache, memory bandwidth) due to the higher volume of data being processed simultaneously. If the memory subsystem becomes a bottleneck, the SIMD version of the code may not scale as well with concurrency, leading to reduced performance.

  4. Workload Imbalance: SIMD works best when data elements can be evenly divided among the available SIMD lanes. If the data distribution or the workload is not well-suited for SIMD, the processor might not be able to fully utilize all SIMD lanes, leading to reduced performance gains.

  5. Thread Contention: In a multi-threaded scenario, SIMD can potentially increase contention for shared resources such as caches, memory bandwidth, and execution units. This contention can lead to increased latency and reduced performance, especially when multiple threads are executing SIMD code concurrently.

  6. Compiler Heuristics: Compilers use heuristics and cost models to decide when to apply SIMD optimizations. These heuristics might not always produce the most efficient SIMD code for a particular use case, leading to suboptimal performance.

In your specific case, it's possible that one or more of these factors are contributing to the observed performance difference between the SIMD and non-SIMD versions of your code. To further investigate, you might want to profile your code with different SIMD configurations, examine cache behavior, and experiment with different data sets to determine the root cause of the performance discrepancy.

Keep in mind that the best optimization strategy depends on the specifics of your workload and target hardware. It's important to thoroughly profile and test your code to determine the most effective approach for your use case.


Thanks for the detailed response!
I figured out the reason.
It turned out the SIMD version was built with "RUSTFLAGS=-Cinstrument-coverage", which injects additional instructions (the SIMD instructions in question) to track code execution.

I'm not quite sure how these injected instructions work, how they affect performance, or why performance got so much worse with more concurrency.
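For anyone wanting to reproduce the comparison, the two builds differ only in that flag (assuming a standard Cargo project; this is just the build invocation, nothing else from the setup is shown):

```shell
# Build with LLVM source-based coverage instrumentation injected:
RUSTFLAGS="-Cinstrument-coverage" cargo build --release

# Plain build at the same optimization level, for comparison:
cargo build --release
```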


I'm assuming they work by setting aside a counter for each instrumentation point and using atomic instructions to increment that counter. When you've got multiple threads all running the same code (i.e. Reader::read_var()) it means all those threads will be updating the same counter in memory, and the memory bus will need to communicate all those atomic increments between the different cores on your machine.

All that extra communication isn't free, especially when the same code is running in a hot loop.
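A minimal sketch of that contention pattern, using an atomic counter to model the coverage counter as described above (whether the real instrumentation uses atomic or plain stores, the cache line holding the counter still has to migrate between cores; the function name and iteration counts here are made up for illustration):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Every thread increments the same counter, the way a per-instrumentation-point
// coverage counter inside a hot loop would be shared by all threads executing
// that code. Each increment forces exclusive ownership of the cache line.
fn hammer_shared_counter(threads: u64, iters: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    // Relaxed ordering suffices for a statistics counter,
                    // but the hardware still serializes line ownership.
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

fn main() {
    // Adding threads mostly adds coherence traffic rather than throughput,
    // which matches the "higher CPU util, lower throughput" symptom.
    println!("{}", hammer_shared_counter(4, 100_000)); // prints "400000"
}
```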


This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.