Simple Rust and C# performance comparison

Hi everyone,
I'm experimenting with rewriting some C# code in Rust and comparing the performance. Some things are much faster in Rust, but this simple function comes out quite a bit slower and I'm wondering if anyone has suggestions as to why. All I'm doing is iterating through an array of f64s and summing the values at every index that is congruent to 19 mod 500. I'm surprised because my Rust code is coming in over 3x slower than the C# code.

Here's the original C# function:

public class ArrayIter
{
    double[] smallArr;

    [GlobalSetup]
    public void Setup()
    {
        smallArr = new double[50000];
        for (int i = 0; i < 50000; i++)
        {
            smallArr[i] = 5.0;
        }
    }

    [Benchmark]
    public double GetSumSmaller()
    {
        var total = 0.0;
        for (int i = 0; i < 100; i++)
        {
            total += smallArr[i * 500 + 19];
        }
        return total;
    }
}

and here are the two attempts I made at the Rust equivalent. The first is a direct translation, while the second groups the array into chunks of 500 and iterates over the chunks (note that I'm not black_boxing the index 19 because it's a fixed const in the actual code we run).

use criterion::{criterion_group, criterion_main, BatchSize, Criterion};

fn small_benchmarks(c: &mut Criterion) {
    c.bench_function("fold like c#", |b| {
        b.iter_batched(
            || vec![5.0; 50000],
            |array| {
                let mut total = 0.0;
                for i in 0..100 {
                    total += array[500 * i + 19]
                }
                total
            },
            BatchSize::LargeInput,
        )
    });
    c.bench_function("fold with iter", |b| {
        b.iter_batched(
            || vec![[5.0; 500]; 100],
            |array| array.iter().fold(0.0, |acc, group| acc + group[19]),
            BatchSize::LargeInput,
        )
    });
}

criterion_group!(benches, small_benchmarks);
criterion_main!(benches);

Here are the results of the C# benchmark:

// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.17763.1039 (1809/October2018Update/Redstone5)
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
  [Host]     : .NET Framework 4.8 (4.8.4121.0), X86 LegacyJIT
  DefaultJob : .NET Framework 4.8 (4.8.4121.0), X86 LegacyJIT


|        Method |     Mean |    Error |   StdDev |
|-------------- |---------:|---------:|---------:|
| GetSumSmaller | 81.93 ns | 1.927 ns | 1.802 ns |

and the Rust function rewrites come in around 3-4 times slower:

fold like c#            time:   [375.82 ns 393.93 ns 424.07 ns]
                        change: [-7.1434% +3.0465% +14.984%] (p = 0.61 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

fold with iter          time:   [304.10 ns 310.30 ns 316.89 ns]
                        change: [-3.7387% +1.8703% +8.8750%] (p = 0.57 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

Any ideas why there's such a big difference and what I could try next?

This was built with rustc 1.40.0 for stable-x86_64-pc-windows-msvc.

Hmm, I have a feeling that your C# code is not doing exactly what you'd expect. I might be totally wrong here, but follow this slightly unscientific argument. There are so many factors that this should be taken with a grain of salt, but the deviation from what one would expect is so large that it might give some indication of what's happening:

The bottleneck should be memory access, I think. You're only doing 100 very simple additions, which is really fast; memory access on a large array, however, is not nearly as fast.

Even if all the data fit in the L1 cache, a hit should still take around a nanosecond to actually access the data. At 100 accesses you could probably achieve 85 ns per test iteration and still be in the range one would expect. However, the array you create is roughly 400 KB, which is way too large for the L1 cache, and the access pattern is such that you're touching parts of the whole array on each test run.

So no matter what, you'll have to hit L2 and L3, which should put you in the range of 200 ns or more for 100 accesses. This seems to be in line with the Rust results, but I think it's unrealistic to get down to the sub-100 ns range the way it's written.
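To make that arithmetic concrete, here is a small sketch of the footprint. The 64-byte cache line and 32 KB L1d sizes are typical Skylake figures I'm assuming, not something measured on the machine above:

```rust
// Back-of-the-envelope check of the benchmark's memory access pattern.
fn main() {
    let elem_size = std::mem::size_of::<f64>(); // 8 bytes
    let len = 50_000;
    let stride = 500;

    // The whole array is ~390 KB: far larger than a typical 32 KB L1d cache.
    let total_bytes = len * elem_size;
    println!("array size: {} KB", total_bytes / 1024); // prints 390

    // Consecutive accesses are 4000 bytes apart, so every one of the 100
    // loads lands on a different 64-byte cache line.
    let stride_bytes = stride * elem_size;
    println!(
        "stride: {} bytes ({} cache lines apart)",
        stride_bytes,
        stride_bytes / 64
    ); // prints 4000 and 62
}
```

So the 100 loads touch 100 distinct cache lines spread across ~390 KB, which is exactly the L2/L3-bound pattern described above.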

So this could be one of those cases where you're not really measuring the same thing.

I think what's happening is that you are testing closures that cannot be optimized well. For example, if you hoist the vectors to the top of the function, the closures perform much better:

Comparison between OP and hoisted
fn small_benchmarks(c: &mut Criterion) {
    let data = vec![5.0; 50000];
    let grouped_data = vec![[5.0; 500]; 100];

    c.bench_function("fold like c#", |b| {
        b.iter_batched(
            || vec![5.0; 50000],
            |array| {
                let mut total = 0.0;
                for i in 0..100 {
                    total += array[500 * i + 19]
                }
                total
            },
            BatchSize::LargeInput,
        )
    });
    c.bench_function("fold with iter", |b| {
        b.iter_batched(
            || vec![[5.0; 500]; 100],
            |array| array.iter().fold(0.0, |acc, group| acc + group[19]),
            BatchSize::LargeInput,
        )
    });

    c.bench_function("fold like c# - with ref", |b| {
        b.iter(|| {
            let array = &data;
            let mut total = 0.0;
            for i in 0..100 {
                total += array[500 * i + 19]
            }
            total
        })
    });
    c.bench_function("fold with iter - with ref", |b| {
        b.iter(|| grouped_data.iter().fold(0.0, |acc, group| acc + group[19]))
    });
}

Results:

fold like c#            time:   [982.17 ns 1.0127 us 1.0478 us]
                        change: [-7.5098% -1.6094% +4.4651%] (p = 0.61 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

fold with iter          time:   [983.82 ns 1.0049 us 1.0287 us]
                        change: [+5.4286% +9.4820% +13.988%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

fold like c# - with ref time:   [69.562 ns 69.973 ns 70.439 ns]
                        change: [+3.1982% +4.2094% +5.2589%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

fold with iter - with ref
                        time:   [76.272 ns 76.742 ns 77.235 ns]
                        change: [+0.5322% +1.4817% +2.4084%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

And you can get even better perf by using sized types. In this example, I'm using the arrayref crate to treat the slice as an array:

Comparison between hoisted and arrayref
fn fold_like_c_sharp_array(array: &[f32; 50000]) -> f32 {
    let mut total = 0.0;
    for i in 0..100 {
        total += array[500 * i + 19]
    }
    total
}

fn fold_with_iter_array(array: &[[f32; 500]; 100]) -> f32 {
    array.iter().fold(0.0, |acc, group| acc + group[19])
}

fn small_benchmarks(c: &mut Criterion) {
    let data = vec![5.0; 50000];
    let grouped_data = vec![[5.0; 500]; 100];

    c.bench_function("fold like c# - with ref", |b| {
        b.iter(|| {
            let array = &data;
            let mut total = 0.0;
            for i in 0..100 {
                total += array[500 * i + 19]
            }
            total
        })
    });
    c.bench_function("fold with iter - with ref", |b| {
        b.iter(|| grouped_data.iter().fold(0.0, |acc, group| acc + group[19]))
    });

    c.bench_function("fold like c# - with array", |b| {
        b.iter(|| fold_like_c_sharp_array(array_ref!(&data, 0, 50000)))
    });
    c.bench_function("fold with iter - with array", |b| {
        b.iter(|| fold_with_iter_array(array_ref!(&grouped_data, 0, 100)))
    });
}

Results:

fold like c# - with ref time:   [65.065 ns 65.421 ns 65.807 ns]
                        change: [-6.7325% -5.7503% -4.7641%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

fold with iter - with ref
                        time:   [73.440 ns 73.779 ns 74.158 ns]
                        change: [-4.2310% -3.4827% -2.7538%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

fold like c# - with array
                        time:   [59.666 ns 60.101 ns 60.612 ns]
                        change: [-3.5995% -2.6148% -1.6103%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

fold with iter - with array
                        time:   [53.940 ns 54.405 ns 54.947 ns]
                        change: [-7.2678% -5.6410% -4.0074%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Benchmarks run on: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

2 Likes

Ok, so I had to dig a bit deeper into this. The major reason for the initial difference is that the benchmark includes the time it takes to drop the vec![5.0; 50000]. This is not what you want. From the Criterion docs:

If the setup value implements Drop and you don't want to include the drop time in the measurement, use iter_batched_ref, otherwise use iter_batched.

Using iter_batched_ref instead of iter_batched gives the following results on my setup:

Results using `iter_batched_ref`

fold like c#            time:   [153.72 ns 158.33 ns 163.36 ns]
                       change: [-7.0119% -1.8145% +3.8423%] (p = 0.51 > 0.05)
                       No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
 2 (2.00%) high mild
 3 (3.00%) high severe

fold with iter          time:   [119.70 ns 122.70 ns 126.14 ns]
                       change: [-59.629% -54.083% -49.019%] (p = 0.00 < 0.05)
                       Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
 7 (7.00%) high mild
 1 (1.00%) high severe

fold like c# - with ref time:   [70.809 ns 71.177 ns 71.623 ns]
                       change: [-6.6368% -4.4947% -2.3495%] (p = 0.00 < 0.05)
                       Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
 7 (7.00%) high mild
 7 (7.00%) high severe

fold with iter - with ref
                       time:   [63.081 ns 63.420 ns 63.815 ns]
                       change: [-7.6800% -6.4790% -5.0084%] (p = 0.00 < 0.05)
                       Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
 1 (1.00%) high mild
 3 (3.00%) high severe

Now, I also took a look at how the code compiles when using a hoisted reference. While I can't say for sure what happens inside the Criterion benchmark, it seems to do some loop unrolling, which gave me results down to 30 ns per «iteration» on average (which, when set to 1000 iterations, might actually be less due to the optimizations). The average also varied a lot based on the number of iterations in the test run. I can't say for sure, but there seems to be something going on where it relies on the input staying exactly the same for each iteration.

However, for the comparison with C#, I believe @parasyte's benchmarks are probably what you're looking for.

In general, I would be careful drawing conclusions from micro-benchmarks like these, especially across two different benchmark frameworks. Even if they were the same, there is just so much that you don't control 100%, which makes it very difficult to explain any differences.

4 Likes

Thanks for the suggestions @cfsamson and @parasyte. It looks like the C# equivalent of iter_batched_ref has a limitation where it only works accurately on tests that take at least 100 ms. I ended up building these into somewhat larger programs, and the Rust code was about twice as fast in the final result.

3 Likes

This. For microbenchmarks we've seen things like code being 2x slower depending on whether the code ended up on one cache line or two. Not to mention all the weirdness that happens with dependencies between loop iterations, etc. See https://llvm.org/docs/CommandGuide/llvm-mca.html for some of the crazy things you need to consider for particularly small measurements -- I think it's noteworthy that there's a special tool for doing this using models of the specific processor, rather than trying to time anything.

You probably know this, but that traversal pattern is pretty bad for the cache -- it has to load a full cache line but only uses one float from it. If this is an important enough operation to benchmark, consider transposing your storage so that you'd be accessing array[500 * 19 + i] in the loop instead. (That would probably make SIMD vectorization more likely too, since adjacent items are more easily loaded into SIMD registers.)
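As a hypothetical sketch of that transposition: store the data as 500 rows of 100 values rather than 100 groups of 500, so that row 19 holds all 100 summands contiguously (names and layout here are illustrative, not from the original code):

```rust
// Transposed layout: row 19 contains the 100 values the loop wants,
// back to back in memory.
fn sum_transposed(rows: &[[f64; 100]; 500]) -> f64 {
    // 100 adjacent loads: each 64-byte cache line now supplies 8 f64s
    // instead of 1, and a contiguous slice is an easy SIMD target.
    rows[19].iter().sum()
}

fn main() {
    let rows = [[5.0_f64; 100]; 500];
    println!("{}", sum_transposed(&rows)); // prints 500
}
```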

3 Likes

Yeah, the layout is designed to be cache-friendly for how we read the data (which happens more often).

Edit: I realized afterward that this isn't so clear. In most of the code we read the data in a way where this layout uses the cache properly. But in one piece of code we have to read it transposed, and that's what I was trying to compare here.

1 Like

Wait a minute:

Your C# benchmark is running a "for" loop over an existing, initialized, array, presumably allocated before the "GetSumSmaller" is called and freed afterwards.

Meanwhile, your Rust benchmark includes allocation and initialization of that array. And presumably freeing it afterwards.

You are not comparing like for like at all.

The Rust benchmark didn't include allocation and initialization of the array - the first argument to iter_batched is the setup, which isn't timed. As @cfsamson pointed out, I should have used iter_batched_ref to make sure that the test also didn't include the drop.

But it seems like the real problem was that the C# code only does the setup once, and the JIT was able to recognize that I was operating on the same array in each iteration of the test and optimize it away. That benchmarking library only supports running the setup on each test iteration if the test takes over 100 ms.

Ah yes, silly me.

There are a bunch of lambdas in there. One cannot tell the order of execution from reading the source code.

Why do people think that kind of obfuscation is a good idea?

I expect it's because we don't have an idea for a better way to run arbitrary setup an arbitrary number of times before each timing. Are you thinking a named function would have been clearer?

Very likely.

At least I would not have been confused by it on a casual reading.

There are those who suggest code should document itself as much as possible.

They fight against programmers who hold that reducing the number of characters they have to type is a primary goal.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.