Why is this Rust code slower than C#?

mcs is the Mono C# compiler, and Mono was never really competitive with Microsoft's official .NET implementation, which was called .NET Framework. .NET Core (or .NET 5, which is the new name) is probably what you should be testing instead. This blog has some performance comparisons between .NET Framework, Mono and .NET Core, and it shows just how big the performance gap is. Note: that post is about 4 years old now, and there have been major improvements in .NET Core performance since then, while Mono has basically stagnated.

9 Likes

So, I find this pretty interesting, but I do think the Rust slowness is largely due to some choices in the design of the Rust algorithm, and to some differences between Rust's std and C#'s BCL.

So, first, my initial results:

Time for C# implementation: 15.359069
Time for RefCell implementation: 32.4133575
Time for index implementation: 19.9009394

That's running cargo run --release, for what it's worth.

I want to point out that Rust's HashMap uses SipHash-1-3, which I think may be costing performance here. C#'s Dictionary, by contrast, uses each key's GetHashCode method, so for types like these, which don't override the default implementation, the hash is unspecified but likely fast and not DoS-resistant.
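For anyone who wants to test the hashing theory, swapping the hasher is nearly a drop-in change. A minimal sketch, assuming the rustc-hash crate as a dependency (ahash would work the same way):

use std::collections::HashMap;

use rustc_hash::FxHashMap; // assumed dependency: rustc-hash = "1"

fn main() {
    // Default HashMap: SipHash-1-3, DoS-resistant but slower per hash.
    let mut sip: HashMap<String, u32> = HashMap::new();
    sip.insert("farmer".to_string(), 1);

    // Same API, but a fast non-cryptographic hasher, closer to what
    // C#'s Dictionary likely does.
    let mut fx: FxHashMap<String, u32> = FxHashMap::default();
    fx.insert("farmer".to_string(), 1);

    assert_eq!(sip["farmer"], fx["farmer"]);
}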

Otherwise, I think there are some wins in using the types more idiomatically, as suggested in this thread already. By switching from Weak to Rc in the RefCell implementation, I get:

Time for RefCell implementation: 22.1388974

That's quite a savings: nearly 33%.
We're still slower than C#, though. Why? I suspect the allocations may be an issue, as a few people have mentioned. I think we can see that in @2e71828's hash-consuming implementation. Using that on my machine (to have an apples-to-apples comparison), I get:

Time for consuming hash implementation: 16.8432205

Which gets us close to parity here. But I noticed one other thing while testing that. Your implementation doesn't scope each test separately, so the old results aren't dropped. This imposed a significant performance penalty on my machine because of the size of the data we're working with: each of these tests consumes multiple GB. After scoping the tests so each one's results are dropped before the next runs, I got the consuming-hash number above, and an even better result for the index-based option. (Before that change, the consuming version took 39.6 seconds, I think largely because of allocator pressure.)

Time for index implementation: 14.7979173
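For reference, the scoping fix is small. A minimal sketch, where run stands in for any of the grouping functions from earlier in the thread:

use std::time::Instant;

// Each result lives only inside `timed`, so it is freed before the
// next test allocates its own multi-GB output.
fn timed<T>(label: &str, run: impl FnOnce() -> T) {
    let start = Instant::now();
    let result = run();
    println!("Time for {}: {}", label, start.elapsed().as_secs_f64());
    drop(result); // explicit for emphasis; it would be dropped here anyway
}

Called as, for example, timed("index implementation", || group_by_occupation_index(&people)), where group_by_occupation_index is whatever the index-based implementation is named in your code.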

And I think this says a bit more about the problem with reduced-code benchmarks. It's really difficult to write something small and be confident in what exactly you're testing. Is this just an allocator stress test? Or a HashMap stress test? Would changing the hashing algorithm change the performance dramatically? Or could we do better by reorganizing the shape of the data before we do this intensive step?

4 Likes

Allocations and moving values; i.e., you might be benchmarking memory bandwidth.

I guess in C# everything is reference counted, so a fairer comparison might be using Vec<Rc<Person>> or even Vec<Box<Person>> instead of Vec<Person>. Moving a Person out of the vector would be more expensive than moving an Rc<Person> (or Box<Person>), especially if the Person struct were bigger than it is now.
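A quick sketch of what that change looks like, with Person matching the struct from earlier in the thread:

use std::rc::Rc;

struct Person {
    occupation: String,
    id: u32,
}

fn main() {
    // Each element is a single pointer; moving or cloning it never
    // touches Person's body, roughly like copying an object reference
    // in C#.
    let people: Vec<Rc<Person>> = (0..4u32)
        .map(|i| Rc::new(Person { occupation: i.to_string(), id: i }))
        .collect();

    let shared: Vec<Rc<Person>> = people.iter().cloned().collect();
    assert_eq!(shared[2].id, people[2].id);
}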

Ah ha, whilst you were busy writing that I was busy installing the dotnet-sdk-3.1 from MS itself to my Debian installation under the WSL for Windows 10.

You are right. It's much better. I pasted the C# we were given here into a new console project:

$ dotnet new console --output person
The template "Console Application" was created successfully.
...
$ cp person.cs person/Program.cs
$ dotnet build person --configuration Release
Microsoft (R) Build Engine version 16.7.0+7fb82e5b2 for .NET
Copyright (C) Microsoft Corporation. All rights reserved.
...
Build succeeded.
    0 Warning(s)
    0 Error(s)
$ ./person/bin/Release/netcoreapp3.1/person
Time for C# implementation: 0.3089026
$ cargo run --release
    Finished release [optimized] target(s) in 0.02s
     Running `target/release/junk`
Time for RefCell implementation: 0.2701993
Time for index implementation: 0.2872397
$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/junk`
Time for RefCell implementation: 1.4717617
Time for index implementation: 1.0966824

Conclusion:

Rust is a bit faster than C# in release builds: ~0.28 seconds vs 0.3 seconds.

Rust only slows down by a factor of roughly 4 or 5 in the debug build.

Still, none of this correlates with our OP's timings, apparently.

Exactly - the C# List<T> will be dealing with pointers to the class instances, and each instance's data will be a pointer to the string, plus the int inline. Box would be a closer approximation, from a performance perspective, since C# uses a tracing GC (and not reference counting).

The other issue that someone pointed out upthread is that the GC is likely able to make a sizable allocation and then allocate several objects (perhaps hundreds or thousands) in that space, while in Rust, each Box or Rc is going to call into the allocator. It's likely this is responsible for the improved performance when swapping in a different allocator, because that allocator can request larger blocks from the system allocator and has its own code for managing those blocks, similar to the likely scenario with the GC.
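For reference, swapping the global allocator is only a few lines. A sketch assuming the mimalloc crate as a dependency (jemalloc via tikv-jemallocator works the same way):

use mimalloc::MiMalloc; // assumed dependency: mimalloc = "0.1"

// Every Box, Rc, Vec and HashMap allocation now goes through mimalloc,
// which carves objects out of large blocks it manages itself, similar
// to the GC scenario described above.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    let v: Vec<u32> = (0..1_000).collect();
    assert_eq!(v.len(), 1_000);
}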

I think it's a very interesting optimization problem, for sure, and helps reveal the trade-offs we make in using one language or another. Here, in Rust, we can control and have to worry about allocation count, and size, and memory layout, and object lifetime. In C#, we have little or no control over allocation count or size, and less control over memory layout in an instance like this. The obvious implementation in Rust may end up slower as a result. Or more memory hungry. But we have explicit control over that, and we can rewrite to change it. In C#, there's often an obvious way to do things, and only a few less-obvious ways that might trade time for space or vice-versa. It's much less up to us to make those decisions.

1 Like

Especially for short-running programs where the GC never actually bothers to free them. I wouldn't be surprised if leaking the Vec would be faster in Rust for this reason...
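Something like this would test it (a hypothetical sketch; the nested Vec stands in for one of the benchmark results):

fn main() {
    // Stand-in for one of the multi-GB benchmark results.
    let groups = vec![vec![0u32; 1_000]; 1_000];

    // Skip Drop entirely: no per-Vec free() calls; the OS reclaims the
    // memory at process exit, much like a GC that never gets to run.
    std::mem::forget(groups);
}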

It makes sense. I totally forgot about that. Benchmarking GC languages is hard, because the benchmark may never include a single deallocation, while the Rust version pays for every one. One would have to turn off garbage collection in C# and leak all memory in Rust to get a pure performance comparison of the algorithm.

Could someone check how big the overhead of RefCell is? Replacing the call to borrow_mut() with as_ptr() and converting the returned *mut T to &mut T via unsafe might yield a noticeable difference in performance. I already have a safe alternative in mind.
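Concretely, the experiment would look something like this sketch, which is sound only while no other borrow of the cell is active:

use std::cell::RefCell;

// Bypass RefCell's runtime borrow counter via its raw pointer.
// Safety: the caller must guarantee no other borrow exists.
fn push_unchecked(cell: &RefCell<Vec<u32>>, value: u32) {
    let v = unsafe { &mut *cell.as_ptr() };
    v.push(value);
}

fn main() {
    let cell = RefCell::new(Vec::new());
    push_unchecked(&cell, 42);
    assert_eq!(*cell.borrow(), vec![42]);
}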

Somewhat OT question about C# -- when you assign a Person instance from one List to another, does C# duplicate/clone the instance, or does it share a single instance between both containers (so modifying the value in one will modify it in the other)? Or something in between, like copy-on-write? (Assuming both lists are still active.)

In C# every object (defined with the class keyword) is a reference type. You can think of every variable as containing an Rc pointing to the class instance somewhere on the heap. When you assign instances from one list to the other, it's just a pointer copy.

There's no copy-on-write or cloning, just shared mutable state.
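A rough Rust analogue of those semantics, for anyone mapping it back (a sketch using Rc<RefCell<_>> to stand in for C# references):

use std::cell::RefCell;
use std::rc::Rc;

fn main() {
    // Both "lists" hold handles to the same heap object.
    let person = Rc::new(RefCell::new(String::from("farmer")));
    let list_a = vec![Rc::clone(&person)];
    let list_b = vec![Rc::clone(&person)];

    // Mutating through one handle is visible through the other.
    *list_a[0].borrow_mut() = String::from("baker");
    assert_eq!(*list_b[0].borrow(), "baker");
}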

1 Like

To add to that: structs in C#, by contrast, are Copy. (And they're always zero-initialized, with no way to prevent default construction or change what it does, because new StructType[5] always has to work.)

I've never seen copy-on-write in C#, since (other than the indexing operator) there's no way to distinguish read-context from write-context.

2 Likes

I've found an implementation which generates the same groupings, but takes advantage of the input's runtime layout and data access patterns to be approximately 2.5x faster than @2e71828's best implementation.

Time for RefCell implementation: 0.325071017
Time for index implementation: 0.338226771
Time for consuming hashmap implementation: 0.144807899
Time for consuming btree implementation: 0.168977785

Time for Michael's implementation: 0.067688073

And the implementation:

fn group_by_occupation_michael(people: &Vec<Person>) -> Vec<OccupationGroup<'_>> {
    let mut groups = Vec::with_capacity(people.len());

    let (first, rest) = people.split_first().unwrap();

    let other_groups = rest.iter().map(|p| OccupationGroup {
        occupation: p.occupation.as_str(),
        people: vec![p],
    });
    groups.extend(other_groups);

    for group in &mut groups {
        if group.occupation == first.occupation {
            group.people.push(first);
        }
    }

    groups
}

(playground)

Can you figure out which flaw I've exploited? Bonus points for explaining why it shows that the results for all benchmarks done in this thread are invalid and not applicable to the real world.

2 Likes

Hmm... things get interesting when I scale up from 400_000 persons to 4_000_000:

$ cargo run --release
   Compiling junk v0.1.0 (/mnt/c/Users/michael/conveqs/junk)
    Finished release [optimized] target(s) in 1.56s
     Running `target/release/junk`
Time for RefCell implementation: 3.4880406
Time for index implementation: 3.5521364
$ dotnet run --project  person --configuration Release
Time for C# implementation: 2.9456524
$

C# has crept ahead.

Sadly, at the full 40_000_000 persons, the C# hangs up my machine for so long I have to kill it, and the Rust dies with a segfault.

As far as I can tell this program is creating a new occupation for every person and assigning it to them. Forty million occupations and only one person pursuing each one. Seems kind of redundant.

2 Likes

Yes, good point, I was just typing something along these lines. 🙂 I'm wondering how much of a difference any of the provided optimisations make for non-unique occupations. We might be optimising for a case that'd never happen in the real world.

Haha, yeah, of course the real-world application would have more people pursuing the same occupation.
I figured assigning random occupations to each person out of a pool would make the times vary wildly between runs, so I preferred having one occupation per person (plus one extra person for the first occupation, to check the algorithm's validity).
Do you think it's that inapplicable? I believe that shared occupations would reduce times, but this is a repeatable worst-case scenario.

Yeah, it makes all the benchmarking pointless. When each person has their own occupation, you are pretty much just comparing the difference between C#'s Dictionary and Rust's BTreeMap when doing a failed lookup and insert, which is totally different in terms of branches taken and memory effects from having (for example) 40 million people spread over 1000 occupations.

If I knew there were only 100 occupations then I'd probably drop the Dictionary approach and just use a Vec<OccupationGroup> because modern processors eat linear scans for breakfast.
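A sketch of that linear-scan idea, assuming Person and OccupationGroup match the definitions used earlier in the thread:

struct Person {
    occupation: String,
    id: u32,
}

struct OccupationGroup<'a> {
    occupation: &'a str,
    people: Vec<&'a Person>,
}

fn group_linear(people: &[Person]) -> Vec<OccupationGroup<'_>> {
    let mut groups: Vec<OccupationGroup> = Vec::new();
    for p in people {
        // With only ~100-1000 occupations, this scan stays in cache
        // and skips hashing every key.
        match groups.iter().position(|g| g.occupation == p.occupation) {
            Some(i) => groups[i].people.push(p),
            None => groups.push(OccupationGroup {
                occupation: p.occupation.as_str(),
                people: vec![p],
            }),
        }
    }
    groups
}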

To that end I thought having a reasonable number of occupations, say 1000, would be more realistic, and it might also allow the programs to run on my machine at the full 40 million people. So I made a little change. This:

        people.push(Person{occupation: i.to_string(), id: i});

to this:

        people.push(Person{occupation: (i%1_000).to_string(), id: i});

With the results like so:

$ dotnet run --project  person --configuration Release
Time for C# implementation: 4.6866592
$ cargo run --release
   Compiling junk v0.1.0 (/mnt/c/Users/michael/conveqs/junk)
    Finished release [optimized] target(s) in 1.09s
     Running `target/release/junk`
Time for RefCell implementation: 2.321482
Time for index implementation: 2.2244976

Rust wins by a factor of more than two. Case closed 🙂

No, really: I'm not prepared to believe Rust could ever be 10 times slower than C#.

1 Like

Again - this isn't the actual C# code I'm trying to write - in it, it's more likely I'll get 1,000,000 different people (that have more data than just an ID), with every occupation having 1-100 people.
So there'll still be thousands of occupations.

Yeah, I get where you're coming from. My point is more general in that when performing benchmarks your results are only as good as the data you put in and the way the benchmark is written.

If you want to see how long it'll take to group every object in your dataset to see which implementation is faster then test it against your real dataset (or a well distributed sample).

In the same way I can convince people that chocolate causes weight loss, biases in the way you derive your results may lead you to confidently stride down the wrong path and waste time/money.

2 Likes

I think this got a bit buried between comments. To elaborate on the safe alternative, I had the qcell crate in mind:

The LCell type in particular looks promising. I've never had a use for the crate myself, but I always found it quite interesting.
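From memory, using LCell looks roughly like this (a sketch, with the API details assumed from the qcell docs rather than tested):

use qcell::{LCell, LCellOwner}; // assumed dependency: qcell

fn main() {
    // Borrows are checked at compile time against the owner; there is
    // no runtime borrow flag like RefCell's.
    LCellOwner::scope(|mut owner| {
        let cell = LCell::new(vec![1u32, 2, 3]);
        owner.rw(&cell).push(4);
        assert_eq!(owner.ro(&cell).len(), 4);
    });
}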

I checked both using RefCell::as_ptr and removing the RefCell altogether in favor of Rc::as_ptr.

RefCell::as_ptr had no noticeable effect at all in my test run, and using Rc::as_ptr reduced the time to 29 seconds instead of 31.

Pretty negligible - especially when this sacrifices some safety.

1 Like