Why is this Rust code slower than C#?

I'm a beginner, and to learn how to use Rust, I decided to try and convert some code in my project from C# to Rust, and see the performance benefits.
I have the following sample C# code:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class Person
{
    public string Occupation;
    public int Id;
}

class OccupationGroup
{
    public string Occupation;
    public List<Person> People;
}

class Program
{
    static List<OccupationGroup> GroupByOccupation(List<Person> people)
    {
        var occupations = new List<OccupationGroup>();
        var occupationMap = new Dictionary<string, OccupationGroup>();

        foreach (Person person in people)
        {
            if (!occupationMap.TryGetValue(person.Occupation, out var occupation))
            {
                occupation = new OccupationGroup
                {
                    People = new List<Person>(),
                    Occupation = person.Occupation
                };

                occupations.Add(occupation);
                occupationMap.Add(person.Occupation, occupation);
            }

            occupation.People.Add(person);
        }

        return occupations;
    }

    static void Main(string[] args)
    {
        var people = new List<Person>();

        people.Add(new Person() { Occupation = "1", Id = -9 });

        for (int i = 0; i < 40_000_000; i++)
        {
            people.Add(new Person() { Occupation = i.ToString(), Id = i });
        }

        var stopwatch = Stopwatch.StartNew();
        var result = GroupByOccupation(people);

        Console.WriteLine("Time for C# implementation: {0}", stopwatch.Elapsed.TotalSeconds);
    }
}

When trying to convert to Rust, I ran into the problem that each OccupationGroup has more than one owner - the occupations list and the values of occupationMap.

I tried two different methods to solve it - one uses Rc<RefCell<OccupationGroup>> and Weak<RefCell<OccupationGroup>>, and the other maps each occupation string to an index into the occupations list (since once an element is added, it isn't altered and will stay in its place).

Here are both solutions:

use std::time::Instant;
use std::rc::Weak;
use std::rc::Rc;
use std::cell::RefCell;
use std::collections::HashMap;

#[derive(Debug, Eq, PartialEq)]
struct Person {
    occupation: String,
    id: isize
}

#[derive(Debug, Eq, PartialEq)]
struct OccupationGroup<'a> {
    occupation: &'a str,
    people: Vec<&'a Person>
}

fn group_by_occupation_using_refcell(people: &Vec<Person>) -> Vec<OccupationGroup> {
    let mut occupations: Vec<Rc<RefCell<OccupationGroup>>> = vec![];
    let mut occupation_map: HashMap<&str, Weak<RefCell<OccupationGroup>>> = HashMap::new();

    for person in people {
        let occupation_group = occupation_map.entry(person.occupation.as_str()).or_insert_with(|| {
            let new_occupation_refcell = Rc::new(RefCell::new(OccupationGroup{
                people: vec![],
                occupation: &person.occupation
            }));
            let weak_ref = Rc::downgrade(&new_occupation_refcell);
            occupations.push(new_occupation_refcell);

            weak_ref
        });

        occupation_group.upgrade().unwrap().borrow_mut().people.push(person);
    }

    occupations.into_iter().map(|x| Rc::try_unwrap(x).unwrap().into_inner()).collect()
}

fn group_by_occupation_using_indices(people: &Vec<Person>) -> Vec<OccupationGroup> {
    let mut occupations: Vec<OccupationGroup> = vec![];

    // map to index instead of ref
    let mut occupation_map: HashMap<&str, usize> = HashMap::new();

    for person in people {
        let occupation_index = *occupation_map.entry(person.occupation.as_str()).or_insert_with(|| {
            let new_occupation = OccupationGroup{
                people: vec![],
                occupation: &person.occupation
            };
            occupations.push(new_occupation);

            occupations.len() - 1
        });

        occupations[occupation_index].people.push(person);
    }

    occupations
}

pub fn main() {
    let mut people = vec![
        Person{occupation: "1".to_string(), id: -9}
    ];

    for i in 0..40_000_000 {
        people.push(Person{occupation: i.to_string(), id: i});
    }

    let now = Instant::now();
    let _result = group_by_occupation_using_refcell(&people);
    println!("Time for RefCell implementation: {}", now.elapsed().as_secs_f64());


    let now = Instant::now();
    let _result = group_by_occupation_using_indices(&people);
    println!("Time for index implementation: {}", now.elapsed().as_secs_f64());
}

The outputs were as follows:

Time for C# implementation: 35.5179199
Time for RefCell implementation: 53.7812735
Time for index implementation: 62.7572571

I can't understand how come the C# code ran twice as fast.
I assume I'm doing something wrong, but I can't figure out what, so I'd love some help.

Thanks!


When someone asks "why is Rust slow?", the first answer is almost always "did you run it with --release?" So... did you run it with --release? The difference can be an order of magnitude in some cases.


Running it on the playground with 400,000 inputs (100x fewer - the playground kills your process otherwise) gave me the following:

Time for RefCell implementation: 0.329919269
Time for index implementation: 0.367979916

Multiplying by 100 gives

32.9919269
36.7979916

Which is about the same as C#... Not bad for a commodity computer in the cloud.

Also keep in mind that C# has a pretty good JIT, so I'd expect the two languages to take roughly the same amount of time. Sure, we could squeeze more performance out of the Rust implementation without much difficulty but it's nice to compare directly equivalent implementations in both languages.


Don't use Weak if you aren't dealing with cyclic references. All it does is decrease performance, because it incurs an extra check for the upgrade operation. Maybe the compiler is able to optimize the check away, but I wouldn't count on that.
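
For reference, here is a minimal sketch (mine, not from the thread) of the RefCell variant with Rc on both sides instead of Weak. The only extra step is dropping the map before unwrapping, so each group is back to a single strong reference:

fn group_by_occupation_using_rc(people: &Vec<Person>) -> Vec<OccupationGroup> {
    let mut occupations: Vec<Rc<RefCell<OccupationGroup>>> = vec![];
    let mut occupation_map: HashMap<&str, Rc<RefCell<OccupationGroup>>> = HashMap::new();

    for person in people {
        let occupation_group = occupation_map
            .entry(person.occupation.as_str())
            .or_insert_with(|| {
                let new_group = Rc::new(RefCell::new(OccupationGroup {
                    people: vec![],
                    occupation: &person.occupation,
                }));
                // Keep a second strong reference in the ordered Vec.
                occupations.push(Rc::clone(&new_group));
                new_group
            });

        // No upgrade() needed: the map already holds a strong reference.
        occupation_group.borrow_mut().people.push(person);
    }

    // Drop the map so each group has exactly one strong reference left,
    // which lets Rc::try_unwrap succeed below.
    drop(occupation_map);
    occupations
        .into_iter()
        .map(|x| Rc::try_unwrap(x).unwrap().into_inner())
        .collect()
}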


Nah, Rust is much faster. On my old Intel x86-64 box:

$ mcs person.cs -optimize
person.cs(59,13): warning CS0219: The variable `result' is assigned but its value is never used
Compilation succeeded - 1 warning(s)
$ ./person.exe
Time for C# implementation: 1.9223967
$ cargo run  --release
    Finished release [optimized] target(s) in 0.02s
     Running `target/release/junk`
Time for RefCell implementation: 0.2398109
Time for index implementation: 0.242346
$

I changed the 40_000_000 to 400_000 because:

  1. It fails with an out-of-memory error at such a large number on this machine.
  2. The mcs compiler complains that you have a bug in your code:
person.cs(52,25): warning CS0652: A comparison between a constant and a variable is useless. The constant is out of the range of the variable type `int'

Where line 52 is using that 40_000_000:

 for (int i = 0; i < 40_000_000; i++)

Which I find odd, because a C# int should go to 2 billion. But I know nothing of C#.


Could you demonstrate some ways of optimizing my code? I'm pretty new at Rust...


I did. It was 10x slower in debug mode.


Something does not add up here.

Both your Rust implementations are about the same speed, and both of them are 10 times faster than the C# version as presented (see above). Given that you are new to Rust, I might suggest you should be overjoyed at getting 10 times the performance for little effort rather than worrying about how to optimize what you have further. There is no doubt much more to learn that you could spend your time on.

Of course one should be sure one is measuring performance correctly before worrying about optimization. There is something wrong with your claims, given the code you presented and the results I get for it.

I tried cranking that constant up to 4_000_000.

$ cargo run  --release
   Compiling junk v0.1.0 (/mnt/c/Users/michael/conveqs/junk)
    Finished release [optimized] target(s) in 1.11s
     Running `target/release/junk`
Time for RefCell implementation: 3.4296604
Time for index implementation: 3.4996294

But that caused the C# version to hang up my machine for ten minutes before I killed it so that I could write this post.

What's up with that "constant out of range" warning in the C# anyway?

P.S. Sorry for the Rust code being called "junk" above. It's just the name I give to all kind of little experiments that I know I will not want to keep. Kind of like "temp".


I think they meant that they already ran the benchmark in release mode and that the code is 10× slower in debug mode, i.e. C# is still faster for them. I can't say I'm 100% certain, though; the reply was a bit ambiguous.

Yea. I'm confused too.

Anyway, even in debug mode Rust matches the speed of C# here:

$ mcs person.cs -optimize
person.cs(59,13): warning CS0219: The variable `result' is assigned but its value is never used
Compilation succeeded - 1 warning(s)
$ ./person.exe
Time for C# implementation: 1.9185807
$ mcs person.cs
person.cs(59,13): warning CS0219: The variable `result' is assigned but its value is never used
Compilation succeeded - 1 warning(s)
$ ./person.exe
Time for C# implementation: 1.9203245
$ cargo run  --release
    Finished release [optimized] target(s) in 0.02s
     Running `target/release/junk`
Time for RefCell implementation: 0.2427744
Time for index implementation: 0.244792
$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/junk`
Time for RefCell implementation: 1.3872268
Time for index implementation: 1.0179447
$

Something does not add up in what we are being told here.

Yeah, that's what I meant. The debug version was ~20x slower than C# and the release one was 2x slower.

Try adding the file .cargo/config with the following contents:

[build]
rustflags = ["-C", "target-cpu=native"]

I imagine C# might optimize the loop to use vector instructions, which would explain why the Rust code runs slower for you. 40 million people gives C#'s JIT a lot of time to optimize, after all.

I use Visual Studio for compiling (I'm running on Windows), so it could be a different underlying implementation (even though int should be 32-bit regardless, so that warning is weird to me).

This didn't seem to improve anything much.

However, it turns out I had left the arrays from the previous results in memory - 40,000,000-element arrays seem to make it hard for the program to allocate more memory (after 4 tests, VS Code crashed).
So I dropped the result after each test, and it turns out the index implementation runs in less time than C#:

Time for no-weak implementation: 39.1573103
Time for index implementation: 24.5929696
Time for RefCell implementation: 58.675045

I also tried not using Weak like you suggested and it seemed to work pretty well.

I have another question then:
Could it be better for me to just return the Vec<Rc<RefCell<OccupationGroup>>> instead of iterating over the whole array and unwrapping the Rc and RefCell, even though I'd rather not have them, because it's faster?

What's the usual way to handle a situation like this? Or does it depend on the rest of the program?

A wild guess, without benchmarking -- from what I've heard, C# has a good implementation of containers, and my suspicion is that calling push() repeatedly is where the Rust implementation of your program is losing time, because of memory (re)allocations.

Instead of vec![] you could try calling Vec::with_capacity() as a quick solution, playing with different values and seeing how that affects the outcome.
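
For example, a minimal sketch of what that would look like in group_by_occupation_using_indices (the capacities are guesses to tune, not benchmarked values):

// Pre-size the containers so repeated push() calls cause fewer reallocations.
// In this benchmark nearly every person has a unique occupation, so the number
// of groups approaches 40 million.
let mut occupations: Vec<OccupationGroup> = Vec::with_capacity(40_000_001);
let mut occupation_map: HashMap<&str, usize> = HashMap::with_capacity(40_000_001);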

Another choice is to re-design the algorithm so that you have a Vec<&OccupationGroup> for each Person (to avoid calling push 40_000_000 times), or make the &OccupationGroup a field in Person (probably as Option<&OccupationGroup>). There are other options, of course.


I extended @Michael-F-Bryan's playground with some implementations that consume the Map to fill the Vec instead of using shared ownership, and they show a significant speedup:

Time for RefCell implementation: 0.322939948
Time for index implementation: 0.391787643
Time for consuming hashmap implementation: 0.19059109
Time for consuming btree implementation: 0.164383473

You could also change the function signature to return impl Iterator<Item = OccupationGroup<'a>>, which would avoid the Vec allocation entirely if this is used to drive a for loop.
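
A rough sketch of what that could look like (my own, not part of the playground below):

fn group_by_occupation_iter<'a>(people: &'a Vec<Person>) -> impl Iterator<Item = OccupationGroup<'a>> {
    let mut occupation_map: HashMap<&str, OccupationGroup> = HashMap::new();

    for person in people {
        occupation_map
            .entry(person.occupation.as_str())
            .or_insert_with(|| OccupationGroup {
                people: vec![],
                occupation: &person.occupation,
            })
            .people
            .push(person);
    }

    // Yield the groups straight out of the map, without collecting into a Vec.
    occupation_map.into_iter().map(|(_, v)| v)
}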


New implementations:

fn group_by_occupation_consuming_hash(people: &Vec<Person>) -> Vec<OccupationGroup> {
    let mut occupation_map: HashMap<&str, OccupationGroup> = HashMap::new();
    
    for person in people {
        let occupation_group = occupation_map
            .entry(person.occupation.as_str())
            .or_insert_with(|| {
                OccupationGroup{
                    people: vec![],
                    occupation: &person.occupation
                }
            });
        occupation_group.people.push(person);
    }
    occupation_map.into_iter().map(|(_,v)| v).collect()
}

fn group_by_occupation_consuming_btree(people: &Vec<Person>) -> Vec<OccupationGroup> {
    use std::collections::BTreeMap;
    let mut occupation_map: BTreeMap<&str, OccupationGroup> = BTreeMap::new();
    
    for person in people {
        let occupation_group = occupation_map
            .entry(person.occupation.as_str())
            .or_insert_with(|| {
                OccupationGroup{
                    people: vec![],
                    occupation: &person.occupation
                }
            });
        occupation_group.people.push(person);
    }
    occupation_map.into_iter().map(|(_,v)| v).collect()
}

(Playground)

This is a simplified algorithm for the case I actually want, and in it, multiple occupations could point to the same OccupationGroup (and modify it).
Would shared ownership be required in that case?

No, as long as you only want to return each occupation group once, and can generate the key for the group from the occupation string: they're all stored in the map and are freely editable until you call into_iter. At that point, the map has been consumed and the only way to access the groups is to work through the iterator.
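
A minimal sketch of that idea (mine, not from the thread), with a hypothetical group_key function standing in for however you derive the group's key from the occupation string:

// Hypothetical key function: several occupation strings map to one group key.
// The real mapping depends on your data; this one just strips leading zeros.
fn group_key(occupation: &str) -> &str {
    occupation.trim_start_matches('0')
}

fn group_by_key(people: &Vec<Person>) -> Vec<OccupationGroup> {
    let mut groups: HashMap<&str, OccupationGroup> = HashMap::new();

    for person in people {
        groups
            .entry(group_key(&person.occupation))
            .or_insert_with(|| OccupationGroup {
                people: vec![],
                // The group keeps the first occupation string seen for this key.
                occupation: &person.occupation,
            })
            .people
            .push(person);
    }

    // Every group is freely editable up to this point; into_iter consumes the map.
    groups.into_iter().map(|(_, v)| v).collect()
}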

I didn't read your code super closely, but running it with a different allocator gives the 2x improvement that you couldn't find an explanation for.

Try running it with mimalloc: https://github.com/purpleprotocol/mimalloc_rust
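
Setting it as the global allocator is just a couple of lines (a sketch; the crate version below is an assumption):

// In Cargo.toml: mimalloc = "0.1"
use mimalloc::MiMalloc;

// Replace the default allocator for the whole program with mimalloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;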

It probably comes down to the many small allocations you are doing.

By default, Rust uses the system's default allocator, which is not that great on Windows. Also, GC'd languages work pretty well for programs that make many small allocations, because they preallocate a big chunk of memory and then use that.

And that is what I don't understand, because when I run the code you presented, the Rust version is 10 times faster than the C# one in the release build. It is also faster than C#, by a bit, in the debug build.

I have never found "target-cpu=native" to make much difference. Certainly not a factor of 10. But then I have a pretty old PC.

Any C# heads out there who can confirm, or otherwise, these results?
