How to improve performance to 2 seconds?

~/hello_cargo$ time cargo build --release
~/hello_cargo$ time ./target/release/hello_cargo 
100

real	0m3.384s
user	0m3.296s
sys	0m0.088s

main.rs code:

use std::collections::HashMap;

fn main() {
    let mut i = 0;
    let mut m = HashMap::new();
    while i < 10000000 {
        let index = i % 1000000;

        let mut key = String::from("a");
        key.push_str(&index.to_string());

        if let Some(value) = m.get(&key) {
            m.insert(key, value + 10);
        } else {
            m.insert(key, 10);
        }

        i += 1;
    }

    match m.get(&String::from("a999999")) {
        Some(&number) => println!("{}", number),
        _ => println!("Don't have number."),
    }
}

main.go code:

package main

import (
	"fmt"
	"strconv"
)

func main() {
	m := make(map[string]int)

	for i := 0; i < 10000000; i++ {
		index := i % 1000000
		m["a"+strconv.Itoa(index)] += 10
	}

	fmt.Println(m["a999999"])
}
~$ time ./main
100

real	0m2.529s
user	0m2.799s
sys	0m0.064s
1 Like

Please put backtick fences around your code so it formats properly.

```
fn main() {
    // Code goes here
}
```

Thanks for reminding

Some opportunities for speeding up your code:

  • push_str can re-allocate if it exceeds the capacity of the String, so allocate the key up front with the space you need.
  • index.to_string() allocates, but then you discard the string.
  • Consider using format! to ensure that only one allocation happens for each key. (This might resolve both of this above issues).
  • Look into the .entry() interface on HashMap to avoid indexing into the hash map twice.

Here's the direct translation of your Go code to Rust. The printing code becomes somewhat verbose to match with the Go's map indexing semantics.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3a809dd584e67555a53d23fae811a9f2

You can improve the performance of this code by replacing ::new() with ::with_capacity(1000000), though it may not that significant.

Significant performance improvements, thanks. Is there room for improvement in performance?

~/hello_cargo$ time ./target/release/hello_cargo 
100

real	0m2.609s
user	0m2.565s
sys	0m0.045s

main.rs code:

use std::collections::HashMap;

fn main() {
    let mut m = HashMap::<String, isize>::new();
    
    for i in 0..10000000 {
        let index = i % 1000000;(
        *m.entry(format!("a{}", index)).or_default()) += 10;
    }
    
    println!("{}", m.get("a999999").copied().unwrap_or_default())
}

~/hello_cargo$ time ./target/release/hello_cargo 
100

real	0m2.475s
user	0m2.435s
sys	0m0.040s
use std::collections::HashMap;

fn main() {
    let mut m = HashMap::<String, isize>::with_capacity(1000000);
    
    for i in 0..10000000 {
        let index = i % 1000000;(
        *m.entry(format!("a{}", index)).or_default()) += 10;
    }
    
    println!("{}", m.get("a999999").copied().unwrap_or_default())
}

~$ time ./main
100

real	0m2.177s
user	0m2.442s
sys	0m0.064s
package main

import (
	"fmt"
	"strconv"
)

func main() {
	m := make(map[string]int, 1000000)

	for i := 0; i < 10000000; i++ {
		index := i % 1000000
		m["a"+strconv.Itoa(index)] += 10
	}

	fmt.Println(m["a999999"])
}

Another observation: Your current code creates each key 10 times. Create each one once, outside the loop, and use those.

You might want to try different hashers, as the default secure hasher is not the fastest. The std implementation is built on the hashbrown crate, and if you use that directly you'll get a faster ahash instead. There are more hasher options on crates.io too: fnv, fxhash, xxhash, etc.

1 Like

It looks like that Go default hasher uses FNV algorithm. As @cuviper said, different hashing algorithm have different pros and cons.

Probably for the best comparison you should use one of the available implementations.

use std::fmt::Write;

fn main() {
    // Faster hasher
    // Use Box<str> to avoid storing capacity
    // use a size small enough for the values
    let mut m = fxhash::FxHashMap::<Box<str>, u32>::with_capacity_and_hasher(1000000, Default::default());
    // Reuse one string allocation across loop iterations
    let mut k = String::with_capacity(30);

    for i in 0..10000000 {
        let index = i % 1000000;
        // clear + write! is like an in-place format!
        k.clear();
        let _ = write!(k, "a{}", index);
        // avoids allocating the string 9 out of 10 times
        if let Some(v) = m.get_mut(k.as_str()) {
            *v += 10;
        } else {
            m.insert(k.clone().into_boxed_str(), 10);
        }
    }

    println!("{}", m.get("a999999").copied().unwrap_or_default())
}

I haven't used this (yet) but could be interesting to see how these compare when run through flamegraph: https://github.com/ferrous-systems/flamegraph/blob/master/README.md

FWIW, testing with hyperfine on a Skylake machine, the "boxed str" part produces no statistically significant performance improvement, so I would suggest this simplification:

use std::fmt::Write;

fn main() {
    let mut m = fxhash::FxHashMap::with_capacity_and_hasher(
        1_000_000,
        Default::default(),
    );
    let mut k = String::with_capacity(30);

    for i in 0..10_000_000 {
        k.clear();
        let _ = write!(k, "a{}", i % 1_000_000);

        if let Some(value) = m.get_mut(&k) {
            *value += 10;
        } else {
            m.insert(k.clone(), 10);
        }
    }

    println!("{}", m.get("a999999").unwrap_or(&0));
}

On my machine this is 1.55x faster than the Rust original. (Obviously I can't compare my times with the OPs in an apples-to-apples comparison.)

(Edit: and 1.36x faster than the Go program.)

1 Like

cool!

~/hello_cargo$ time ./target/release/hello_cargo
100

real 0m1.794s
user 0m1.762s
sys 0m0.032s

1 Like

Thank you for your reply.

~/hello_cargo$ time ./target/release/hello_cargo
100

real 0m1.828s
user 0m1.788s
sys 0m0.041s

1 Like

I've tried it on Kaby Lake and saw a significant difference. The working set with Box should be smaller, so whether there's a visible difference it depends on whether the hashmap fits in your cache.

1 Like

What a nice rabbit hole to fall into!

My first thought was – the data is accessed in ordered way, so let's try (an ordered) BTreeMap instead of hashmap. Turned out it wasn't enough to beat @kornel solution. So I thought – the strings are always small, let's use some kind of small string optimization – and grabbed an arraystring crate. Faster, but still not there.

So for a fun experiment decided to increase the dataset size ten times – BtreeMap solution finished in 46 seconds and hashmap needed 10 minutes. I guess the main conclusion from that is that CPU caches are insanely big these days. But if your Value gets more complex the cache effect would be more visible.

BTreeMap:

use std::collections::BTreeMap;
use std::fmt::Write;

fn main() {
    // Since we're accessing elements in ordered way, let's use an ordered collection.
    let mut m: BTreeMap<Box<str>, u32> = BTreeMap::new();
    // Reuse one string allocation across loop iterations
    let mut k = String::with_capacity(30);

    for i in 0..10000000 {
        let index = i % 1000000;
        // clear + write! is like an in-place format!
        k.clear();
        let _ = write!(k, "a{}", index);
        // avoids allocating the string 9 out of 10 times
        if let Some(v) = m.get_mut(k.as_str()) {
            *v += 10;
        } else {
            m.insert(k.clone().into_boxed_str(), 10);
        }
    }

    println!("{}", m.get("a999999").copied().unwrap_or_default())
}

BTreeMap + arraystring

use std::collections::BTreeMap;
use std::fmt::Write;

use arraystring::{ArrayString, typenum::U15};

fn main() {
    // Since we're accessing elements in ordered way, let's use an ordered collection.
    let mut m: BTreeMap<ArrayString<U15>, u32> = BTreeMap::new();

    for i in 0..10000000 {
        let index = i % 1000000;
        let mut k = ArrayString::new();
        k.clear();
        write!(k, "a{}", index).unwrap();
        *m.entry(k).or_default() += 10;
    }

    println!("{}", m.get("a999999").copied().unwrap_or_default())
}

Then I decided to try rolling my own specialized arraystring. Seems to beat kornel's solution slightly. But hashmap also gets faster with custom arraystring, so that leaves a tie, at least on my CPU. Gist

5 Likes