Why is this code much slower than Java?

In below, the method place_pieces() takes 12+ seconds, but the Java codes with the same logics of exactly sovling the same problem only takes 3.7 seconds, although it outputs the same result as Java version does. I try to pinpoint the bottleneck, so I measured the time each block of codes takes in the core method. What I found is that all blocks are quite time-consuming except for t2 - t1. I don't understand why this is happening and how to improve, please help me out.

One weird thing is that all println!() statements basically print time dif larger than zero in VSCode using cargo run command, but in the command line(Windows 10) they all basically print zero, but the total time taken by place_pieces() method doesn't appear much different after commenting out the four println!() statements in the method place_a_piece(). What the hell is going on?

    fn place_pieces(&mut self, config:&Config2) -> Vec<StateInShape>{
        let mut states_in_shape:Vec<StateInShape>=Vec::new();
        let mut ids:HashSet<&str>=HashSet::with_capacity(90000);
        let mut remaining_piece_ids:Vec<u8>=Vec::new();
        let mut cells_to_occupy:Vec<usize>=Vec::with_capacity(4);
        for piece_num in &self.piece_ids_sorted_by_free_degree {
            remaining_piece_ids.push(*piece_num);
        }
        let mut state_in_shape=StateInShape{
            id:"",
            offsets:[33;10],
            cells:[33;20],
        };

        state_in_shape.id(self, None);
        ids.insert(&state_in_shape.id);
        states_in_shape.push(state_in_shape);
        return self.place_pieces_recursion(states_in_shape, &mut ids, &mut remaining_piece_ids, config, &mut cells_to_occupy);
    
    }

    fn place_pieces_recursion(&mut self, mut states_in_shape:Vec<StateInShape>, ids:&mut HashSet<&str>, remaining_piece_ids:&mut Vec<u8>, config:&Config2, cells_to_occupy:&mut Vec<usize>) -> Vec<StateInShape>{
        println!("Remaining pieces {0:?}, size of states_in_shape {1}", remaining_piece_ids, states_in_shape.len());
        if remaining_piece_ids.len()==0 {
            return states_in_shape;
        }

        let mut temp_states:Vec<StateInShape>=Vec::with_capacity(90000);
        let piece_num=remaining_piece_ids.remove(0);

        let t11=common::current_milliseconds();
        for state in states_in_shape.iter_mut() {
            self.place_a_piece(&mut temp_states, state, piece_num, ids, cells_to_occupy);
        }
        let t22=common::current_milliseconds();
        println!("place_a_piece t22 - t11 = {}", t22 - t11);
        return self.place_pieces_recursion(temp_states, ids, remaining_piece_ids, config, cells_to_occupy);
    }

    
    fn place_a_piece(&self, temp_states:&mut Vec<StateInShape>, state:&mut StateInShape, piece_id_to_place:u8, ids:&mut HashSet<&str>, cells_to_occupy:&mut Vec<usize>) {
        let piece=*self.map_piece_id_to_piece.get(&piece_id_to_place).unwrap();
        for offset in 0..state.cells.len() {
            let t1=common::current_milliseconds();
            // Check if it is a blank cell
            let piece_id=state.cells[offset];
            if piece_id != 33 { 
                continue;
            }

            // This piece cannot be placed at 'offset' due to crossing border
            if !self.map_piece_id_and_offset_to_occupied.contains_key(&(piece_id_to_place, offset as u8)){
                continue;
            }
            let t2=common::current_milliseconds();
            println!("t2 - t1 = {}", t2 - t1);

            // Check if the cells which will be occupied by this piece are all blank
            let mut can_be_placed=true;
            cells_to_occupy.clear();
            for occupied_cell_num in self.map_piece_id_and_offset_to_occupied.get(&(piece_id_to_place, offset as u8)).unwrap() {
                if state.cells[*occupied_cell_num as usize] !=33 {
                    can_be_placed=false;
                    break;
                }
                cells_to_occupy.push(*occupied_cell_num as usize);
            }
            if !can_be_placed {
                continue;
            }
            let t3=common::current_milliseconds();
            println!("t3 - t2 = {}", t3 - t2);

            // Make a new ShapeInState.id prio to spawning a new ShapeInState to avoid duplicated ShapeInState.id
            let mut cells=state.cells.clone();
            for cell in cells_to_occupy.iter_mut() {
                cells[*cell]=piece.id;
            }
            let new_id=state.id(self, Some(&cells));
            if ids.contains(new_id) {
                continue;
            }
            let t4=common::current_milliseconds();
            println!("t4 - t3 = {}", t4 - t3);
            
            // Everything is OK, now spwan a new StateInShape
            let mut new_state_in_shape=state.clone();
            new_state_in_shape.offsets[piece.id as usize]=offset as u8;
            for cell in cells_to_occupy.iter_mut() {
                new_state_in_shape.cells[*cell]=piece.id;
            }
            new_state_in_shape.id(self, None);
            ids.insert(&new_state_in_shape.id);

            // Put a new StateInShape into temp_states to be preared for recursion
            temp_states.push(new_state_in_shape);
            let t5=common::current_milliseconds();
            println!("t5 - t4 = {}", t5 - t4);

        }
    }
2 Likes

The usual check: are you benchmarking with cargo run --release? This enables compiler optimizations and usually makes Rust code much faster.

It would be helpful if you could post or link to self-contained code that we can compile and run ourselves.

8 Likes

The second thing is - if the code heavily depends on performance of hash tables, then use https://lib.rs/fxhash or https://lib.rs/ahash. The std implementation is tuned for security, and not maximally fast.

5 Likes

Thank you, I did use --release.

2 Likes

I doubt the perf of Rust's std::collections::HashMap, so I did a side-by-side experiment in comparing Rust and Java, here is the Rust codes and result:

use std::collections::HashMap;
use crate::utils::{common};

pub fn insert(){
    for _ in 0..10 {
        let mut map:HashMap<i32,i32>=HashMap::with_capacity(1000_0000);
        let t1=common::current_milliseconds();
        for key in 0..1000_0000 {
            map.insert(key, key);
        }
        let t2=common::current_milliseconds();
        println!("{}", t2 - t1);
    }
}

Result:
3655
3683
3769
3711
3722
3753
3770
3791
3738
3765

and the corresponding Java codes are here:

import java.util.HashMap;
import java.util.Map;

public class Test1 {

	public static void main(String[] args) throws Exception {
		for(int k=0; k<10; k++) {
			long s=System.currentTimeMillis();
			Map<Integer,Integer> map=new HashMap(1000_0000);
			for (int i=0; i<1000_0000; i++){
				map.put(i, i);
			}
			long e=System.currentTimeMillis();
			System.out.println((e-s));
		}
	}

}

Result:
2519
2208
2609
1882
2277
2269
2306
2181
4039
2366

This firmly proves that Java outperforms Rust by about 1.5 seconds in this specific case. This is not what I like to see.
I'll try fxhash and ahash and share the result here, thank you very much @kornel.

1 Like

Here is how ahash performs:

Result:
1516
1416
1447
1474
1435
1413
1427
1407
1408
1418

Now Rust beats Java. BTW: Even fxhash's example code doesn't work. I'll replace std HashMap with ahash to see how much it improves.

To be clear, ahash::AHashMap still uses the data structure of std::collections::HashMap, it just replaces the hash function with one that's faster (but less resistant to collision attacks).

Which example code do you mean? Ah, I think I see what you mean: in this snippet from the README

use fxhash::FxHashMap;

let mut hashmap = FxHashMap::new();

hashmap.insert("black", 0);
hashmap.insert("white", 255);

the method FxHashMap::new doesn't actually exist -- you have to use FxHashMap::default() instead. This is fixed on master but there hasn't been a new release since it landed.

5 Likes

Besides fxhash there is rustc_hash. It looks as if they are both basically the same only by different authors. Is this observation correct?

Got it, I'll try it out, thanks.

Here is the code with FxHashMap:

pub fn insert5(){
    for _ in 0..10 {
        let mut map:FxHashMap<i32,i32>=FxHashMap::with_capacity_and_hasher(1000_0000, Default::default());
        let t1=common::current_milliseconds();
        for key in 0..1000_0000 {
            map.insert(key, key);
        }
        let t2=common::current_milliseconds();
        println!("{}", t2 - t1);
    }
}

Result:
2083
2012
2063
2045
2012
2056
2463
2297
2327
2240

Slower than ahash.

That's strange, on my pc fxhash is slightly faster than ahash

std:
1.4304774s
1.3715067s
1.3561414s
1.3745776s
1.3528149s
1.3513092s
1.3578463s
1.3583693s
1.3554816s
1.3570335s
ahash:
781.24ms
782.9202ms
787.7524ms
774.7169ms
773.9259ms
777.8067ms
789.8408ms
788.1501ms
773.1895ms
778.6049ms
fxhash:
700.1331ms
696.4585ms
694.5847ms
702.1735ms
695.3312ms
701.0794ms
690.4338ms
717.2114ms
705.313ms
692.9979ms

Code used:

use std::collections::{HashMap, hash_map::RandomState};
use std::hash::BuildHasher;

fn test_with<S: BuildHasher + Default>() {
    for _ in 0..10 {
        let mut map: HashMap<i32,i32,S> = HashMap::with_capacity_and_hasher(1000_0000, Default::default());
        let now = std::time::Instant::now();
        for key in 0..1000_0000 {
            map.insert(key, key);
        }
        let elapsed = now.elapsed();
        println!("{:?}", elapsed);
    }
}

fn main() {
    println!("std:");
    test_with::<RandomState>();
    println!("ahash:");
    test_with::<ahash::RandomState>();
    println!("fxhash:");
    test_with::<fxhash::FxBuildHasher>()
}

Your code spits out the followings on my machine (Windows 10, AMD CPU), ahash is still the winner with lto=true, but FxHashMap is slightly faster than ahash with lto=false, I cannot explain.

std:
2.3286657s
2.2840485s
2.2921298s
2.2676508s
2.3589571s
2.3622038s
2.3361912s
2.2391665s
2.3062776s
2.2657865s
ahash:
1.4554878s
1.4089814s
1.4092162s
1.4257934s
1.4227454s
1.4852504s
1.4317452s
1.440901s
1.3914775s
1.4550206s
fxhash:
2.0744667s
2.0255156s
2.0662903s
2.0609577s
2.1438759s
2.4481715s
2.0378225s
2.0411953s
2.06705s
2.068054s

You're right, with lto=true I also get this result, strange.

I compiled with this in Cargo.toml

[profile.release]
codegen-units = 1
lto = "fat"

got this results:

899.5525ms
894.9294ms
892.6588ms
875.7494ms
885.6989ms
889.8455ms
888.3089ms
883.6474ms
887.7721ms
870.1368ms
ahash:
803.7645ms
809.0997ms
803.7255ms
811.841ms
806.8419ms
805.8346ms
808.8616ms
791.4187ms
799.0292ms
797.8427ms
fxhash:
786.0468ms
879.6221ms
872.6719ms
871.2568ms
818.9ms
901.276ms
894.9635ms
830.2483ms
776.9295ms
816.7272ms
fnv:
749.1225ms
786.8675ms
790.1068ms
787.2803ms
802.9533ms
792.1892ms
784.7832ms
785.7408ms
795.0478ms
761.8748ms

I think, OP should try fnv crate because it is better than fxhash and ahash for values with less than 8 byte size (for x64 platform) with my options in Cargo.toml

UPD:
Also, fxhash works TWICE better if I reverse order of insertion of integers. Maybe there is an issue with collisions?

fxhash:
418.4611ms
405.8676ms
410.3778ms
427.4673ms
416.6469ms
415.0985ms
422.6911ms
426.3589ms
418.4046ms
422.3179ms

Others work in same speed.

UPD2:
I rewritten cycle this way:

        let now = std::time::Instant::now();
        let _map: HashMap<i32,i32,S> = (0..1000_0000).map(|x|(x,x)).collect();
        let elapsed = now.elapsed();

And this made fxhash and fnv much faster:

fxhash:
417.426ms
439.6707ms
430.0757ms
423.7424ms
424.9925ms
431.1489ms
433.7446ms
416.5275ms
425.89ms
431.0713ms
fnv:
455.1937ms
453.9672ms
454.8467ms
455.9637ms
452.0872ms
460.2296ms
451.9505ms
461.0552ms
452.8283ms
461.3823ms
1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.