How to improve performance when using HashMap with huge data in Rust?

I have been trying to validate Rust for one of my projects for the past month. So far it is promising, except for this one instance.

I am trying to process a large amount of data: 5000 time series read from 5000 files. I am using a HashMap as the collection and iterating over it. Due to the nature of my code, I am not able to use par_iter to improve performance. The logic below takes around 15 minutes to complete. Any suggestions on how to improve the performance? Does HashMap have some bottleneck when processing huge data? Also, should I use ndarray's Array1 rather than Vec<f32> for Ts?

use std::collections::HashMap;

type Ts = Vec<f32>;
type FileName = String;
type AllTs = HashMap<FileName, Ts>;

fn main() {
    // The function below reads from files and returns a HashMap of length 5000 (that is, 5000 files).
    // Each Ts has length 30000.
    // Ts values range from 0 to 10.
    // I am using a HashMap because in some other location of my code I filter the Ts by file name.
    let all_ts: AllTs = load_ts_from_files(); 
    let histogram_bin: Vec<f32> = calculate_bins(all_ts);
    
}

fn load_ts_from_files() -> AllTs {
    // This function reads 5000 files and load the AllTs
    return AllTs::new();
}

fn calculate_bins(all_ts: AllTs) -> Vec<f32> {
    let mut bins = vec![0f32; 10];
    all_ts.iter().for_each(|(file_name, ts)| { // This loops through 5000 times
        for d in ts { //This loops through 30000 times
            let bin = (d/10 as f32) as usize;
            bins[bin] += 1.0;
        }
    });
    return bins;
}

Playground

You can use integers here:

    let scale = 1.0; // adjust this
    let mut bins = vec![0u32; 11]; // adding one extra bin for overflows
    all_ts.iter().for_each(|(file_name, ts)| { // This loops through 5000 times
        for d in ts { //This loops through 30000 times
            let bin = ((d * scale) as usize).min(10); // now bin is at most 10
            bins[bin] += 1;
        }
    });

And don't forget to compile in --release mode.

Thank you. I missed one detail: my bins hold f32 values. The calculate function can add fractional amounts; sometimes it might be += 1.5 or += 1.1. I gave minimal code here.
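For what it's worth, a weighted variant could look like the minimal sketch below. The `weight_for` helper is purely hypothetical, a stand-in for whatever fractional value the real calculation adds:

```rust
// Minimal sketch of binning with f32 bins and fractional increments.
// `weight_for` is a hypothetical stand-in for the real weight logic
// (e.g. the += 1.5 or += 1.1 cases mentioned above).
fn weight_for(d: f32) -> f32 {
    if d > 5.0 { 1.5 } else { 1.1 }
}

fn calculate_weighted_bins(ts: &[f32], scale: f32) -> Vec<f32> {
    let mut bins = vec![0f32; 11]; // one extra bin for overflows
    for &d in ts {
        let bin = ((d * scale) as usize).min(10); // bin is at most 10
        bins[bin] += weight_for(d);
    }
    bins
}

fn main() {
    let bins = calculate_weighted_bins(&[0.5, 7.0, 9.9], 1.0);
    println!("{:?}", bins);
}
```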

Ah. f32 should be fine too.

Other things to look out for:

  • Read files with BufReader or mmap
  • Preallocate your Vecs and HashMaps if you know (roughly) their final size, using the with_capacity method. Growing them is surprisingly expensive.
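A minimal sketch of the preallocation idea, using the sizes mentioned in the question (the `load_preallocated` function name is illustrative):

```rust
use std::collections::HashMap;

// Preallocate collections to their (rough) final size so they never
// reallocate while being filled.
fn load_preallocated(n_files: usize, ts_len: usize) -> HashMap<String, Vec<f32>> {
    let mut all_ts = HashMap::with_capacity(n_files);
    for i in 0..n_files {
        let ts = vec![0.0f32; ts_len]; // a single allocation of the right size
        all_ts.insert(format!("file_{i}"), ts);
    }
    all_ts
}

fn main() {
    // The sizes from the question: 5000 files, 30000 samples each.
    let all_ts = load_preallocated(5000, 30_000);
    assert_eq!(all_ts.len(), 5000);
}
```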

Do you know if reading the files or processing them is slower? How are you reading them?

I found that processing is slower. Reading also takes time, but I am not interested in optimising it. Just now, I noticed that if I comment out the for loop in my calculate_bins, the processing time is the same. I think for_each is the bottleneck. I am changing my code to see if converting the HashMap to `Vec<(String, Ts)>` will help.

You removed the body of the loop, and it still took the same time???

Both. I removed the for_each and also the body of the inner for loop. I think something is happening because of rayon's multi-threading.
I wanted to keep the example minimal, so I didn't give the entire context. The main mentioned above is actually a function called folder(), and I do a rayon par_iter() over the folders.

I would recommend not using rayon until your code works well. Then you can add it back in to gain extra performance.
Not all problems benefit from it.
And profiling is key! Figure out what part is actually slow.

I tried profiling by adding some print statements. The issue is not with calculate_bins(); it is with the folders' par_iter(). My code is something like below:

folders.par_iter().for_each(|folder| { // folder is of type Arc<Mutex<T>>
    let a = std::time::SystemTime::now();
    calculate_bins(&folder.all_ts);
    match a.elapsed() {
        // ThreadId only implements Debug, hence {:?}
        Ok(elapsed) => println!("{:?} - Time took for calculate_bins {} secs", std::thread::current().id(), elapsed.as_secs()),
        Err(e) => println!("Error: {:?}", e)
    }
})

The output is

ThreadId(2) - Time took for calculate_bins 0 secs
ThreadId(3) - Time took for calculate_bins 0 secs
ThreadId(2) - Time took for calculate_bins 0 secs
ThreadId(3) - Time took for calculate_bins 0 secs
....... (for 300 times)

However, after printing each line, it takes some time (a couple of seconds) to print the next line. So I think rayon's par_iter is doing something time-consuming. Also, I noticed that it spawns only 2 threads.

@selvavm I suspect reading the data from disk is slow.

But I am not reading at this point. The data has already been read and stored in the HashMap by load_ts_from_files().

Here I am only processing the data

What happens without rayon?

You should use std::time::Instant instead. SystemTime can go backwards when you change the clock on your system. Instant is monotonic.
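A minimal sketch of timing with `Instant` (the `timed_sum` function and its placeholder workload are illustrative):

```rust
use std::time::{Duration, Instant};

// Time a piece of work with Instant. Unlike SystemTime, Instant is
// monotonic: it cannot go backwards if the wall clock changes, so
// elapsed() returns a Duration directly rather than a Result.
fn timed_sum(n: u64) -> (u64, Duration) {
    let start = Instant::now();
    let sum: u64 = (0..n).sum(); // placeholder for the real work
    (sum, start.elapsed())
}

fn main() {
    let (sum, took) = timed_sum(1_000_000);
    println!("sum = {}, took {:?}", sum, took);
}
```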


I figured out what causes the bottleneck. After calculate_bins(), I don't need the AllTs any more and thought of releasing it. Since it is part of a struct, I did,

folder.all_ts = AllTs::new(); // HashMap::new()

This is taking time, which is interesting 🙂

You could try mem::replace to swap that new value in, then rayon::spawn to drop the old value separately, out of sight, out of mind. Of course, if you queue up enough of those spawns it may still block other useful work on the threadpool. You could also create a separate pool for this purpose, perhaps with just a few threads.
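A minimal sketch of that idea using only the standard library; here `std::thread::spawn` stands in for `rayon::spawn` (with rayon, the drop would run on its threadpool instead), and `drop_in_background` is an illustrative name:

```rust
use std::collections::HashMap;
use std::mem;

fn drop_in_background(map: &mut HashMap<String, Vec<f32>>) {
    // Swap an empty map in, taking ownership of the old one.
    // mem::take is equivalent to mem::replace(map, HashMap::new()).
    let old = mem::take(map);
    // Drop the old map on another thread so the caller is not blocked
    // by the (potentially expensive) deallocation of all the entries.
    std::thread::spawn(move || drop(old));
}

fn main() {
    let mut all_ts: HashMap<String, Vec<f32>> = HashMap::new();
    all_ts.insert("file_0".to_string(), vec![0.0; 30_000]);
    drop_in_background(&mut all_ts);
    assert!(all_ts.is_empty()); // the field is immediately reusable
}
```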


FWIW, parallel iterators on the standard HashMap are a bit of a kludge, because rayon can't access the internals of the type to split in parallel. We actually end up collecting it to a Vec and then parallel iterating that. But at least for par_iter() that only means collecting references, so only the number of items has any effect, not the data size.

If you use the hashbrown crate directly (which std is based on), then it has a rayon feature for native parallel iterators. The indexmap crate is another alternative with a rayon feature.


a.old_map = HashMap::new() is a rather ineffective way of clearing a collection, because it drops the old map, thus deallocating the old container (often quite expensive).

This also has an unfortunate side effect: new insertions will have to reallocate the memory, which is also very expensive.

The clear method is often a better way.
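A minimal sketch of the difference (the `clear_vs_reassign` function name is illustrative; the sizes match the question):

```rust
use std::collections::HashMap;

// Returns (capacity after clear, capacity after reassignment).
fn clear_vs_reassign() -> (usize, usize) {
    let mut map: HashMap<String, Vec<f32>> = HashMap::with_capacity(5000);
    for i in 0..5000 {
        map.insert(format!("file_{i}"), vec![0.0; 8]);
    }

    // clear() drops the entries but keeps the table's allocation,
    // so refilling the map will not need to reallocate.
    map.clear();
    assert!(map.is_empty());
    let kept = map.capacity();

    // Reassigning instead frees the allocation entirely, so the next
    // insertions will have to allocate again.
    map = HashMap::new();
    let freed = map.capacity();
    (kept, freed)
}

fn main() {
    let (kept, freed) = clear_vs_reassign();
    println!("after clear: {kept}, after reassign: {freed}");
}
```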


I tried profiling by adding some print statements.

This isn't profiling at all, and is oftentimes a spectacularly bad replacement for it. I recommend using a real profiler; perf is a good way to start if you're on Linux.

https://rust-lang.github.io/packed_simd/perf-guide/prof/linux.html
https://athemathmo.github.io/2016/09/14/tools-for-profiling-rust.html


Maybe they do want to release that memory, if it really won't be used in the latter phase of the program.

I wouldn't expect the top-level deallocation to be a big deal anyway. Dropping a bunch of indirect allocations for all of the items could take a while, but that will happen with clear too.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.