I have been trying to validate Rust for one of my projects for the past month. So far it is promising, except for this one instance.
I am trying to process a huge amount of data: 5000 time series read from 5000 files, which I want to process. I am using a HashMap as the collection and iterating over it. Due to the nature of my code, I am not able to use par_iter to improve performance. The logic below takes around 15 minutes to complete. Any suggestion on how to improve the performance? I am not sure if HashMap has some bottleneck when processing large data. Also, should I use ndarray's Array1 rather than Vec<f32> for Ts?
use std::collections::HashMap;

type Ts = Vec<f32>;
type FileName = String;
type AllTs = HashMap<FileName, Ts>;

fn main() {
    // The function below reads from files and returns a HashMap of len 5000 (that is, 5000 files).
    // Each Ts has length 30000.
    // Ts values range from 0 to 10.
    // I am using a HashMap because elsewhere in my code I filter the Ts based on file name.
    let all_ts: AllTs = load_ts_from_files();
    let histogram_bin: Vec<f32> = calculate_bins(all_ts);
}

fn load_ts_from_files() -> AllTs {
    // This function reads 5000 files and loads the AllTs
    AllTs::new()
}

fn calculate_bins(all_ts: AllTs) -> Vec<f32> {
    let mut bins = vec![0f32; 10];
    all_ts.iter().for_each(|(file_name, ts)| { // This loops 5000 times
        for d in ts { // This loops 30000 times
            let bin = (d / 10 as f32) as usize;
            bins[bin] += 1.0;
        }
    });
    bins
}
let scale = 1.0; // adjust this
let mut bins = vec![0u32; 11]; // one extra bin for overflows
all_ts.iter().for_each(|(file_name, ts)| { // This loops 5000 times
    for d in ts { // This loops 30000 times
        let bin = ((d * scale) as usize).min(10); // now bin is at most 10
        bins[bin] += 1;
    }
});
Thank you. I missed one detail: my bins can hold f32 values. The calculate function can vary the value to be added; sometimes it might be += 1.5 or += 1.1. I gave minimal code here.
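For what it's worth, a weighted variant of that approach might look roughly like this (calculate_bins_weighted and the weight parameter are names I made up for illustration; this keeps the extra overflow bin from the snippet above):

```rust
use std::collections::HashMap;

type Ts = Vec<f32>;
type AllTs = HashMap<String, Ts>;

// Hypothetical weighted variant: each sample adds `weight` instead of 1.
fn calculate_bins_weighted(all_ts: &AllTs, scale: f32, weight: f32) -> Vec<f32> {
    let mut bins = vec![0f32; 11]; // one extra bin for overflows
    for ts in all_ts.values() {
        for &d in ts {
            let bin = ((d * scale) as usize).min(10); // clamp overflows into bin 10
            bins[bin] += weight;
        }
    }
    bins
}

fn main() {
    let mut all_ts = AllTs::new();
    all_ts.insert("file_a".to_string(), vec![0.2, 3.7, 9.9, 12.0]);
    let bins = calculate_bins_weighted(&all_ts, 1.0, 1.5);
    // 0.2 -> bin 0, 3.7 -> bin 3, 9.9 -> bin 9, 12.0 -> clamped to bin 10
    assert_eq!(bins[0], 1.5);
    assert_eq!(bins[10], 1.5);
}
```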
I found that processing is slower. Reading also takes time, but I am not interested in optimising that. Just now I noticed that if I comment out the for loop in my calculate_bins, the processing time is the same. I think for_each is the bottleneck. I am changing my code to see if converting the HashMap to `Vec<(String, Vec<f32>)>` will help.
Both. I removed the for loop, and I also tried removing just the body of the for loop. I think something is happening because of rayon's multi-threading.
I wanted to keep the example minimal, so I didn't give the entire context. The main mentioned above is actually called per folder, and I do rayon par_iter() on the folders.
I would recommend not using rayon until your code works well. Then you can add it back in to gain extra performance.
Not all problems benefit from it.
And profiling is key! Figure out what part is actually slow.
I tried profiling by adding some print statements. The issue is not with calculate_bins(); the issue is with the folders' par_iter(). My code is something like the below:
folders.par_iter().for_each(|folder| { // folder is of type Arc<Mutex<T>>
    let a = std::time::SystemTime::now();
    calculate_bins(&folder.all_ts);
    match a.elapsed() {
        Ok(elapsed) => println!("{:?} - Time taken for calculate_bins: {} secs", std::thread::current().id(), elapsed.as_secs()),
        Err(e) => println!("Error: {:?}", e),
    }
});
The output is
ThreadId(2) - Time took for calculate_bins 0 secs
ThreadId(3) - Time took for calculate_bins 0 secs
ThreadId(2) - Time took for calculate_bins 0 secs
ThreadId(3) - Time took for calculate_bins 0 secs
....... (for 300 times)
However, after printing each line, it takes some time (a couple of seconds) to print the next line. So I think rayon's par_iter is doing something time-consuming. Also, I noticed that it spawns only 2 threads.
I figured out what causes the bottleneck. After calculate_bins(), I don't need the AllTs anymore and thought of releasing it. Since it is part of a struct, I did:

a.old_map = HashMap::new();
You could try mem::replace to swap that new value in, then rayon::spawn to drop the old value separately, out of sight, out of mind. Of course, if you queue up enough of those spawns it may still block other useful work on the threadpool. You could also create a separate pool for this purpose, perhaps with just a few threads.
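The mem::replace idea might be sketched like this. Folder, all_ts, and release_all_ts are made-up names, and std::thread::spawn stands in for rayon::spawn so the snippet is self-contained; with rayon you'd avoid the per-call thread cost and could target a dedicated pool:

```rust
use std::collections::HashMap;
use std::mem;
use std::thread;

struct Folder {
    all_ts: HashMap<String, Vec<f32>>, // field name is illustrative
}

fn release_all_ts(folder: &mut Folder) {
    // Swap an empty map in; the old one is moved out instead of dropped here.
    let old = mem::replace(&mut folder.all_ts, HashMap::new());
    // Drop the old map off the hot path, on another thread.
    thread::spawn(move || drop(old));
}

fn main() {
    let mut folder = Folder { all_ts: HashMap::new() };
    folder.all_ts.insert("file_a".to_string(), vec![0.0; 8]);
    release_all_ts(&mut folder);
    assert!(folder.all_ts.is_empty()); // caller sees an empty map immediately
}
```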
FWIW, parallel iterators on the standard HashMap are a bit of a kludge, because rayon can't access the internals of the type to split in parallel. We actually end up collecting it to a Vec and then parallel iterating that. But at least for par_iter() that only means collecting references, so only the number of items has any effect, not the data size.
If you use the hashbrown crate directly (which std is based on), then it has a rayon feature for native parallel iterators. The indexmap crate is another alternative with a rayon feature.
a.old_map = HashMap::new() is a rather ineffective way of clearing a collection, because it drops the old map, thus deallocating the old container (often quite expensive).
This also has an unfortunate side effect: new insertions will have to reallocate the memory, which is also very expensive.
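A minimal illustration of the difference, using HashMap::clear(), which drops the entries but keeps the table's allocation for reuse:

```rust
use std::collections::HashMap;

fn main() {
    let mut map: HashMap<String, Vec<f32>> = HashMap::with_capacity(5000);
    map.insert("file_a".to_string(), vec![0.0; 4]);
    let cap_before = map.capacity();

    // clear() removes all entries but keeps the allocated memory,
    // so later insertions don't have to regrow the table from scratch.
    map.clear();
    assert!(map.is_empty());
    assert!(map.capacity() >= cap_before);

    // Reassigning (map = HashMap::new()) instead frees the old
    // allocation and starts over from capacity 0.
    map = HashMap::new();
    assert_eq!(map.capacity(), 0);
}
```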
I tried profiling by adding some print statements.
This isn't profiling at all, and it is oftentimes a spectacularly bad replacement for it. I recommend using a real profiler; perf is a good way to start if you're on Linux.
Maybe they do want to release that memory, if it really won't be used in the latter phase of the program.
I wouldn't expect the top-level deallocation to be a big deal anyway. Dropping a bunch of indirect allocations for all of the items could take a while, but that will happen with clear too.