Sorting hashmap of vectors by sum, and some more

Hey!
I'm working on a project where I consolidate some data as Hashmap<String, Vec<f32>>. All of the vectors have the same number of elements in them. I need to sort this hashmap, based on sum of the elements in the vector, and later sum of elements of the vector excluding one of the elements iteratively.

trg_map.retain(|_, v| v.iter().filter(|&x| x > &sim_cut).count() > 4); // filter hashmap 

for (_, val) in trg_map.iter_mut() {
    let sum = val.iter().sum();
    val.push(sum);
    for lineage in 0..lineage_count {
      val.push(sum-val[lineage]);
    }
}
let mut map: Vec<(&String, &Vec<f32>)> = trg_map.iter().collect();
map.sort_by(|a,b| a.1[lineage_count].partial_cmp(&b.1[lineage_count]).unwrap());
// output top n entries
// Similarly sort based on sum calculated in the following entries to the vector.

The first idea that struck me was to add these sums to the vector itself, and then sort based on that value, but I was wondering if there any better solutions to carrying this out? In specific,

  • I was wondering if there is an idiomatic way to sort without adding those values to the vector
  • Is it possible to refer a particular element in the vector as a part of the closure? i.e. aiding in calculation of sum without the whole for loop.

Thank you,
aa

Any time you want to sort by something that's non-trivial to calculate, the default answer is https://doc.rust-lang.org/std/primitive.slice.html#method.sort_by_cached_key.

That's a bit harder here, since f32: !Ord, but you can solve that with something like Noisy_float — Rust math library // Lib.rs :

.sort_by_cached_key(|p| NoisyFloat::from(p.1.sum::<f32>()))
4 Likes

Thanks a lot :slightly_smiling_face:, I hadn't looked at sort_by_cached_key in this context earlier, I'll keep this in mind! But I should be using sort_by in my case, since I need to calculate the sum of vectors to get the sum of vectors when I drop out one element at a time, right?

Sometimes it is easier to use a structure. Maybe it does not fully answer to your problem, but it might help.

use std::collections::HashMap;

// We need a lifetime for this struct because it holds some references.
// The lefetime of the struct can not be greater than the lifetime of the map
struct MapSum<'map> {
    // You should prefer &str over &String
    key: &'map str,
   // Same for &Vec<_> and &[_]
    val: &'map [f32],
   // the current value of the sum
    sum: f32,
   // index to the next val to exclude
    cursor: usize
}

impl<'map> MapSum<'map> {
    pub fn new<'a>(key: &'a str, val: &'a [f32]) -> MapSum<'a> {
        let sum = val.iter().sum();
        MapSum{key, val, sum, cursor: 0usize}
    }
    
    pub fn has_next(&self) -> bool {
        self.cursor < self.val.len()
    }
    
    // Here we update the sum by retrieving to the sum the val at the pos 'self.cursor'
    // Should not be called more than self.val.len()
    pub fn next(&mut self) {
        assert!(self.has_next());
        self.sum -= self.val[self.cursor];
        self.cursor += 1;
        // If you only need to remove one element at a time
      /*
       if self.cursor == 0 {
          self.sum -= self.val[self.cursor];
       } else {
          self.sum += self.val[self.cursor - 1];
          self.sum -= self.val[self.cursor];
      }
      self.cursor += 1;
      */
    }
}

pub fn sort_map(trg_map: &mut HashMap<String,Vec<f32>>) {
    //trg_map.retain(|_, v| v.iter().filter(|&x| x > &sim_cut).count() > 4); // filter hashmap 
    let mut map_sum: Vec<_> = trg_map.iter()
        .map(|(key, val)| MapSum::new(key, &val[..]))
        .collect();

    // now we can sort by accessing the sum field of the structs
    map_sum.sort_by(|a,b| a.sum.partial_cmp(&b.sum).unwrap() );
    
    // we update just by calling next
    map_sum.iter_mut().for_each(|m| m.next());
    // sort again
    map_sum.sort_by(|a,b| a.sum.partial_cmp(&b.sum).unwrap() );
}

Here a link to a blog post that explain the differences between String, &str. The relation is similar for Vec<_> and &[_] (slice).

2 Likes

Hey Vincent, apologies for the delayed response. This approach is interesting, so you're suggesting I convert the hashmap to a struct after I complete all my operations with the data. Interesting, I'll give it a shot.

I was also facing issues with the values being f32::NaN, and I keep filtering them in place. I guess, using this struct, I could filter them at the initialisation stage, wonderful! Thanks a lot for taking your time for this detailed response :blush:.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.