Playing with generics. Any tips for improvement?

Hi all,
I'm doing a project where I need the elements in a vector to be distributed as equally as possible. I thought this might be a nice piece of code to practice generics on, so I wrote the following code: Rust Playground
The compiler did a great job of helping me achieve the generic form of the functions and types, and I followed Clippy's advice to improve it further. Now I would like to get feedback from the pros. I have a few questions:

  1. I defined the functions to take the vectors as arguments. Is it possible to make them methods on the vector? If I remember correctly, I can only define inherent methods on types from the same crate, correct?
  2. I defined the PrioItem struct, and one function acting on it, within the distribute function, because this is the only place I need it. Is this considered good or bad style?
  3. Is there a built-in way of removing duplicates from a list, or is the function I wrote the way to go? Can it be done more efficiently?
  4. Are the tests I wrote ok, or can they be improved in some way?
  5. Any other comments?

Thanks for your review and your time. It is highly appreciated.

First comment: the cloning and the Clone bound in your remove_duplicates are unnecessary.

use std::collections::HashSet;
use std::hash::Hash;
use std::iter::zip;

fn remove_duplicates<T>(list: Vec<T>) -> Vec<T>
where T: Eq + Hash
{
    // Remember which positions hold the first occurrence of each value...
    let mut seen: HashSet<&T> = HashSet::new();
    let filter: Vec<bool> = list.iter().map(|v| seen.insert(v)).collect();
    // ...then keep only those elements, moving them out of the original Vec.
    zip(filter, list).filter_map(|(keep, val)| {
        if keep { Some(val) } else { None }
    }).collect()
}

If the elements are expensive to clone, this is potentially much more efficient.

Or, if your type supports an ordering, you might just .sort() and .dedup() the Vec.


Ok, in fact there are quite a few more inefficiencies 🙂

    /// Puts a PrioItem on the Priority heap. The position of insertion of the heap is determined by the priority value
    fn put_on_heap<K>(heap: &mut Vec::<PrioItem<K>>, item: PrioItem<K>) {
        let mut idx = heap.len();
        heap.push(item);
        while idx > 0 { 
            if heap[idx].prio > heap[idx-1].prio || (heap[idx].prio == heap[idx-1].prio && heap[idx].freq <= heap[idx-1].freq) {
                heap.swap(idx, idx-1);
            }
            idx -= 1;
        }
    }

That's not a heap, it's just a sorted vector. You're essentially doing an insertion sort, which is O(n^2). A real heap will do this much faster.
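For illustration, here is a minimal sketch (my own code, not yours) of how that comparison rule could be expressed as an Ord impl so that a real std::collections::BinaryHeap gives O(log n) insertion. PrioKey and its fields are made up to mirror the prio/freq fields above, and I believe the tie-breaking matches your swap condition, but double-check that.

use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

// Hypothetical key type mirroring the `prio`/`freq` fields used above.
struct PrioKey {
    prio: f64,
    freq: f64,
}

impl Ord for PrioKey {
    fn cmp(&self, other: &Self) -> Ordering {
        // Lowest priority first; among equal priorities, the item with the
        // higher frequency comes first (my reading of the swap condition).
        self.prio
            .total_cmp(&other.prio)
            .then(other.freq.total_cmp(&self.freq))
    }
}

impl PartialOrd for PrioKey {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for PrioKey {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for PrioKey {}

fn main() {
    // BinaryHeap is a max-heap, so wrap the keys in `Reverse` to pop the
    // lowest priority first, like `prioheap.pop()` does in the original.
    let mut heap = BinaryHeap::new();
    heap.push(Reverse(PrioKey { prio: 3.0, freq: 6.0 }));
    heap.push(Reverse(PrioKey { prio: 1.5, freq: 3.0 }));
    let Reverse(next) = heap.pop().unwrap();
    assert_eq!(next.prio, 1.5); // O(log n) per push/pop instead of O(n)
}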

    // Create a list of the Items and their count, priority, ... such that the 
    // last element of the list is the next item to be placed
    let mut prioheap = Vec::<PrioItem<T>>::new();
    for ival in remove_duplicates(list.to_vec()).iter() {
        let value = ival;
        let count = list.iter().filter(|x| *x == value).count();

Same here: you're deduplicating the elements first and then traversing the whole vector to count the occurrences of each of them, which is O(n^2). Instead, if you count the occurrences in one pass using a HashMap, it will be O(n).
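For example, a one-pass count could look something like this (just a sketch; count_occurrences is a made-up name, not something from your playground):

use std::collections::HashMap;
use std::hash::Hash;

// Count how often each element occurs, visiting the slice only once: O(n).
fn count_occurrences<T: Eq + Hash>(list: &[T]) -> HashMap<&T, usize> {
    let mut counts = HashMap::new();
    for item in list {
        *counts.entry(item).or_insert(0) += 1;
    }
    counts
}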


Let's solve one by one.
A1. Yes. Just define a trait and implement it for Vec<T>.

trait Distributable {
    fn distribute(&mut self);
}

impl<T: Clone + std::cmp::Eq + std::hash::Hash + std::fmt::Debug> Distributable for Vec<T> {
    fn distribute(&mut self) {
        #[derive(Debug, Clone)]
        struct PrioItem<K> {
            value: K, // The actual item in the list
            count: usize, // Count of items in the list
            freq: f64, // Kind of a frequency with which it will appear in the reordered list
            prio: f64, // Priority with which this item should be placed next while building the reordered list
        }   
        /// Puts a PrioItem on the Priority heap. The position of insertion of the heap is determined by the priority value
        fn put_on_heap<K>(heap: &mut Vec::<PrioItem<K>>, item: PrioItem<K>) {
            let mut idx = heap.len();
            heap.push(item);
            while idx > 0 { 
                if heap[idx].prio > heap[idx-1].prio || (heap[idx].prio == heap[idx-1].prio && heap[idx].freq <= heap[idx-1].freq) {
                    heap.swap(idx, idx-1);
                }
                idx -= 1;
            }
        }
    
        // Create a list of the Items and their count, priority, ... such that the 
        // last element of the list is the next item to be placed
        let mut prioheap = Vec::<PrioItem<T>>::new();
        for ival in remove_duplicates(self.to_vec()).iter() {
            let value = ival;
            let count = self.iter().filter(|x| *x == value).count();
            let freq = self.len() as f64 / count as f64;
            let prio = freq/2.0;
            put_on_heap(&mut prioheap,
                PrioItem {
                    value: value.clone(), count, freq, prio
                });
        }   
    
        // Empty the original list and rebuild it based on the priority heap
        self.clear();
        loop {
            match prioheap.pop() {
                Some(mut prioitem) => {
                    self.push(prioitem.value.clone());
                    prioitem.count -= 1;
                    prioitem.prio += prioitem.freq;
                    // Only put the Priority item back on the heap, if elements are left
                    if prioitem.count > 0 { 
                        put_on_heap(&mut prioheap, prioitem);
                    }
                },
                None => break
            }
        }   
    }
}

#[test]
fn distribution() {
    let mut list = vec![0,0,1,1,2,2];
    list.distribute();
    assert_eq!(list, vec![0,1,2,0,1,2]);

    let mut list = vec![0,0,0,0,1,1,1,2,2,3];
    list.distribute();
    assert_eq!(list, vec![0,1,2,0,3,1,0,2,1,0]);

    let mut list = vec![0,0,0,0,0,0,1,1,1,1,1,1,1,2,2,0,0,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,6,6,6,7,7,7,8,8,9,9,10,10,11,12,13,14];
    list.distribute();
    assert_eq!(list, vec![0,1,2,3,4,5,6,7,0,1,10,8,9,2,3,4,0,1,5,2,0,13,14,11,12,6,7,4,3,1,0,2,5,1,0,4,3,10,8,9,2,1,0,7,6,5,4,3,2,1,0]);
}

A2. I think it's good style to keep the details away from the user.
A3. Maybe you can try this:

fn remove_duplicates<T>(mut list: Vec<T>) -> Vec<T> where T: Ord {
    list.sort();
    list.dedup();
    list
}

Updated:
A4 & A5: There were some inefficiencies, which @jeanas already mentioned, and the test cases in distribution are too complicated to follow. I don't think there is exactly one correct answer for distributing a vector most equally. Perhaps you could provide more details about what you want to do.

I'm doing a project where I need the elements in a vector to be most equally distributed.


Here's a rewrite of your code: Rust Playground

A few notes:

  • Whether you define something as a method (using an extension trait, as @LebranceBW explained) is up to you. (I didn't bother here and just made a function.)

  • It's generally not a very good idea to use floats in algorithms that work on discrete data, because you can be bitten by numerical imprecision. In this case, they aren't needed (see the small integer-fraction sketch at the end of this post).

  • I've added an Ord bound on the type of the elements of the vector because it made the code quite a bit simpler, but you could adapt the code to work with a Hash bound instead.

  • Improving the complexity of your algorithm almost always improves the performance of your program by a great deal. I did a quick test with a vector constructed like this:

    let mut v = vec![];
    for i in 1..3000 {
        for _ in 0..i {
            v.push(i);
        }
    }

Your implementation took 42s on my machine, while my implementation took 0.5s 🙂

Note, though, that they don't yield exactly the same results because the ordering chosen for elements with the same number of occurrences is different.

(I didn't try to understand according to what metric the elements are "the most spread out" in your algorithm, I just implemented something that is [I hope] an equivalent algorithm.)
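(About the float remark above: here is a tiny sketch of how a priority that is conceptually a fraction can be compared exactly using integers, assuming the cross products fit in the integer type; cmp_fraction is only an illustration, not part of the playground code.)

use std::cmp::Ordering;

/// Compare a.0/a.1 with b.0/b.1 exactly, with no floating-point rounding
/// (assumes the cross products fit in u64).
fn cmp_fraction(a: (u64, u64), b: (u64, u64)) -> Ordering {
    (a.0 * b.1).cmp(&(b.0 * a.1))
}

#[test]
fn fractions_compare_exactly() {
    assert_eq!(cmp_fraction((1, 3), (2, 6)), Ordering::Equal); // 1/3 == 2/6
    assert_eq!(cmp_fraction((1, 3), (1, 2)), Ordering::Less);  // 1/3 <  1/2
}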


Thank you for your quick answers.

Can you elaborate on why it is unnecessary? I thought that if I did not clone list and used sort().dedup(), then ownership would be transferred and I would lose my original list.

I agree that it is not a heap. The author from whom I took the algorithm called it a heap, and I kept the name. For the algorithm to work, though, I need to insert the element at a specific position, and insertion sort seemed like a good choice because the rest of the list is already sorted. It's not like I re-sort it completely every time. Therefore a specialized sort seemed more fitting than simply calling an available sort method, which has more overhead on already sorted lists.

Yes, this is a good idea. Thank you.

Thank you. I always forget that I can implement my own traits for existing types, which lets me add methods to them.

My goal is to take a vector containing multiple elements of a few values and then reorder them such that "clusters" of the same value are minimized. To quantify it you could say: take a value x and compute the distance to the next occurrence of x (periodicity of the vector is assumed for this). Minimize the variance of all x-to-x distances. E.g. [0,0,1,1,2,2] has a high variance of distances between like values, while [0,1,2,0,1,2] has that variance minimized: the values are distributed most equally over the vector.
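For what it's worth, here is a rough sketch of that metric as I understand it (my own interpretation, not code from the playground). Treating the vector as a ring, it gives a variance of 4.0 for [0,0,1,1,2,2] and 0.0 for [0,1,2,0,1,2]:

use std::collections::HashMap;
use std::hash::Hash;

// Variance of the gaps between consecutive occurrences of the same value,
// wrapping around the end of the vector (i.e. treating it as a ring).
fn gap_variance<T: Eq + Hash>(list: &[T]) -> f64 {
    let mut positions: HashMap<&T, Vec<usize>> = HashMap::new();
    for (i, v) in list.iter().enumerate() {
        positions.entry(v).or_default().push(i);
    }
    let mut gaps = Vec::new();
    for pos in positions.values() {
        for (i, &p) in pos.iter().enumerate() {
            let next = pos[(i + 1) % pos.len()];
            // Distance to the next occurrence, wrapping past the end.
            let gap = if next > p { next - p } else { list.len() + next - p };
            gaps.push(gap as f64);
        }
    }
    if gaps.is_empty() {
        return 0.0;
    }
    let mean = gaps.iter().sum::<f64>() / gaps.len() as f64;
    gaps.iter().map(|g| (g - mean).powi(2)).sum::<f64>() / gaps.len() as f64
}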

In the implementation you sent, you are already losing ownership of the original list.

fn remove_duplicates<T>(list: Vec<T>) -> Vec<T> where T: Clone + std::cmp::Eq + std::hash::Hash + std::fmt::Debug {
    let mut seen = HashSet::new();
    let mut uniques = list.clone();
    uniques.retain(|c| seen.insert(c.clone()));
    uniques
}

Since this function takes a Vec<T> (as opposed to &Vec<T> or &mut Vec<T> or &[T]), it takes ownership of list.
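If the goal is to keep the original list intact without cloning it at the call site, one option (a sketch, not what your playground does) is to borrow a slice instead:

use std::collections::HashSet;
use std::hash::Hash;

// Takes a slice, so the caller keeps ownership; only the unique elements
// are cloned into the result.
fn remove_duplicates<T: Clone + Eq + Hash>(list: &[T]) -> Vec<T> {
    let mut seen = HashSet::new();
    let mut uniques = Vec::new();
    for item in list {
        if seen.insert(item) {
            uniques.push(item.clone());
        }
    }
    uniques
}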

Thank you.
I will study your solution in detail (and hopefully understand it).

Yes, you are correct. I thought I had put a reference to list in the function definition; I must have forgotten. Thank you for pointing it out. Then I wonder why it still works: I use list afterwards and the borrow checker did not complain.

This doesn't really work. Suppose you want to insert an element into a vector of n elements that is already sorted. For simplicity, assume you need to insert it in the middle. Then you will have to shift all n/2 elements to its right. That's why you still have an O(n) cost per insertion (O(n^2) overall). This is a fundamental limitation of the vector data structure, where elements are stored contiguously in memory.
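Concretely (a small illustration, not code from this thread): even if you find the position in O(log n) with binary_search, Vec::insert still has to shift the tail:

fn main() {
    // Finding the insertion point is cheap (O(log n)), but the insert itself
    // moves every element after `idx` one slot to the right: O(n) per insertion.
    let mut sorted = vec![1, 3, 5, 7, 9];
    let idx = sorted.binary_search(&4).unwrap_or_else(|i| i);
    sorted.insert(idx, 4);
    assert_eq!(sorted, vec![1, 3, 4, 5, 7, 9]);
}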

You wrote remove_duplicates(list.to_vec()), which is equivalent to remove_duplicates(list.clone()).

That was due to a hint from the compiler, probably caused by my faulty function definition. I used it without thinking about why the compiler would suggest it. Now I know. Thank you.

Based on @jeanas's work, I wrote down my solution here. By the way, does [0, 1, 2, 3, 0, 1, 2, 0, 1, 0] have the same variance of distances as [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]? It depends on whether you treat the vector as a ring or just as an array (which influences the distance calculation).

use std::collections::BTreeMap;

fn occurrences<T>(vec: Vec<T>) -> BTreeMap<T, usize>
    where
        T: Eq + Ord,
{
    let mut res = BTreeMap::new();
    for elt in vec {
        *res.entry(elt).or_insert(0) += 1;
    }
    res
}

pub fn distribute<T>(vec: Vec<T>) -> Vec<T>
    where
        T: Eq + Ord + Clone,
{
    let mut tree = occurrences(vec);
    let mut res = vec![];
    while !tree.is_empty() {
        tree.retain(|k, count| {
            if *count == 0 {
                return false;
            }
            *count -= 1;
            res.push(k.clone());
            true
        });
    }
    res
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_occurrences() {
        let list = vec![0, 1, 2, 1, 2, 2];
        let occs = occurrences(list);
        assert_eq!(occs, BTreeMap::from([(0, 1), (1, 2), (2, 3)]));
    }
    #[test]
    fn distribution() {
        let list = vec![0, 0, 0];
        assert_eq!(distribute(list), vec![0, 0, 0]);

        let list = vec![0, 0, 1, 1, 2, 2];
        assert_eq!(distribute(list), vec![0, 1, 2, 0, 1, 2]);

        let list = vec![0, 0, 0, 0, 1, 1, 1, 2, 2, 3];
        // [0, 1, 2, 3, 0, 1, 2, 0, 1, 0]
        assert_eq!(distribute(list), vec![0, 1, 2, 3, 0, 1, 2, 0, 1, 0]);
    }
}

Updated: Simple is good

