C++ std::set_difference equivalent in Rust

std::set_difference can calculate difference of two sorted collections (e.g., vectors). Rust can do set difference via HashSet but it is not convenient nor efficient if users have sorted vectors in the first place. Has there been discussions on this before?

2 Likes

To clarify, you're looking for an algorithm that can take either a sorted slice or a sorted iterator, and produce - probably an iterator - of Either<T,T>, right?

Not quickly seeing it in the stdlib; it should be fairly easy to write for someone who knows the Iterator boilerplate. Making it robust to unsorted streams is a little harder.

If you need an efficient sorted set, don't use sorted vectors with ad-hoc algorithms; use BTreeSet instead.

In Rust, STL-like ad-hoc algorithms are usually implemented on the Iterator trait. I can't find a sorted difference there, though.

In this Playground, I have prepared a naïve implementation, which assumes a sorted sequence and fused iterators:

struct SortedDiff<T: Iterator<Item = W>, U, W> {
    fst: Peekable<T>,
    snd: U,
    next_sub: Option<W>,
}

impl<T, U, W> Iterator for SortedDiff<T, U, W>
    where
        T: Iterator<Item = W>,
        U: Iterator<Item = W>,
        W: Ord, // or maybe `PartialOrd`, depends on what you want to allow
{
    type Item = T::Item;
    
    fn next(&mut self) -> Option<Self::Item> {
        // only keep first one of runs of equal values in the diminuend
        let dim = self.fst.next()?;
        
        while let Some(next) = self.fst.peek() {
            if dim < *next { break }
            self.fst.next();
        }
        
        // Get next subtrahend. Once exhausted, simply
        // yield remaining elements of the diminuend.
        let sub = match self.next_sub.as_ref() {
            None => return Some(dim),
            Some(sub) => sub,
        };

        if dim < *sub {
            // Diminuend stricty less than subtrahend --
            // they are not equal, so yield the diminuend
            Some(dim)
        } else if dim == *sub {
            // Diminuend equal to current subtrahend --
            // skip this diminuend and look for the next one
            self.next()
        } else {
            // Diminuend greater than current subtrahend --
            // we need to advance to the next subtrahend
            // not greater than the diminuend.
            loop {
                self.next_sub = self.snd.next();
                
                match self.next_sub.as_ref() {
                    None => break Some(dim), // exhausted subtrahend
                    Some(sub) if dim <  *sub => break Some(dim),
                    Some(sub) if dim == *sub => break self.next(),
                    _ => continue,
                }
            }
        }
    }
}
1 Like

I'm a fan of sorted vectors and parallel iterators (more often than not I need to create the collection just once so sorted vectors are perfect) and I was looking for something similar but in the end I implemented my own SortedVec newtype with a couple of convenience methods. I didn't use iterators (my convenience rather than a design choice) but a simple callback function to generate results, something like

impl<E> SortedVec<E> {
    pub fn map_difference<F>(other: &SortedVec<E>, callback: F)
        where F: Fn(&E) {
        ...
    }
}

I feel like this is overcomplicated. For instance, here's two simplified versions:

This one keeps the duplicates from the first iterator like std::set_difference does (playground link):

struct SortedDiff<T, U: Iterator> {
    it1: T,
    it2: Peekable<U>,
}

impl<T, U, W> Iterator for SortedDiff<T, U>
    where
        T: Iterator<Item = W>,
        U: Iterator<Item = W>,
        W: Ord,
{
    type Item = W;
    
    fn next(&mut self) -> Option<Self::Item> {
        while let Some(elm1) = self.it1.next() {
            'inner: loop {
                match self.it2.peek().map(|elm2| elm1.cmp(elm2)) {
                    None => return Some(elm1),
                    Some(Ordering::Less) => return Some(elm1),
                    Some(Ordering::Equal) => break 'inner,
                    Some(Ordering::Greater) => { self.it2.next(); },
                }
            }
        }
        
        None
    }
}

This one removes duplicates in the first iterator like your implementation does (playground link):

struct SortedDiff<T: Iterator, U: Iterator> {
    it1: Peekable<T>,
    it2: Peekable<U>,
}

impl<T, U, W> Iterator for SortedDiff<T, U>
    where
        T: Iterator<Item = W>,
        U: Iterator<Item = W>,
        W: Ord,
{
    type Item = W;
    
    fn next(&mut self) -> Option<Self::Item> {
        'outer: while let Some(elm1) = self.it1.next() {
            if Some(&elm1) == self.it1.peek() {
                continue 'outer;
            }
        
            'inner: loop {
                match self.it2.peek().map(|elm2| elm1.cmp(elm2)) {
                    None => return Some(elm1),
                    Some(Ordering::Less) => return Some(elm1),
                    Some(Ordering::Equal) => break 'inner,
                    Some(Ordering::Greater) => { self.it2.next(); },
                }
            }
        }
        
        None
    }
}

Also, I don't think these solutions assume the iterator is fused.

Here’s a solution using filter and find_map:

use std::cmp::Ordering;

fn sorted_difference<T, U>(left: T, right: U) -> impl Iterator<Item = T::Item>
where
    T: IntoIterator,
    U: IntoIterator<Item = T::Item>,
    T::Item: Ord,
{
    let left = left.into_iter();
    let mut right = right.into_iter();

    let mut right_next = right.next();

    left.filter(move |elem| {
        let check = |other: T::Item| -> Option<(T::Item, bool)> {
            match elem.cmp(&other) {
                Ordering::Less => Some((other, true)),
                Ordering::Equal => Some((other, false)),
                Ordering::Greater => None,
            }
        };
        if let Some((new_right_next, result)) = right_next
            .take()
            .and_then(|o| check(o).or_else(|| right.find_map(check)))
        {
            right_next = Some(new_right_next);
            result
        } else {
            true
        }
    })
}

Edit: Slightly rewritten

use std::cmp::Ordering::*;

fn sorted_difference<T, U>(left: T, right: U) -> impl Iterator<Item = T::Item>
where
    T: IntoIterator,
    U: IntoIterator<Item = T::Item>,
    T::Item: Ord,
{
    let mut right = right.into_iter();
    let mut r_nx = right.next();

    left.into_iter().filter(move |l| {
        let f = |r| match l.cmp(&r) {
            Less => Some((r, true)),
            Equal => Some((r, false)),
            Greater => None,
        };
        r_nx.take()
            .and_then(|r| f(r).or_else(|| right.find_map(f)))
            .map_or(true, |(new_r_nx, keep)| {
                r_nx = Some(new_r_nx);
                keep
            })
    })
}
1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.