[Solved] Merge multiple sorted vectors using iterators


#1

“Shepmaster” published a very elegant implementation of an merging iterator MergeAscending.
It takes to iterators as input and outputs something like
MergeAscending<std::slice::Iter<'_, &_>, std::slice::Iter<'_, _>> : std::iter::Iterator

Here the souce code:

use std::iter::Peekable;
use std::cmp::Ordering;

struct MergeAscending<L, R>
    where L: Iterator<Item = R::Item>,
          R: Iterator,
{
    left: Peekable<L>,
    right: Peekable<R>,
}

impl<L, R> MergeAscending<L, R>
    where L: Iterator<Item = R::Item>,
          R: Iterator,
{
    fn new(left: L, right: R) -> Self {
        MergeAscending {
            left: left.peekable(),
            right: right.peekable(),
        }
    }
}

impl<L, R> Iterator for MergeAscending<L, R>
    where L: Iterator<Item = R::Item>,
          R: Iterator,
          L::Item: Ord,
{
    type Item = L::Item;

    fn next(&mut self) -> Option<L::Item> {
        let which = match (self.left.peek(), self.right.peek()) {
            (Some(l), Some(r)) => Some(l.cmp(r)),
            (Some(_), None)    => Some(Ordering::Less),
            (None, Some(_))    => Some(Ordering::Greater),
            (None, None)       => None,
        };

        match which {
            Some(Ordering::Less)    => self.left.next(),
            Some(Ordering::Equal)   => self.left.next(),
            Some(Ordering::Greater) => self.right.next(),
            None                    => None,
        }
    }
}




fn main() {
    let left  = [1, 3, 5, 7, 9];
    let right = [3, 4, 5, 6, 7];

    let result: Vec<_> = MergeAscending::new(left.iter(), right.iter()).collect();
    println!("{:?}", result);

I would like to use this code to merge more then one vector. The following works:

fn main() {
let vv: Vec<Vec<_>> = vec![vec![1, 3, 5, 7, 9], vec![3, 4, 5, 6, 7],vec![0, 2, 6, 8]];

let a = MergeAscending::new(vv[0].iter(),vv[1].iter());
let b = MergeAscending::new(a, vv[2].iter());
let result: Vec<_> = b.collect();

println!("{:?}", result);
}

Unfortunately a is of type :
MergeAscending<std::slice::Iter<'_, _>, std::slice::Iter<'_, _>>
which is different from the type of b which is
MergeAscending<MergeAscending<std::slice::Iter<'_, _>, std::slice::Iter<'_, _>>, std::slice::Iter<'_, _>>.
I suppose this is why the following does not work:

fn main() {
    let vv: Vec<Vec<_>> = vec![vec![1, 3, 5, 7, 9], vec![3, 4, 5, 6, 7],vec![0, 2, 6, 8]];

    let ma =  MergeAscending::new(vv[0].iter(),vv[1].iter());    
        for v in vv.iter().skip(2) {
            ma = MergeAscending::new(ma, v.iter()); //error: mismatched types
        };

        let result: Vec<_> = ma.collect();
}

Another not working approach:

fn main() {
    let vv: Vec<Vec<_>> = vec![vec![1, 3, 5, 7, 9], vec![3, 4, 5, 6, 7],vec![0, 2, 6, 8]];

    let mut result:Vec<_> = MergeAscending::new(vv[0].iter(),vv[1].iter()).collect() ; 
        for v in vv.iter().skip(2) {
            result = MergeAscending::new(r.iter(), v.iter()).collect::<Vec<_>>();
    }
}

How can one fix this? What do you think about memory efficiency and performance of this approach?
Better ideas?


#2

The performance can be asymptotically improved if you used heap for merging, like Python’s heapq does. Current approach is O(N * k), the heap will bring this down to O(N * log(k)), where N is the total length of iterables and k is the number of iterables.

And to make this actually work in general case, you can use trait objects. A less general, but more efficient approach would be to create a merging macro: merge!(xs, ys, zs).


#3

Thank you for your hint about complexity. These nested iterators are lazy and I have difficulties to imagine how and in which order comparison will happen. Anyway my main concern is not performance (yet) I need to get this work.

In addition I do not know at compile time how may vectors I will have to merge! I suppose this excludes creating a merging macro, isn’t it so?

I am not advanced enough in Rust to understand how trait objects are linked to my problem.

I hoped there was a “simple” mistake in MergeAscending as I get ‘mismatched types’ all the time.

I observered that I can not nest MergeAscending properly in a loop because every time I get a different type out of MergeAscending::new().


#4

In addition I do not know at compile time how may vectors I will have to merge! I suppose this excludes creating a merging macro, isn’t it so?

Yes.

I observered that I can not nest MergeAscending properly in a loop because every time I get a different type out of MergeAscending::new().

Yes, you are absolutely correct why the loop does not typecheck. Trait objects are the thing that allows to use the same type after every merge, at the cost of heap allocations and indirect calls:

fn main() {
    let vv: Vec<Vec<_>> = vec![vec![1, 3, 5, 7, 9], vec![3, 4, 5, 6, 7],vec![0, 2, 6, 8]];

    // We sort of erase types here, and go from concrete `MergeAcending` to an abstract heap allocated `Iterator` 
    let mut ma: Box<Iterator<Item=i32>> = Box::new(MergeAscending::new(vv[0].iter().cloned(),vv[1].iter().cloned()));
    for v in vv.iter().skip(2) {

        ma = Box::new(MergeAscending::new(ma, v.iter().cloned()));
    };
    println!("{:?}", ma.collect::<Vec<_>>());
}

#5

Thank you a lot! I spent the whole day on this and I was about to give up. You made my day!

Your solution does the following:

  1. box::new every MergeAscending
  2. clone() every iter()

I remember having read somewhere that iterators do not allocate memory? Is this right?
How much memory is allocated in this example seeing that the program runs through 3 iter().cloned() ?
What is stored in there and when?
ma.collect() also allocates memory. What is stored in there? The result vector or a vector of pointers into
the input vectors?


#6

The iter.cloned() goes from Item=&T to Item=T. Given that T is i32 here, &T is twice as large (assuming 64 bit pointers) then T, so cloned actually saves 4 bytes of memory here (modulo potential padding) :slight_smile: I think this example can be made to work with references as well, I’ve just used cloned for simplicity.

I remember having read somewhere that iterators do not allocate memory? Is this right?

Yes, unless you use dynamic dispatch and trait objects. The tradeoff is like this:

  • Concrete types give you zero allocation and maximum performance, because the compiler can inline every call.
  • Trait objects give you more flexibility, but they are slower because of indirect calls and often times require heap allocations.

But even with trait objects iterators are lazy, and don’t keep everything in memory. That is, this still requires memory proportional to k and not to N.

As for the .collect, I’ve used it because currently you can’t use two traits for a trait object (that is, I can’t let mut ma: Box<Iterator<Item=i32> + ::std::fmt::Debug>, this can be worked around by trait DebugIter: Iterator + Debug). It allocates the vector of integers.

I usually thing about Rust iterators in the following way: when you chain filters, maps, e.t.c, your are building a datastructure (or a program), describing iteration process. Only when you .collect, or for x in the program gets executed (sortof compiled, if you use concrete iterator types, and sortof interpreted if you use trait objects).


#7

Thank you for your instructive explications! I packed the code in a macro you will find below.

BTW: This morning I looked at the problem with more distance and came to the conclusion the whole approach is not generic enough. To get it right and beautiful the iterator MergeAscending needs to be modified in a way to accept a vector of iterators as input instead of only 2 iterators. The algorithm for next() should then be:

  1. inspect all elements that are next in the different queues (without pulling them just peek())
  2. find the queue with the smallest element
  3. pull that element and return it.

The performance of this will depend on how efficient the find operation in 2. is implemented. At this point there is a lot room for smart optimisations, considering that:

  • queues can have a very different size
  • after pulling out the smallest, the resting elements are still in the same order.

Unfortunate I am not proficient enough in Rust yet, for the moment the macro solution (see above) will do. Thanks again for your help.


#8

Here the macro code I ended up with:

macro_rules! merging_iterator_from {  
    ($vv: ident) => {{    
        let mut ma: Box<Iterator<Item=_>> = Box::new($vv[0].iter().map(|&i|i));
        for v in $vv.iter().skip(1) {
            ma = Box::new(MergeAscending::new(ma, v.iter().map(|&i|i)));
        };
        ma
    }}
}

fn main() {
    let vv: Vec<Vec<_>> = 
        vec![vec![1, 3, 5, 7, 9], vec![3, 4, 6, 7], vec![0, 6, 8],vec![1, 2, 12], vec![10]];

    let mi = merging_iterator_from!(vv);
    let result = mi.collect::<Vec<_>>();
    println!("Example 2: {:?}", result);
    //prints: Example 2: [0, 1, 1, 2, 3, 3, 4, 5, 6, 6, 7, 7, 8, 9, 10, 12]
}

#9

You can also find an implementation of the heap merge algorithm that @matklad mentioned in the itertools crate: https://bluss.github.io/rust-itertools/doc/itertools/free/fn.kmerge.html


#10

@bsteinb Do I understand correctly that itertools version does not use std::collections::BinaryHeap because the lack of decrease_key operation? Is there a Rust issue for adding this method? https://github.com/rust-lang/rust/issues/28147 is similar, but this use case is not mentioned there…


#11

@bsteinb This is exactly I was looking for! Thank you.


#12

Yes. The initial version did use BinaryHeap from std, but was later rewritten to use a hand crafted heap that supports the missing operation. See https://github.com/bluss/rust-itertools/commit/18278c5bffa700216566b609f3acb851d74724c8 and https://github.com/bluss/rust-itertools/commit/df0241f4cfe562479f2af954a3576c1bdcb74cab

I did explore some designs to add key modification methods to BinaryHeap which are described here https://github.com/contain-rs/discuss/issues/13


#13

Thank you all for your help with merging vectors. Here a link to the product I needed it for:

Github: getreu/stringsext
A Unicode enhancement of the GNU strings-tool with additional features.