Multithreaded filtering of a vector


#1

I have a big Data vector and would like to filter it across many threads, what would be the best way to go about this?

struct Datum {
    id: usize,
    val: String,
}

type Data = Vec<Datum>;

Thread 0 would test the items in the range (0..n), thread 1 would test from (n..2n), so on and so forth.

fn filter(data: Arc<Data>) -> Data {
    let n = 5;
    let (tx, rx) = mpsc::channel();

    for i in 0..2 {
        let (data, tx) = (data.clone(), tx.clone());

        thread::spawn(move || {
            let result: Vec<Datum> = test_datums(datums[i * n..(i + 1) * n]);
            tx.send(result)
        });
    }

    let mut results = vec![];
    for _ in 0..2 {
        results.extend(rx.recv().unwrap());
    }
    results
}

IIUC this will require cloning any Datum that passes the filter predicate. If the predicate was |d| true it would double the memory consumed by my application.

Is it possible to make this operation zero copy? Can filter return a Vec<&Datum> which are references to Datums contained within Data?

(I’ve tried a few different approaches, but I always end up in spot where the references do not live long enough)

Thanks a lot, any help would be appreciated.


#2

I have yet to read the code and understand it but here is a library, at first, it sounds like it might help you: https://github.com/nikomatsakis/rayon


#3

Another nice library for data parallel processing is simple_parallel

Unfortunately it does not include a ready to use parallel filter.

Is it possible to make this operation zero copy? Can filter return a Vec<&Datum> which are references to Datums contained within Data?

Yes! The trick here is to ensure that a spawned thread does not outlive the filter function. For this, you can use a scope function from crossbeam. BTW, there is a dramatic and fascinating story behind the scope api.


#4

thread::spawn's signature in the docs will give you more info. The closure it takes is marked F: 'static which basically means that the closure must contain or capture no references, and the same is true for the return value (or any channels you capture). So thread::spawn simply disallows, using the rust type system, borrowed data to travel between threads.

If you’re curious, scoped threads (crate crossbeam) allows borrowing across threads, and I believe the previously mentioned rayon does too.