Data structure for concurrency

I'm trying to write a program that will eventually run concurrently on multiple cores.
I have a dataset that consists of multiple files, and I'd like to parallelize loading the data.

So I defined a struct like this:

struct Dataset {
    items: Arc<Mutex<Vec<DataItem>>>,
}

That lets me load data in parallel.
Later on I'd also like to do other operations on my data in parallel. I quickly learned that I can't write a getter that returns a reference to a single DataItem when the definition above is used. After thinking about it a bit, that seems correct.

I can make that work if I change the struct definition to this:

struct Dataset {
    items: Arc<Mutex<Vec<Arc<Mutex<DataItem>>>>>,
}

That just looks insane!
Is there a better way to solve this than to bring out the Arc<Mutex<_>> hammer?

Well, you could clean it up a bit with:

use std::sync::{Arc, Mutex};

type Shared<T> = Arc<Mutex<T>>;

struct SharedDataItem(Shared<DataItem>);

struct Dataset {
    items: Shared<Vec<SharedDataItem>>,
}

You could also investigate refactoring to avoid Mutexes entirely and instead pass data between threads over channels, as in the actor pattern.
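
As a rough illustration of the channel idea, here is a minimal sketch using only `std::sync::mpsc`. One thread owns the `Vec` outright and everyone else talks to it through messages, so no `Mutex` is needed. The `Command` enum, `spawn_dataset_actor`, and the stand-in `DataItem` are names made up for this example, not something from the original code.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the real DataItem type.
#[derive(Debug, Clone, PartialEq)]
struct DataItem {
    value: i64,
}

// Messages the owning thread understands (hypothetical names).
enum Command {
    Push(DataItem),
    // Carries a reply channel so the actor can send back a copy of the item.
    Get(usize, mpsc::Sender<Option<DataItem>>),
}

/// Spawns the actor thread that exclusively owns the Vec.
fn spawn_dataset_actor() -> (mpsc::Sender<Command>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let mut items: Vec<DataItem> = Vec::new();
        // The loop ends once every Sender has been dropped.
        for cmd in rx {
            match cmd {
                Command::Push(item) => items.push(item),
                Command::Get(index, reply) => {
                    // Ignore the error if the requester has gone away.
                    let _ = reply.send(items.get(index).cloned());
                }
            }
        }
    });
    (tx, handle)
}
```

Worker threads would each hold a clone of the `Sender<Command>`. The trade-off is that reads hand back copies rather than references, which may or may not suit the data.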

This structure is technically correct for shared mutable access. The only way to make it better is to avoid sharing.

For example, can you load the data into multiple independent Vecs? e.g.

struct Dataset {
    item_batches: Vec<Vec<DataItem>>,
}

This way each (inner) Vec could be loaded by a separate thread, and you'd collect these at the end:

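// Needs `use rayon::prelude::*;`. The annotation tells collect() what to build.
let item_batches: Vec<Vec<DataItem>> = files.into_par_iter().map(load_file).collect();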

Or if you want Vec<DataItem>, you can concatenate multiple independent Vecs into one - on a single thread, after all loader threads finish.

rayon has plenty of parallel iterator methods for this.
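
For comparison, the same load-then-concatenate idea can be sketched with only the standard library, using `std::thread::scope` instead of rayon. This is a swapped-in technique for illustration, not the rayon approach itself: it spawns one thread per file, whereas rayon would schedule the work on a thread pool. The `load_file` body and the stand-in `DataItem` here are placeholders I made up.

```rust
use std::thread;

// Hypothetical stand-in for the real DataItem type.
#[derive(Debug, Clone, PartialEq)]
struct DataItem {
    value: i64,
}

// Hypothetical loader: a real one would read and parse the file at `path`.
// Here it just produces one item per character of the path.
fn load_file(path: &str) -> Vec<DataItem> {
    (0..path.len() as i64).map(|value| DataItem { value }).collect()
}

/// Loads each file on its own thread, then flattens the batches in file order.
fn load_all(files: &[&str]) -> Vec<DataItem> {
    let batches: Vec<Vec<DataItem>> = thread::scope(|s| {
        // Spawn all loaders first so they run concurrently.
        let handles: Vec<_> = files
            .iter()
            .map(|&path| s.spawn(move || load_file(path)))
            .collect();
        // Join in spawn order, which keeps the batches in file order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    // Single-threaded concatenation after every loader has finished.
    batches.into_iter().flatten().collect()
}
```

No `Arc` or `Mutex` appears anywhere, because each loader thread owns its own `Vec` until it hands it back through `join`.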

Thanks for the suggestions!

I think I'll go with the idea from @RedDocMD for now.
That way I can attach relevant manipulation methods to the wrapping struct and hide the locking from callers.
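
In case it helps anyone finding this later, here is a minimal sketch of what hiding the locking could look like. Instead of a getter that returns a reference (which can't outlive the lock guard), the wrapper takes a closure and runs it while the item is locked. The `with_item` and `push` methods and the stand-in `DataItem` are names I made up for this example.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the real DataItem type.
#[derive(Debug, Clone, PartialEq)]
struct DataItem {
    value: i64,
}

type Shared<T> = Arc<Mutex<T>>;

#[derive(Clone)]
struct Dataset {
    items: Shared<Vec<Shared<DataItem>>>,
}

impl Dataset {
    fn new() -> Self {
        Dataset { items: Arc::new(Mutex::new(Vec::new())) }
    }

    /// Appends an item; the outer lock is held only for the push.
    fn push(&self, item: DataItem) {
        self.items.lock().unwrap().push(Arc::new(Mutex::new(item)));
    }

    /// Runs `f` on the item at `index` while holding only that item's lock.
    /// Returns `None` if the index is out of bounds.
    fn with_item<R>(&self, index: usize, f: impl FnOnce(&mut DataItem) -> R) -> Option<R> {
        // Clone the inner Arc; the outer guard is dropped at the end of this statement.
        let item = self.items.lock().unwrap().get(index).cloned()?;
        let mut guard = item.lock().unwrap();
        Some(f(&mut *guard))
    }
}
```

A call then looks like `dataset.with_item(3, |item| item.value += 1)`. The `unwrap()` calls panic if a lock was poisoned by a panicking thread, which may need different handling in real code.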