Parallel IO data processing: fork-join with thread::spawn, rayon, or something else?

Hello everyone,
If I have a data/IO-bound job that is a great use case for fork-join, what is the best crate for the task?
I am aware of rayon and crossbeam, and I can also write my own code with thread::spawn and handle.join().
I am sure there are others too.

The use case is processing many tens of thousands of files, concurrent database (SQL) queries, and concurrent big-data streaming (Parquet, CSV, JSON files, etc.), all of which can easily be partitioned across multiple threads, with the per-thread results aggregated once all threads have finished.

I have read that rayon is recommended when the job is mostly CPU-bound.
Should I write my own code using the Rust standard library, i.e. thread::spawn() and handle.join(), or use a crate like rayon?
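
For reference, the hand-written option could look roughly like the sketch below. It uses only the standard library; the function name `total_bytes` and the choice of "sum of file sizes" as the aggregated result are illustrative assumptions, not part of any crate.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;
use std::thread;

/// Fork: split `paths` into roughly `n_threads` chunks, one thread per chunk.
/// Join: wait for every thread and add up the per-thread byte counts.
/// (`total_bytes` is a hypothetical name used only for this sketch.)
fn total_bytes(paths: Vec<PathBuf>, n_threads: usize) -> io::Result<u64> {
    let n_threads = n_threads.max(1);
    // Ceiling division so every path lands in some chunk; never zero.
    let chunk_len = ((paths.len() + n_threads - 1) / n_threads).max(1);

    let mut handles = Vec::new();
    for chunk in paths.chunks(chunk_len) {
        let chunk = chunk.to_vec(); // each thread owns its partition
        handles.push(thread::spawn(move || -> io::Result<u64> {
            let mut bytes = 0u64;
            for path in &chunk {
                bytes += fs::read(path)?.len() as u64;
            }
            Ok(bytes)
        }));
    }

    let mut total = 0u64;
    for handle in handles {
        // join() returns Err only if the worker panicked; the inner
        // io::Result carries any read error from that partition.
        total += handle.join().expect("worker thread panicked")?;
    }
    Ok(total)
}
```

Calling `total_bytes(paths, 8)` reads the files on up to eight threads and returns the combined size, or the first IO error seen while joining.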

thank you

Use Rayon when you find yourself spawning and destroying many CPU-bound threads. Rayon essentially makes that cheaper by running the work on a fixed pool of reusable worker threads.

Rayon works brilliantly when you have a big data set and want to do significant processing over elements of it that can be done independently.

But your requirements emphasise a lot of I/O: reading (and writing) thousands of files and waiting on database queries. I suspect that rayon may not be optimal for that. Perhaps asynchronous tasks with Tokio would be better.

It has been said that synchronous parallel threads are great when you have computation to do, and asynchronous tasks are great when you have a lot of waiting to do (like file reads and database requests).

I guess you have to figure out what balance of compute work vs I/O you have on your plate.

The main reason that rayon is a poor fit for IO-bound tasks is that a thread waiting for IO does not use a full CPU core's worth of computation. Since rayon by default spawns one thread per CPU core, blocking one of those threads on IO means that you aren't using all of your CPU cores.

Because of this, you typically want to put IO either on a very large thread pool, or on a single thread using something like epoll to perform many IO operations concurrently on one thread.
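
As a rough illustration of the "very large thread pool" option, here is a std-only sketch in which more threads than CPU cores pull jobs from a shared queue. The names `blocking_io` and `run_on_io_pool` are made up for this example, and the sleep is only a stand-in for a real blocking file read or database query.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

/// Stand-in for a blocking call (file read, SQL query): mostly waiting,
/// almost no CPU. Hypothetical helper for this sketch.
fn blocking_io(job: u64) -> u64 {
    thread::sleep(Duration::from_millis(20));
    job * 2
}

/// Run `jobs` on `pool_size` OS threads that pull from a shared queue.
/// `pool_size` can be much larger than the core count because the
/// threads spend their time waiting rather than computing.
fn run_on_io_pool(jobs: Vec<u64>, pool_size: usize) -> Vec<u64> {
    let queue = Mutex::new(VecDeque::from(jobs));
    let results = Mutex::new(Vec::new());

    thread::scope(|s| {
        for _ in 0..pool_size.max(1) {
            s.spawn(|| loop {
                // Hold the lock only long enough to take one job.
                let job = queue.lock().unwrap().pop_front();
                match job {
                    Some(job) => {
                        let out = blocking_io(job);
                        results.lock().unwrap().push(out);
                    }
                    None => break, // queue drained, worker exits
                }
            });
        }
    }); // leaving the scope joins every worker

    let mut out = results.into_inner().unwrap();
    out.sort_unstable(); // completion order is arbitrary, so sort
    out
}
```

With 64 jobs and `pool_size` set to 32, the waits overlap into a couple of rounds instead of 64 sequential sleeps, even on a machine with far fewer than 32 cores.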

Generally if you have both a lot of IO and a lot of CPU-bound computation, then it's best to split them up so that IO-bound happens on Tokio and CPU-bound happens on rayon. If you have a lot of file reading, you might want to try out tokio-uring for your file reading.
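
The shape of that split can be sketched with the standard library alone. To be clear, this is not Tokio or rayon code: plain threads stand in for both sides, and `fetch`, `crunch`, and `pipeline` are hypothetical names. The point is only that the IO side can be oversubscribed while the CPU side is sized to the core count, with a channel between them.

```rust
use std::sync::{mpsc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical IO step: waits (as a file read or query would), then yields bytes.
fn fetch(id: u32) -> Vec<u8> {
    thread::sleep(Duration::from_millis(10));
    vec![id as u8; 1024]
}

/// Hypothetical CPU step: a pure computation over the fetched bytes.
fn crunch(data: &[u8]) -> u64 {
    data.iter().map(|&b| u64::from(b)).sum()
}

fn pipeline(ids: Vec<u32>, io_threads: usize) -> u64 {
    // CPU side: one worker per available core (falls back to 1).
    let cpu_threads = thread::available_parallelism().map_or(1, |n| n.get());
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    let rx = Mutex::new(rx); // lets several CPU workers share one receiver
    let ids = Mutex::new(ids.into_iter());

    thread::scope(|s| {
        // IO side: may use many more threads than cores.
        for _ in 0..io_threads.max(1) {
            let tx = tx.clone();
            let ids = &ids;
            s.spawn(move || loop {
                let next = ids.lock().unwrap().next();
                match next {
                    Some(id) => tx.send(fetch(id)).expect("receiver dropped"),
                    None => break,
                }
            });
        }
        drop(tx); // channel closes once the last IO thread finishes

        let mut workers = Vec::new();
        for _ in 0..cpu_threads {
            workers.push(s.spawn(|| {
                let mut sum = 0u64;
                loop {
                    let msg = rx.lock().unwrap().recv();
                    match msg {
                        Ok(data) => sum += crunch(&data),
                        Err(_) => break, // all senders gone, no more data
                    }
                }
                sum
            }));
        }
        // Join the CPU workers and aggregate their partial sums.
        workers.into_iter().map(|w| w.join().unwrap()).sum::<u64>()
    })
}
```

In a real program the IO half would be Tokio tasks and the CPU half a rayon pool, as suggested above; the channel hand-off between the two halves is the part this sketch is meant to show.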

thank you everyone,

I will experiment with the options that you have suggested on this thread (pardon the pun!!)

I've switched https://lib.rs from using rayon for a mix of CPU and IO tasks to using tokio for everything (with spawn_blocking, plus leftovers of block_in_place from the rewrite). It works fine.

One click (expected to be quick, but devolved into a deep dive : ) and I'm converted to lib.rs! Nice site!
