Multi-threaded unquantifiable tasks

Hi there!

I'm currently writing a tool that I'll call 'rsync-like' just so my case is clearer (it's not really like rsync but it's easier to understand that way).

Concretely, I have a function which lists all the items in a SFTP directory (which takes time, as the command must be sent to the server, computed and then retrieved). This is extremely slow on one thread, so I'd like to multi-thread it.

The problem is that I don't really know how to handle it. When the function gets called, it only treats one folder. Then while iterating it discovers other ones, so it calls itself recursively with these ones' path, and so on.

I've tried using a system where I spawn a thread for each sub-directory, and at the end of the function all same-level items are added to a Mutex<Vec<ItemData>>. Problem, this is even slower, as I have to spawn a new thread for each directory in the SFTP storage. It would probably be slow even with a task execution like Tokio.

Do you have any idea on how to do this? It's surely possible as many tools do it (RClone or FreeFileSync to name a few) but I have no idea how to do it...

Thanks in advance for your help!

Broadly the answer to how this kind of threading problem is solved is: Create a message queue/channel (crossbeam has some types meant for this kind of thing, though I haven't tried using it this way). When you discover new work, put it on the queue. Each worker thread takes items from the queue when possible. Instead of a recursive (depth-first) algorithm using the stack, everything goes on the queue (and will tend to execute breadth-first).

It might also be even easier to use rayon's thread pool scheduling, starting with rayon::scope. That way you don't have an explicit message queue; just “please run this closure when there's some capacity to do so”.

However, since this is an IO-heavy problem you should probably start with the assumption that using tokio will work better than engineering your own threads. It might even be the case that you want multiple tasks using multiple connections to the server, but do not need multiple threads at all (because they are spending most of their time waiting for network responses, not using the CPU).

You should also be thinking about how you manage your connections to the server — probably opening a small number of connections (independent of the number of work items), like HTTP clients do, if you are not doing that already.

2 Likes