I'm looking for the best way to walk through a directory and perform operations on every file in it. Ideally, I'd like to create a work queue and some kind of multithreaded work-stealing mechanism that grabs the content of each file once it becomes ready.
Perhaps the jwalk crate might be what you're after. It leverages rayon to process directories in parallel when possible for a measurable speed boost over the non-parallel walkdir crate.
Thanks for the suggestion. Unfortunately, I don't expect my directories to be very deep, and according to the crate's description, my use case won't gain much speed from it:
This crate's parallelism happens at the directory level. It will help when walking deep file systems with many directories. It won't help when reading a single directory with many files.
You can just throw work at rayon, e.g. using scope to wait for all of them.
Check if your OS and filesystem can even do stuff multi-threaded. For example directory scanning in macOS APFS is entirely single-threaded, so no amount of parallelization in Rust makes any difference for I/O, because it all hits a giant lock in Apple's filesystem.
So let me understand: you're suggesting that I invoke the rayon::scope function, loop through all the files, and instead of doing any work on the main thread, call s.spawn(|_| ..., which will move the work to a different thread. Is that correct?
If so, is it safe to spawn a thread for each file when, hypothetically, I have hundreds of them?
If not, could you clarify how exactly I should use the scope function?
I wonder how this is affecting jwalk and ignore on macOS 10.15? I'm using APFS and definitely see performance increase somewhat the more threads I use. Even in the unsorted case, which should be doing minimal processing besides read_dir, the multithreaded versions are over 2x faster than single-threaded walkdir.