What's the fastest way to read a lot of files?

I'm looking for the best way to walk through a directory and perform an operation on every file in it. Ideally, I'd like to create a work queue and some kind of multithreaded work-stealing mechanism that picks up the contents of each file as soon as it is ready.

What is the best way to approach this?

Perhaps the jwalk crate might be what you're after. It leverages rayon to process directories in parallel when possible for a measurable speed boost over the non-parallel walkdir crate.

Thanks for the suggestion. Unfortunately, I don't expect my directories to be very deep, and according to the crate's description, my use case won't see much of a speedup:

This crate's parallelism happens at the directory level. It will help when walking deep file systems with many directories. It won't help when reading a single directory with many files.

You can just throw the work at rayon, e.g. using rayon::scope to wait for all the tasks to finish.

Check if your OS and filesystem can even do stuff multi-threaded. For example directory scanning in macOS APFS is entirely single-threaded, so no amount of parallelization in Rust makes any difference for I/O, because it all hits a giant lock in Apple's filesystem.
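
One way to check is simply to measure. Below is a minimal sketch, using only the standard library, that reads every file in a directory once sequentially and once split across a few threads, then prints both timings. The function names (`list_files`, `total_bytes`, `total_bytes_threaded`, `timed`) and the thread count of 4 are illustrative assumptions, not part of any crate mentioned in this thread. Keep in mind that the second pass benefits from the OS file cache, so run it several times, or swap the order, before drawing conclusions.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::thread;
use std::time::{Duration, Instant};

/// Collect the regular files directly inside `dir` (no recursion).
fn list_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        if entry.file_type()?.is_file() {
            files.push(entry.path());
        }
    }
    Ok(files)
}

/// Read every file in `files` and return the total number of bytes read.
/// Files that cannot be read are skipped.
fn total_bytes(files: &[PathBuf]) -> u64 {
    files
        .iter()
        .filter_map(|p| fs::read(p).ok())
        .map(|bytes| bytes.len() as u64)
        .sum()
}

/// Same work as `total_bytes`, split into contiguous chunks, one per thread.
fn total_bytes_threaded(files: &[PathBuf], threads: usize) -> u64 {
    let threads = threads.max(1);
    let chunk_len = ((files.len() + threads - 1) / threads).max(1);
    thread::scope(|s| {
        // Spawn all workers first, then join them, so they run concurrently.
        let handles: Vec<_> = files
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || total_bytes(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<u64>()
    })
}

/// Run `f` once and return its result with the elapsed wall-clock time.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

fn main() -> io::Result<()> {
    // Directory to scan: first command-line argument, or the current directory.
    let dir = std::env::args().nth(1).unwrap_or_else(|| ".".into());
    let files = list_files(Path::new(&dir))?;
    let (seq, seq_time) = timed(|| total_bytes(&files));
    let (par, par_time) = timed(|| total_bytes_threaded(&files, 4));
    println!("sequential: {seq} bytes in {seq_time:?}");
    println!("4 threads:  {par} bytes in {par_time:?}");
    Ok(())
}
```

If the threaded timing is no better than the sequential one on your machine, adding more parallelism in Rust is unlikely to help for that filesystem.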

So, let me check that I understand: you're suggesting that I call rayon::scope, loop over all the files, and, instead of doing the work on the main thread, call s.spawn(|_| ...) for each file, which moves the work to a different thread. Is that correct?

If so, is it safe to spawn a thread for each file when I have, say, hundreds of them?
If not, could you clarify how exactly I should use the scope function?

Rayon will not spawn a thread for every task you submit through scope. It runs them on a thread pool.
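
To illustrate the difference between a pool and a thread per file, here is a rough standard-library sketch (not rayon's actual implementation): a fixed number of workers repeatedly claim the next file index from a shared atomic counter until the list is exhausted. The name `count_matches` and the choice of 4 workers are made up for the example. Rayon's real scheduler uses per-thread queues with work stealing, so treat this only as a simplified stand-in.

```rust
use std::fs;
use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Count the files in `files` whose text contains `needle`, using a fixed
/// number of worker threads regardless of how many files there are.
fn count_matches(files: &[PathBuf], needle: &str, workers: usize) -> usize {
    let next = AtomicUsize::new(0); // index of the next unclaimed file
    let matches = AtomicUsize::new(0); // running count of matching files

    thread::scope(|s| {
        for _ in 0..workers.max(1) {
            s.spawn(|| loop {
                // Claim one file; stop once the list is exhausted.
                let i = next.fetch_add(1, Ordering::Relaxed);
                let Some(path) = files.get(i) else { break };
                // Files that are unreadable or not valid UTF-8 are skipped.
                if let Ok(text) = fs::read_to_string(path) {
                    if text.contains(needle) {
                        matches.fetch_add(1, Ordering::Relaxed);
                    }
                }
            });
        }
        // Leaving the scope joins every worker.
    });

    matches.load(Ordering::Relaxed)
}

fn main() {
    // Gather the regular files in the current directory.
    let files: Vec<PathBuf> = fs::read_dir(".")
        .expect("cannot read directory")
        .filter_map(|e| e.ok())
        .map(|e| e.path())
        .filter(|p| p.is_file())
        .collect();

    let n = count_matches(&files, "hello world", 4);
    println!("{n} of {} files contain the phrase", files.len());
}
```

With hundreds or thousands of files, this still only ever runs 4 threads; each worker simply takes another file when it finishes the previous one.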

Rayon schedules tasks on a work-stealing thread pool. scope is just a layer on top of that which makes it easier to wait for all the relevant tasks to finish.

@alice @kornel Thanks.

Doing what ripgrep does is probably a good bet: https://blog.burntsushi.net/ripgrep/#gathering-files-to-search

Here's how you can use jwalk for this:

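use jwalk::Parallelism;
use rayon::prelude::*; // brings par_bridge() into scope

// linux_dir() is not defined here; it stands for whatever returns the directory to walk.
let matches = jwalk::WalkDir::new(linux_dir())
    // Walk directories on a new rayon pool; 0 should leave the thread count at rayon's default.
    .parallelism(Parallelism::RayonNewPool(0))
    .into_iter()
    // Hand each entry to rayon so the files are read in parallel.
    .par_bridge()
    .filter_map(|dir_entry_result| {
        let dir_entry = dir_entry_result.ok()?;
        if dir_entry.file_type().is_file() {
            let path = dir_entry.path();
            // Files that cannot be read as UTF-8 text are skipped.
            let text = std::fs::read_to_string(path).ok()?;
            if text.contains("hello world") {
                return Some(true);
            }
        }
        None
    })
    .count();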

I hope you can try that out, benchmark it against a more custom solution, and report back the results. I'm curious! :)

I wonder how this affects jwalk and ignore on macOS 10.15. I'm using APFS and definitely see performance increase somewhat the more threads I use. Even in the unsorted case, which should be doing minimal processing besides read_dir, the multithreaded versions are over 2x faster than single-threaded walkdir.
