WalkDir performance - a small experiment

Directory traversal in Rust is easy. It's made even easier by the usually recommended walkdir crate.
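For reference, the typical loop is only a few lines (a minimal sketch; the path is just a placeholder):

```rust
use walkdir::WalkDir;

fn main() {
    // Recursively walk a directory, depth first, silently skipping
    // entries that could not be read.
    for entry in WalkDir::new("/some/dir").into_iter().filter_map(|e| e.ok()) {
        println!("{}", entry.path().display());
    }
}
```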

But is this crate fast enough?

I wrote a small thing yesterday on this topic: Canop/walk_pace

I'm new to Rust, so I might have done something wrong; I'd really like to hear about this from seasoned Rustaceans.

It's an interesting benchmark, but they aren't measuring the same thing. walkdir does depth-first iteration, while your program does breadth-first iteration. The latter is generally going to use more memory, since it requires memory proportional to the fattest directory, whereas the former only requires memory proportional to the deepest directory. How exactly that impacts performance isn't clear.
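For illustration, a breadth-first walk over std::fs looks roughly like this (my own sketch, not the code from walk_pace; errors and symlinks are simply skipped). The queue is what grows with the fattest directory:

```rust
use std::collections::VecDeque;
use std::fs;
use std::path::PathBuf;

// Breadth-first traversal: the queue holds every not-yet-visited directory,
// so its size grows with the widest part of the tree.
fn walk_bfs(root: PathBuf) {
    let mut queue = VecDeque::new();
    queue.push_back(root);
    while let Some(dir) = queue.pop_front() {
        let entries = match fs::read_dir(&dir) {
            Ok(entries) => entries,
            Err(_) => continue, // unreadable directory: skip it
        };
        for entry in entries.flatten() {
            let path = entry.path();
            // file_type() on a DirEntry usually avoids an extra stat call.
            if entry.file_type().map(|t| t.is_dir()).unwrap_or(false) {
                queue.push_back(path.clone());
            }
            println!("{}", path.display());
        }
    }
}
```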

In general, walkdir should be compared against things like find and ftw. Those are roughly operating in the same space with a similar set of configurable features.

Also, walkdir does error handling, and in particular, yields elements from its iterator that could not be read. When an error occurs, the file path is copied for convenient error reporting. Generally, this isn't an issue in practice because you typically want to show the user those error messages, so it isn't wasted work. Your benchmark doesn't quite account for this, though. In particular, a benchmark over / is likely to encounter lots of errors. (Look at the stderr output of find / to get an idea.)
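To make that concrete, the iterator yields Result values, so a walk over / can surface those errors roughly like this (a sketch):

```rust
use walkdir::WalkDir;

fn main() {
    let mut errors = 0u64;
    for entry in WalkDir::new("/") {
        match entry {
            Ok(entry) => {
                // normal case: do something with the entry
                let _ = entry.path();
            }
            Err(err) => {
                // The error carries the path it failed on, which is what
                // makes it convenient to report (and what costs the copy).
                eprintln!("walk error: {}", err);
                errors += 1;
            }
        }
    }
    eprintln!("{} entries could not be read", errors);
}
```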

Unless you understand the trade-offs walkdir is making for you in directory traversal, I generally would not recommend rolling your own. At some point, your users are going to want things like "follow symlinks," or "stay on the same file system," or "don't exceed some maximum depth," or "sort my entries." All of those things are provided by walkdir out of the box.
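For instance, all of those knobs live on the WalkDir builder; something like this (the path and values are placeholders):

```rust
use walkdir::WalkDir;

fn main() {
    let walker = WalkDir::new("/some/dir")
        .follow_links(true)     // "follow symlinks"
        .same_file_system(true) // "stay on the same file system"
        .max_depth(4)           // "don't exceed some maximum depth"
        .sort_by(|a, b| a.file_name().cmp(b.file_name())); // "sort my entries"
    for entry in walker.into_iter().filter_map(|e| e.ok()) {
        println!("{}", entry.path().display());
    }
}
```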

What I meant isn't that walkdir isn't useful, especially in the specific case where you want depth-first descent, but that the cost associated with using it in place of a purpose-built iteration shouldn't be overlooked.

I'm interested in fast directory scanning (and also frequent reading of files in each dir) for my duplicate finder.

I've tried making the program multi-threaded, and found that it made it much much slower on macOS.

For faster scanning I've used yet another approach: I traverse the directories with the lowest inode numbers first (i.e. I throw them all in a BinaryHeap indexed by inode number and pop from there).
On my backup HDDs inodes are monotonically increasing, and generally files with similar inodes are stored physically close together, so traversing in inode order reduces seeks.
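In outline it looks something like this (a Unix-only sketch of the idea, not my actual code; it relies on DirEntryExt::ino and skips error handling):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::fs;
use std::io;
use std::os::unix::fs::{DirEntryExt, MetadataExt}; // ino() on entries and metadata
use std::path::PathBuf;

// Directories waiting to be read sit in a min-heap keyed by inode, so the
// next one popped has the lowest inode, which on disks where inodes were
// allocated sequentially tends to be physically nearby.
fn walk_by_inode(root: PathBuf) -> io::Result<()> {
    let mut heap = BinaryHeap::new();
    heap.push(Reverse((fs::metadata(&root)?.ino(), root)));
    while let Some(Reverse((_ino, dir))) = heap.pop() {
        let entries = match fs::read_dir(&dir) {
            Ok(entries) => entries,
            Err(_) => continue, // unreadable directory: skip it
        };
        for entry in entries.flatten() {
            if entry.file_type().map(|t| t.is_dir()).unwrap_or(false) {
                heap.push(Reverse((entry.ino(), entry.path())));
            }
            // ...read or hash the file here for the duplicate finder...
        }
    }
    Ok(())
}
```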

Oh, this is interesting. I'll try this approach in broot (not for the main BFS traversals, but for size computing), especially as I already have to take inodes into account in order to avoid counting them more than once.
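For that inode-based dedup, a minimal Unix-only sketch (the helper name is just for illustration, not what broot uses):

```rust
use std::collections::HashSet;
use std::fs::Metadata;
use std::os::unix::fs::MetadataExt;

/// Returns true the first time a given (device, inode) pair is seen,
/// so hard links to the same file are only counted once in size sums.
fn first_sighting(seen: &mut HashSet<(u64, u64)>, md: &Metadata) -> bool {
    seen.insert((md.dev(), md.ino()))
}
```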

If you need to walk an entire partition's directory tree, you can parse the file system's record of all the files. I'm not familiar with the details of how to do it, but there are a couple of utilities I use on Windows (Everything and Wiztree), and I'm fairly certain they both parse the NTFS Master File Table, which seemingly is much faster than the equivalent tools that actually walk the entire tree. With some quick googling, it looks like there may be similar data structures for every file system.

Only if you have admin permissions on the machine. Regular user accounts aren't allowed to access the raw file table, because that would allow you to bypass the filesystem permissions (seeing the names of files in a directory that you don't have read access to).
