WalkDir performance - a small experiment

Directory traversal in Rust is easy. It's made even easier by the usually recommended walkdir crate.
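For reference, the typical loop is only a few lines (a minimal sketch; the path is just a placeholder):

```rust
use walkdir::WalkDir;

fn main() {
    // Recursively walk a directory, depth first, silently skipping
    // entries that could not be read.
    for entry in WalkDir::new("/some/dir").into_iter().filter_map(|e| e.ok()) {
        println!("{}", entry.path().display());
    }
}
```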

But is this crate fast enough?

I wrote a small thing yesterday on this topic: Canop/walk_pace

I'm new to Rust, so I might have done something wrong; I'd really like to hear about this from seasoned Rustaceans.

It's an interesting benchmark, but they aren't measuring the same thing. walkdir does depth-first iteration, while your program does breadth-first iteration. The latter is generally going to use more memory, since it requires memory proportional to the fattest directory, whereas the former only requires memory proportional to the deepest directory. How exactly that impacts performance isn't clear.
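For illustration, a breadth-first walk over std::fs looks roughly like this (my own sketch, not the code from walk_pace; errors and symlinks are simply skipped). The queue is what grows with the fattest directory:

```rust
use std::collections::VecDeque;
use std::fs;
use std::path::PathBuf;

// Breadth-first traversal: the queue holds every not-yet-visited directory,
// so its size grows with the widest part of the tree.
fn walk_bfs(root: PathBuf) {
    let mut queue = VecDeque::new();
    queue.push_back(root);
    while let Some(dir) = queue.pop_front() {
        let entries = match fs::read_dir(&dir) {
            Ok(entries) => entries,
            Err(_) => continue, // unreadable directory: skip it
        };
        for entry in entries.flatten() {
            let path = entry.path();
            // file_type() on a DirEntry usually avoids an extra stat call.
            if entry.file_type().map(|t| t.is_dir()).unwrap_or(false) {
                queue.push_back(path.clone());
            }
            println!("{}", path.display());
        }
    }
}
```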

In general, walkdir should be compared against things like find and ftw. Those are roughly operating in the same space with a similar set of configurable features.

Also, walkdir does error handling, and in particular, yields elements from its iterator that could not be read. When an error occurs, the file path is copied for convenient error reporting. Generally, this isn't an issue in practice because you typically want to show the user those error messages, so it isn't wasted work. Your benchmark doesn't quite account for this, though. In particular, a benchmark over / is likely to encounter lots of errors. (Look at the stderr output of find / to get an idea.)
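To make that concrete, the iterator yields Result values, so a walk over / can surface those errors roughly like this (a sketch):

```rust
use walkdir::WalkDir;

fn main() {
    let mut errors = 0u64;
    for entry in WalkDir::new("/") {
        match entry {
            Ok(entry) => {
                // normal case: do something with the entry
                let _ = entry.path();
            }
            Err(err) => {
                // The error carries the path it failed on, which is what
                // makes it convenient to report (and what costs the copy).
                eprintln!("walk error: {}", err);
                errors += 1;
            }
        }
    }
    eprintln!("{} entries could not be read", errors);
}
```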

Unless you understand the trade-offs walkdir is making for you in directory traversal, I generally would not recommend rolling your own. At some point, your users are going to want things like "follow symlinks," or "stay on the same file system," or "don't exceed some maximum depth," or "sort my entries." All of those things are provided by walkdir out of the box.
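For instance, all of those knobs live on the WalkDir builder; something like this (the path and values are placeholders):

```rust
use walkdir::WalkDir;

fn main() {
    let walker = WalkDir::new("/some/dir")
        .follow_links(true)     // "follow symlinks"
        .same_file_system(true) // "stay on the same file system"
        .max_depth(4)           // "don't exceed some maximum depth"
        .sort_by(|a, b| a.file_name().cmp(b.file_name())); // "sort my entries"
    for entry in walker.into_iter().filter_map(|e| e.ok()) {
        println!("{}", entry.path().display());
    }
}
```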

What I meant isn't that walkdir isn't useful, especially in the specific case where you want depth-first descent, but that the cost associated with using it in place of a purpose-built iteration shouldn't be overlooked.

I'm interested in fast directory scanning (and also frequent reading of files in each dir) for my duplicate finder.

I've tried making the program multi-threaded, and found that it made it much much slower on macOS.

For faster scanning I've used yet another approach: I traverse the directories with the lowest inode numbers first (i.e. I throw them all in a BinaryHeap indexed by inode number and pop from there).
On my backup HDDs inodes are monotonically increasing, and generally files with similar inodes are stored physically close together, so traversing in inode order reduces seeks.
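In outline it looks something like this (a Unix-only sketch of the idea, not my actual code; it relies on DirEntryExt::ino and skips error handling):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::fs;
use std::io;
use std::os::unix::fs::{DirEntryExt, MetadataExt}; // ino() on entries and metadata
use std::path::PathBuf;

// Directories waiting to be read sit in a min-heap keyed by inode, so the
// next one popped has the lowest inode, which on disks where inodes were
// allocated sequentially tends to be physically nearby.
fn walk_by_inode(root: PathBuf) -> io::Result<()> {
    let mut heap = BinaryHeap::new();
    heap.push(Reverse((fs::metadata(&root)?.ino(), root)));
    while let Some(Reverse((_ino, dir))) = heap.pop() {
        let entries = match fs::read_dir(&dir) {
            Ok(entries) => entries,
            Err(_) => continue, // unreadable directory: skip it
        };
        for entry in entries.flatten() {
            if entry.file_type().map(|t| t.is_dir()).unwrap_or(false) {
                heap.push(Reverse((entry.ino(), entry.path())));
            }
            // ...read or hash the file here for the duplicate finder...
        }
    }
    Ok(())
}
```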

Oh, this is interesting. I'll try this approach in broot (not for the main BFS traversals, but for size computing), especially as I already have to take inodes into account in order to avoid counting them more than once.
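For that inode-based dedup, a minimal Unix-only sketch (the helper name is just for illustration, not what broot uses):

```rust
use std::collections::HashSet;
use std::fs::Metadata;
use std::os::unix::fs::MetadataExt;

/// Returns true the first time a given (device, inode) pair is seen,
/// so hard links to the same file are only counted once in size sums.
fn first_sighting(seen: &mut HashSet<(u64, u64)>, md: &Metadata) -> bool {
    seen.insert((md.dev(), md.ino()))
}
```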

If you need to walk an entire partition's directory tree, you can parse the file system's record of all the files. I'm not familiar with the details of how to do it, but there are a couple of utilities I use on Windows (Everything and Wiztree), and I'm fairly certain they both parse the NTFS Master File Table, which seemingly is much faster than the equivalent tools that actually walk the entire tree. With some quick googling, it looks like there may be similar data structures for every file system.

Only if you have admin permissions on the machine. Regular user accounts aren't allowed to access the raw file table, because that would allow you to bypass the filesystem permissions (seeing the names of files in a directory that you don't have read access to).
