Feedback on crate for parallel recursive directory walk

#1

I have an in progress recursive directory walk crate jwalk. It’s distinguishing features are:

  • Walk is performed in parallel using rayon
  • Results are streamed in sorted order

It is inspired by both walkdir and ignore. I attempts to match (or better) the performance of ignore::WalkParallel while also providing streamed sorted results like walkdir. Use cargo bench to see relative performance for walking linux repo. Thanks @BurntSushi!

It’s incomplete, missing many of the details provided by those crates. In particular it isn’t filtering results based on ignore rules. But I “think” the code is flexible enough to support these things cleanly.

I’m looking for feedback:

  1. Would anyone want to use this over what’s already available? Should I make the effort to complete and publish it.

  2. This is my first maybe public crate, I’d love for anyone to look through the code and tell me how to make it better.

  3. I’m particularly interested in getting best possible performance while using rayon. Performance is good right now on my computer. But I don’t really understand all the details in the rayon stacks when profiling… is utilization good or time being wasted somewhere?

1 Like
#2

I’m interested in faster scanning for dupe-krill duplicate remover.

I’ve prototyped my parallel scanner on macOS, and the results were utterly disappointing: CPU usage jumped from 1% to 70% (all spent in rayon), and overall I/O-bound speed did not change at all ;(

I don’t think that’s rayon’s fault. I suspect macOS’s filesystem has just a giant global lock that defeats parallelization. If you’ve seen speedups on macOS, I’d love to hear about them.

#3

I’m doing all my dev/testing on macOS and have seen improvements as compared with walkdir and ignore. I’ve only tested on SSD drives, maybe different and worse results when not SSD.

You should be able to run my benchmarks (on macOS at least) with cargo bench.

On my iMac Pro this crate is a bit over 2x faster then walkdir when walking linux git checkout for a simple walk.

But if I make the walk a bit harder… load metadata of each file and sort files by name in each directory, then this crate gets closer to 4x faster.

1 Like
#4

You aren’t the only one: Global Kernel Locks in APFS.

You two may want to ensure you are both talking about HFS+ or APFS.

1 Like
#5

I ran the benchmark. My system is Fedora 29

Dir_sorted_metadata                        
                        time:   [187.53 ms 187.95 ms 188.33 ms]

ignore::WalkParallel_sorted_metadata                                                                          
                        time:   [93.613 ms 95.382 ms 97.137 ms]

jwalk::WalkDir          time:   [25.458 ms 27.012 ms 29.082 ms]                          

jwalk::WalkDir_preload_metadata                                                                           
                        time:   [36.755 ms 38.577 ms 40.554 ms]

jwalk::WalkDir_first_1000                                                                           
                        time:   [4.9780 ms 5.1599 ms 5.3930 ms]
#6

My drive is formatted using APFS, maybe things improved recently? Here are my benchmark results. As you can see the parallel versions of both ignore and my jwalk definitely improve performance:

walkdir::WalkDir        time:   [160.29 ms 161.21 ms 161.80 ms]                           

walkdir::WalkDir_sorted_metadata                                                                          
                        time:   [421.46 ms 425.63 ms 430.05 ms]

ignore::WalkParallel_sorted_metadata                                                                          
                        time:   [179.43 ms 181.09 ms 181.98 ms]

jwalk::WalkDir          time:   [64.112 ms 64.351 ms 64.496 ms]                          

jwalk::WalkDir_preload_metadata                                                                          
                        time:   [106.82 ms 107.69 ms 108.56 ms]

jwalk::WalkDir_first_1000                                                                           
                        time:   [14.568 ms 14.637 ms 14.723 ms]
#7

I’ve cleaned the code up and added a lot of documentation since my initial post. It still doesn’t support all the features of ignore or walkdir. But it covers my use case well. I’m going to user it for a while myself, then will likely publish it.

Would be great to have some others using it before then to make sure it really is working.

Why would you use this crate?

Performance is the main reason you would choose this crate. The following benchmarks time a walk over linux’s source code under various conditions. You can run these benchmarks yourself using cargo bench.

Note in particular that jwalk is fast when you want streamed sorted results. Also note that even when used in single thread mode jwalk is close to walkdir in performance.

Also note that even though the ignore crate has similar performance to jwalk is has much worse latency when you want sorted results. jwalk will start streaming sorted results right away, while with ignore you’ll need to wait until the entire walk finishes before you can sort and start processing the results in sorted order.

Crate Options Time
jwalk unsorted, parallel 60.811 ms
jwalk sorted, parallel 61.445 ms
jwalk sorted, parallel, metadata 100.95 ms
jwalk unsorted, parallel (2 threads) 99.998 ms
jwalk unsorted, serial 168.68 ms
jwalk sorted, parallel, first 100 9.9794 ms
ignore unsorted, parallel 74.251 ms
ignore sorted, parallel 99.336 ms
ignore sorted, parallel, metadata 134.26 ms
walkdir unsorted 162.09 ms
walkdir sorted 200.09 ms
walkdir sorted, metadata 422.74 ms