Testable way to iterate over a directory?

I have a function that recursively visits and calls a function on every leaf node in a directory, i.e. files and empty directories.

pub fn walk_leaf_paths<F>(
    path: &path::PathBuf,
    mut func: F
) -> io::Result<()>
where
    F: FnMut(path::PathBuf) -> io::Result<()>
{
    let mut dirs = Vec::new();
    let mut dir_index = 0;
    let mut dir_reader = fs::read_dir(path)?;
    let mut had_files = false;

    loop {
        match dir_reader.next() {
            Some(entry) => {
                let cur_path = entry?.path();
                had_files = true;

                if cur_path.is_dir() {
                    dirs.push(cur_path);
                    continue;
                }

                if cur_path.is_file() {
                    func(cur_path)?;
                    continue;
                }
            }
            _ => {
                if !had_files && !dirs.is_empty() {
                    // if `path` had no children, `dir_index` would
                    // already be 0
                    func(dirs[(dir_index - 1).max(0)].to_owned())?;
                }

                if dir_index == dirs.len() {
                    break;
                }

                dir_reader = dirs[dir_index].read_dir()?;
                had_files = false;
                dir_index += 1;
            }
        }
    }

    Ok(())
}

The problem is, the unit tests are fragile and require an actual folder structure to exist on disk. I know there's the WalkDir crate, but I want it to work with a second, made up type similar to DirEntry to enable a better testing experience. In tests I'd specify an entry like:

let entry = FakeDirEntry {
  path: std::path::PathBuf::from("a/b/c"),
  content: "some content"
}

Another goal is to see if the previous function can be converted to an iterator. If it can, some tests could use a Vec<FakeDirEntry> in them. The other option is explained further down.

When I think of how it could be made into an iterator, I can see having a loop to traverse the DirEntrys from fs::read_dir(), but then the iterator itself would require another loop to consume it. Whereas, the function above does all of that work in a single loop.

The other option I guess is to forget the iterator approach and add a children property on FakeDirEntry and add a second type similar to ReadDir to iterate over child nodes. That would also mean my tests would need nested Vecs. I started this approach, and it feels like a lot of wrapping for the purpose of testing; I'm not sure how it'll affect runtime performance.

Is there a way to achieve this with an iterator using only one loop instead of two?

Maybe there's a different approach I'm not thinking about?

The simplest option I know for mocking out the file system (and any other environment) is to use cfg to switch out the fs module for unit tests:

#[cfg(not(test))]
use std::fs;

#[cfg(test)]
mod fs {
  pub fn read_dir() ...
}

Note that integration tests (in the tests dir) don't use a cfg(test) build of your crate, so you can test with and without mocking.

I think if you are trying to test walking directories, you better test walking an actual directory, since even I/O errors can affect control flow, and as such, potentialy the walking logic itself. You can reliably get a sandbox directory nobody else will touch by creating a temporary directory using one of the suitable crates. You could even get fancy and try to reach out to isolation facilities of the OS, container-style, but IMO that's probably more paranoid than what is necessary for a test.

Of course. Just pull out the state represented by local variables into a struct and wrap the loop body in next(). You might even be able to remove the loop itself. However, I think your current solution is way more complicated than it needs to be. You could try this, and see how it fits with your custom filesystem types:

pub fn walk_leaf_paths(path: &Path) -> io::Result<LeafWalker> {
    let dirs = VecDeque::new();
    let dir_reader = path.read_dir()?;
    
    Ok(LeafWalker { dirs, dir_reader })
}

pub struct LeafWalker {
    dirs: VecDeque<PathBuf>,
    dir_reader: ReadDir,
}

impl Iterator for LeafWalker {
    type Item = io::Result<PathBuf>;
    
    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match try_opt!(self.dir_reader.next().transpose()) {
                Some(entry) => {
                    let ty = try_opt!(entry.file_type());
                    if ty.is_dir() {
                        self.dirs.push_front(entry.path());
                    }
                    if ty.is_file() {
                        return Some(Ok(entry.path()));
                    }
                }
                None => {
                    let next_dir = self.dirs.pop_back()?;
                    self.dir_reader = try_opt!(next_dir.read_dir());
                }
            }
        }
    }
}

4 Likes

If you're properly paranoid, you do both: unit tests that use mocks to inject io errors wherever they can be, and integration tests to check that it does actually work, but that's just regular testing guidance, nothing to do specifically with directories.

1 Like

For mocking the filesystem, you could have a look at:

You make good points about accounting for I/O errors in tests, and I do have a dedicated test using a real directory in my project for walking directories. The pain is in another function that calls walk_leaf_paths(), and that function using the callback to do part of its work. Now all of a sudden, the other function is dependent on I/O too.

If I didn't do this, I'd need one loop to get the paths, and another loop to analyze them (unless I can pass an iterator).

I also tried the playground you created, thanks for taking the time to do that. The issue is it doesn't return empty directories like my previous version does.

I tried adding return Some(Ok(next_dir)) to the top level None match case, but it prints the directory even when it has children. Also, I don't understand the transpose() you have, aren't you just undoing the transpose in try_opt!()?

Anyway, I created a playground with my old function to better explain the expected outcome.

Thanks for showing me this, I may need it. But also, I hope not.. Mocking always feels dirty..

I modified my code so that it does return empty directories while still avoiding to manually implement a queue via indexing the vector.

I am, but that doesn't change the fact that I want to try the inner Result and not the outer Option. The other two uses of try_opt() get (and require) a Result anyway, so it's nicer to be able to reuse try_opt! there at the cost of a transpose, instead of writing out the distracting match one more time.

Thanks! I'm a little slow, but after stepping through it in the debugger I finally wrapped my head around how you were catching empty_dir lol.

The benefits of it being an iterator I think outweigh the fact that it requires an additional loop (1 internal, 1 external). I was curious though, so I updated my original function to be as similar to your solution as possible and ran a benchmark with Criterion.

To my surprise, your iterator was faster than the function.

Callback: 80.797 µs, Iterator: 52.343 µs

It's a bit off-topic, but how can that be?

I found this in another topic about iterators sometimes being faster than loops, but I thought it'd only apply if the iterator body was basically the same as the loop body.

The handling of empty dirs is nothing particularly complicated, you just have to know your way around the standard library functions. I basically always start out by assuming that the directory is empty, and store its path. If it turns out not to be empty, then I set it to None and that's it. (The only special case that needs to be handled is the last iteration, because there I have to check it one more time instead of immediately returning None when the queue becomes empty.)

I don't know. You haven't posted the exact pieces of code you are comparing, and performance might depend on small, surprising details. It might be that the compiler is less able to optimize the callback because the generated code triggers some arbitrary threshold of code complexity where the optimizer heuristic just gives up. But that's speculation on my part. It might as well be that there's something funny going on in the filesystem (eg. caching). I highly doubt that physical directory access time doesn't completely trump any differences between the loops; file systems are much slower than even a virtual function call.

Still, pretty neat! Yeah there's been a couple times I was doing something and then realized or was told there's something in the standard library that could do the same thing.

Yeah I wasn't sure if it was ok to ask here, being not technically related to the original question.. I started to make a playground for my benchmark before realizing it's not supported, so I've added the code below. I have both the callback and the iterator count each time it hits a path.

The benchmark shows the iterator and callback don't count the same number of paths. In my project I have a separate unit test for both the iterator and the callback and they both hit the expected paths in a real directory. It's probably something I'm doing in the benchmark code.

Callback & Iterator benchmark
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use std::collections::VecDeque;
use std::path::{Path, PathBuf};
use std::io;
use std::fs::{self, create_dir_all, ReadDir};
use tempfile::TempDir;

macro_rules! try_opt {
    ($expr:expr) => {
        match $expr {
            Ok(value) => value,
            Err(error) => return Some(Err(error)),
        }
    }
}

pub fn walk_leaf_paths_iterator(path: &Path) -> io::Result<LeafWalker> {
    let dirs = VecDeque::new();
    let dir_reader = path.read_dir()?;
    
    Ok(LeafWalker { dirs, dir_reader, empty_dir: None })
}

pub struct LeafWalker {
    dirs: VecDeque<PathBuf>,
    dir_reader: ReadDir,
    empty_dir: Option<PathBuf>,
}

impl Iterator for LeafWalker {
    type Item = io::Result<PathBuf>;
    
    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match try_opt!(self.dir_reader.next().transpose()) {
                Some(entry) => {
                    self.empty_dir = None;

                    let ty = try_opt!(entry.file_type());
                    if ty.is_dir() {
                        self.dirs.push_front(entry.path());
                    }
                    if ty.is_file() {
                        return Some(Ok(entry.path()));
                    }
                }
                None => {
                    let next_dir = match self.dirs.pop_back() {
                         Some(leaf) => leaf,
                         None => return self.empty_dir.take().map(Ok),
                    };
                    self.dir_reader = try_opt!(next_dir.read_dir());

                    if let Some(leaf) = self.empty_dir.replace(next_dir) {
                        return Some(Ok(leaf));
                    }
                }
            }
        }
    }
}

pub fn walk_leaf_paths_callback<F>(path: PathBuf, mut func: F) -> io::Result<()>
where
    F: FnMut(PathBuf) -> io::Result<()>,
{
    let mut dirs = VecDeque::new();
    let mut dir_reader = path.read_dir()?;
    let mut empty_dir = None;

    loop {
        match dir_reader.next() {
            Some(entry) => {
                let cur_path = entry?.path();
                empty_dir = None;

                if cur_path.is_dir() {
                    dirs.push_front(cur_path);
                    continue;
                }
                if cur_path.is_file() {
                    func(cur_path)?;
                    continue;
                }
            }
            None => match dirs.pop_back() {
                Some(leaf) => {
                    dir_reader = leaf.read_dir()?;
                    if let Some(p) = empty_dir.replace(leaf) {
                        func(p)?;
                    }
                }
                None => {
                    if let Some(p) = empty_dir {
                        func(p)?;
                    }
                    break;
                }
            }
        }
    }

    Ok(())
}

fn bench_leaf_walkers(c: &mut Criterion) {
    // * Populate a temp dir

    let root = TempDir::new().unwrap();
    let root = root.path();
    
    fs::create_dir_all(root.join("a/foo")).unwrap();
    fs::create_dir_all(root.join("a/bar")).unwrap();
    fs::create_dir_all(root.join("b")).unwrap();
    fs::create_dir_all(root.join("c/qux/one")).unwrap();
    fs::create_dir_all(root.join("c/qux/two")).unwrap();
    fs::create_dir_all(root.join("d")).unwrap();
    fs::create_dir_all(root.join("e/f/g")).unwrap();
    
    fs::write(root.join("a/foo/file1.txt"), b"hello world").unwrap();
    fs::write(root.join("a/foo/file2.txt"), b"lorem ipsum").unwrap();
    fs::write(root.join("a/bar/file3.txt"), b"dolor sit amet").unwrap();
    fs::write(root.join("b/file4.txt"), b"consectetur").unwrap();
    fs::write(root.join("c/qux/two/file5.txt"), b"adespicit elit").unwrap();
    fs::write(root.join("c/qux/two/file6.txt"), b"anyone here").unwrap();
  
    // * Begin Bench

    let mut iter_leaf_count = 0;
    let mut callback_leaf_count = 0;

    let mut group = c.benchmark_group("Leaf Walk");
    group.bench_function("Callback", |b| {
        b.iter(|| walk_leaf_paths_callback(black_box(root.to_owned()), |_p| {
          Ok(callback_leaf_count += 1)
        }))
    });
    group.bench_function("Iterator", |b| {
        b.iter(|| {
            let walker = walk_leaf_paths_iterator(black_box(root.to_owned())).unwrap();
            for p in walker {
              iter_leaf_count += 1;
              // The callback version unwraps it internally,
              // so try to do the same work
              p.unwrap();
            }
        })
    });

    println!("\nCallback printed {} leaves.\nIter printed {} leaves.", callback_leaf_count, iter_leaf_count);

    group.finish();
}

criterion_group!(benches, bench_leaf_walkers);
criterion_main!(benches);

The code you posted doesn't compile, but I tried to fux it and it seems both methods hit the same number of leaves. RustExplorer supports tempfile as well as Criterion; you can try running the following code there:

/*
[dependencies]
tempfile = "*"
criterion = "*"
*/

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use std::collections::VecDeque;
use std::fs::{self, create_dir_all, ReadDir};
use std::io;
use std::path::{Path, PathBuf};
use tempfile::TempDir;

macro_rules! try_opt {
    ($expr:expr) => {
        match $expr {
            Ok(value) => value,
            Err(error) => return Some(Err(error)),
        }
    };
}

pub fn walk_leaf_paths_iterator(path: &Path) -> io::Result<LeafWalker> {
    let dirs = VecDeque::new();
    let dir_reader = path.read_dir()?;

    Ok(LeafWalker {
        dirs,
        dir_reader,
        empty_dir: None,
    })
}

pub struct LeafWalker {
    dirs: VecDeque<PathBuf>,
    dir_reader: ReadDir,
    empty_dir: Option<PathBuf>,
}

impl Iterator for LeafWalker {
    type Item = io::Result<PathBuf>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match try_opt!(self.dir_reader.next().transpose()) {
                Some(entry) => {
                    self.empty_dir = None;

                    let ty = try_opt!(entry.file_type());
                    if ty.is_dir() {
                        self.dirs.push_front(entry.path());
                    }
                    if ty.is_file() {
                        return Some(Ok(entry.path()));
                    }
                }
                None => {
                    let next_dir = match self.dirs.pop_back() {
                        Some(leaf) => leaf,
                        None => return self.empty_dir.take().map(Ok),
                    };
                    self.dir_reader = try_opt!(next_dir.read_dir());

                    if let Some(leaf) = self.empty_dir.replace(next_dir) {
                        return Some(Ok(leaf));
                    }
                }
            }
        }
    }
}

pub fn walk_leaf_paths_callback<F>(path: &Path, mut func: F) -> io::Result<()>
where
    F: FnMut(PathBuf) -> io::Result<()>,
{
    let mut dirs = VecDeque::new();
    let mut dir_reader = path.read_dir()?;
    let mut empty_dir = None;

    loop {
        match dir_reader.next() {
            Some(entry) => {
                let cur_path = entry?.path();
                empty_dir = None;

                if cur_path.is_dir() {
                    dirs.push_front(cur_path);
                    continue;
                }
                if cur_path.is_file() {
                    func(cur_path)?;
                    continue;
                }
            }
            None => match dirs.pop_back() {
                Some(leaf) => {
                    dir_reader = leaf.read_dir()?;
                    if let Some(p) = empty_dir.replace(leaf) {
                        func(p)?;
                    }
                }
                None => {
                    if let Some(p) = empty_dir {
                        func(p)?;
                    }
                    break;
                }
            },
        }
    }

    Ok(())
}

fn bench_leaf_walkers(c: &mut Criterion) {
    // * Populate a temp dir

    let root = TempDir::new().unwrap();
    let root = root.path();

    fs::create_dir_all(root.join("a/foo")).unwrap();
    fs::create_dir_all(root.join("a/bar")).unwrap();
    fs::create_dir_all(root.join("b")).unwrap();
    fs::create_dir_all(root.join("c/qux/one")).unwrap();
    fs::create_dir_all(root.join("c/qux/two")).unwrap();
    fs::create_dir_all(root.join("d")).unwrap();
    fs::create_dir_all(root.join("e/f/g")).unwrap();

    fs::write(root.join("a/foo/file1.txt"), b"hello world").unwrap();
    fs::write(root.join("a/foo/file2.txt"), b"lorem ipsum").unwrap();
    fs::write(root.join("a/bar/file3.txt"), b"dolor sit amet").unwrap();
    fs::write(root.join("b/file4.txt"), b"consectetur").unwrap();
    fs::write(root.join("c/qux/two/file5.txt"), b"adespicit elit").unwrap();
    fs::write(root.join("c/qux/two/file6.txt"), b"anyone here").unwrap();

    // * Begin Bench

    let mut iter_leaf_count = 0;
    let mut callback_leaf_count = 0;

    let mut group = c.benchmark_group("Leaf Walk");
    group.bench_function("Callback", |b| {
        b.iter(|| walk_leaf_paths_callback(black_box(root), |_p| Ok(callback_leaf_count += 1)))
    });
    group.bench_function("Iterator", |b| {
        b.iter(|| {
            let walker = walk_leaf_paths_iterator(black_box(root)).unwrap();
            for p in walker {
                iter_leaf_count += 1;
                // The callback version unwraps it internally,
                // so try to do the same work
                p.unwrap();
            }
        })
    });

    println!(
        "\nCallback printed {} leaves.\nIter printed {} leaves.",
        callback_leaf_count, iter_leaf_count
    );

    group.finish();
}

criterion_group!(benches, bench_leaf_walkers);
criterion_main!(benches);

Ah, sorry about that. In the code I posted, walk_leaf_paths() should've been walk_leaf_paths_callback() and Leafwalker::new() should've been walk_leaf_paths_iterator(), but you found those.

I think there's a discrepancy between rustexplorer.com and my local project. I ran your code and mine in each environment. They both count 9 paths in rustexplorer.com, but in my local project they both count hundreds of thousands. The cargo bench output also prints more verbose output for me. Maybe rustexplorer's not calling cargo bench?

The code is supposed to count 9 leaves. If it counts hundreds of thousands, then it's not looking inside the freshly-created temporary directory, but perhaps in the system root or some other place with many transitive child paths.

Given the benchmark code, any idea why that would happen if the temp directory is created and its path is passed to both functions?

Update
I set the respective count variables to 0 inside the callback to b.iter() inside each bench_function(), and that gave the expected count.

I'm not sure how Criterion works or the default settings, but my guess is that the counters were incrementing not just for each visited path, but for each path for x iterations over y "samples". Makes sense now that I think about it.