Learning rust: is this dir/file walker idiomatic?


#1

I’ve written my first rust program after ‘hello world’ and the grep example from the docs. I haven’t used the visitor pattern shown in the docs because I really do want a list of filenames. The code below works correctly, but it probably isn’t good idiomatic rust. Any comments on the rust style would be appreciated. (I don’t care about the algorithm as such.)

This is just the start for this program. My ultimate aim is to distribute slices of the list of files to as many threads as the machine has cores, and in each thread process all the files in the slice it is given. (The reason for keeping the file sizes is that I want to use this to help more evenly distribute the work.)

use std::env;
use std::fs::{self, DirEntry};
use std::path::Path;
use std::process;

struct FileInfo {
    filename: String,
    size: u64,
}

fn main() {
    let path = match env::args().skip(1).next() {
        Some(arg) => arg,
        None => {
            println!("path required");
            process::exit(1);
        },
    };
    let path = Path::new(&path);
    let mut files: Vec<FileInfo> = Vec::new();
    populate_files(path, &mut files, ".pdf");
    for file_info in files.iter() {
        println!("{} {}", file_info.filename, file_info.size);
    }
    println!("{} PDFs", files.len());
}

fn populate_files(dir: &Path, files: &mut Vec<FileInfo>, suffix: &str) {
    if dir.is_dir() {
        if let Ok(entries) = fs::read_dir(dir) {
            for entry in entries {
                if let Ok(entry) = entry {
                    if let Ok(file_type) = entry.file_type() {
                        if file_type.is_dir() {
                            populate_files(&entry.path(), files, suffix);
                        } else {
                            read_entry(&entry, files, suffix);
                        }
                    }
                }
            }
        }
    }
}

fn read_entry(entry: &DirEntry, files: &mut Vec<FileInfo>, suffix: &str) {
    if entry.path().to_string_lossy().to_lowercase().ends_with(suffix) {
        let path = entry.path();
        let mut size = 0;
        if let Ok(metadata) = entry.metadata() {
            size = metadata.len();
        }
        if let Some(filename) = path.to_str() {
            files.push(FileInfo{filename: filename.to_owned(),
                                size: size});
        }
    }
}

#2

There’s the walkdir crate, but that’s probably cheating.

A common trap you fall into when starting out is to use pattern matching everywhere, leading to excessive rightward drift (populate_files() is nested 6 or 7 levels down) and making code harder to understand. A lot of the time there’s a nice combinator (like Option::and_then() or Result::unwrap_or_default()) available which you will be much more expressive and readable than explicit pattern matching.

Rust also tends to be a fairly functional language, and as such you tend to have functions which do something and then return a result instead of a procedure which will mutate state. When applied on the large scale this really helps to make your program easier to reason about, plus because functions usually don’t have direct side-effects (they take some inputs and do stuff to them to make a bunch of outputs), it makes parallelizing the entire process quite trivial (check out rayon!).

The ? operator also makes it really nice to exit a function early if you hit an error, otherwise giving you the value if it was Ok. That may be fine for your use case given the most likely error will be permissions and not being able to read a directory’s contents, but if that’s the case it’s probably better for your particular program to bail instead of soldiering on because permission errors are a good hint that something is funny with your environment (e.g. executing from the wrong directory or running on the wrong files).

It’s also a bit of an antipattern to use a String when doing path operations. A string doesn’t really have any internal structure, whereas something like Path (or its owned version, PathBuf) can conceptually be thought of as an array of strings, where each element is a path segment. Using something more strongly typed like Path over String means you can skip all the string manipulations and also have access to a bunch of useful file-specific methods like checking if it exists or is a directory.

To give you an example of something more idiomatic, I made a quick example on the playpen which will generate a list of files in the directory (getting the size can be an exercise for the reader). It takes a pretty naive approach, but it’s pretty much the exact same thing I’d do in Python or any other language. You could also avoid the extra allocations by creating an iterator, but that may be overkill.


#3

Thanks for letting me know about walkdir (I had found the glob crate); for the real application I’ll use walkdir, but for learning I’ll keep going with doing it myself for now.

Thanks also for your find_files() function. I don’t see that using extend() is less mutating than push(), but clearly your design is much better than mine.

However, ISTM that your find_files() will return an Error if anything goes wrong at any time. (Or am I misunderstanding your code?)

What I’d like to get is a Vec<PathBuf> and ignore errors; or better still, something like Vec<Result<PathBuf, Error>>, so I can accumulate paths or errors as I go and then deal with them when I iterate over them later.

One minor point: the reason for using strings was to compare the file suffix since the code needs to work on Linux & case-insensitive Windows so I want to match the file suffix case-insensitively, but neither the PathBuf.extension() or .ends_with() methods seemed suitable.

Here’s an updated more functional version that tries to take on some of your ideas, but still uses Ok a lot:

use std::env;
use std::fs::{self, DirEntry};
use std::io::{Error, ErrorKind};
use std::path::{Path, PathBuf};
use std::process;

struct FileInfo {
    path: PathBuf,
    size: u64,
}

fn main() {
    let path = match env::args().skip(1).next() {
        Some(arg) => arg,
        None => {
            println!("path required");
            process::exit(1);
        },
    };
    let path = Path::new(&path);
    let files = acquire_files(path, ".pdf");
    for file_info in files.iter() {
        println!("{} {}", file_info.path.display(), file_info.size);
    }
    println!("{} PDFs", files.len());
}

fn acquire_files(root: &Path, suffix: &str) -> Vec<FileInfo> {
    let mut files = Vec::new();
    if root.is_dir() {
        if let Ok(entries) = fs::read_dir(root) {
            for entry in entries {
                if let Ok(entry) = entry {
                    if let Ok(file_type) = entry.file_type() {
                        if file_type.is_dir() {
                            files.extend(acquire_files(&entry.path(),
                                         suffix));
                        } else {
                            if let Ok(file_info) = read_entry(&entry,
                                                              suffix) {
                                files.push(file_info);
                            }
                        }
                    }
                }
            }
        }
    }
    files
}

fn read_entry(entry: &DirEntry, suffix: &str) -> Result<FileInfo, Error> {
    if entry.path().to_string_lossy().to_lowercase().ends_with(suffix) {
        let size = entry.metadata()?.len();
        return Ok(FileInfo{path: entry.path().to_path_buf(), size: size});
    }
    Err(Error::new(ErrorKind::Other, "suffix not matched"))
}

#4

I was more referring to how you pass in a &mut Vec as parameter and mutate that (returning nothing), instead of creating a completely new object and returning it to the user.

It’s a little more convoluted because filenames don’t necessarily need to be valid UTF-8 and Rust makes us take that into consideration, but what about something like this?

#[cfg(windows)]
fn matches(filename: &Path, ext: &str) -> bool {
  filename.extension()
    .and_then(|os_str| os_str.to_str())
    .map(|file_ext| file_ext.to_lowercase() == ext.to_lowercase())  
    .unwrap_or(false)
}

#[cfg(not(windows))]
fn matches(filename: &Path, ext: &str) -> bool {
  filename.extension()
    .and_then(|os_str| os_str.to_str())  // Option<OsStr> -> Option<&str>
    .map(|file_ext| file_ext == ext) // check 
    // either the file had no extension or it wasn't UTF-8, 
    // so we use a default of false
    .unwrap_or(false) 
}

There I’m using the various combinators on Option to apply a couple transforms before we can do the actual extension check.

This is where iterators make things really nice. For example, the WalkDir iterator from the walkdir crate will do the job of recursively walking a directory tree, and implements the Iterator trait, with each item being Result<DirEntry, Error>. You then use the various methods defined for Iterators to transform that into a list of paths only containing the things we want.

extern crate walkdir;

use std::path::{Path, PathBuf};
use walkdir::WalkDir;

fn find_files<P: AsRef<Path>>(root: P) -> Vec<PathBuf> {
    WalkDir::new(root)
        .into_iter()
        // if you want to keep errors, comment out this line and
        // update the subsequent operations to work with `Result` 
        // appropriately
        .filter_map(|entry: Result<DirEntry, Error>| entry.ok())  
        .filter(|entry: &DirEntry| entry.file_type().is_file())
        .map(|entry: DirEntry| entry.path().to_path_buf())
        // if you want, here's where you would `filter()` out anything
        // which doesn't have the desired file extension:
        // .filter(|path| matches(&path, "ext"))
        .collect()
}

(I’ve included the type at each step along the way for demonstration purposes, but type inference means none of it is actually necessary)

That example is more what I mean when I say “functional” programming. You essentially apply consecutive operations to your inputs, filtering out the things you don’t want and transforming them from one form to another with map. As a bonus, you’d be hard pressed to write more performant code in a procedural way because of how each operation is a zero cost abstraction (e.g. filter() and friends effectively compile down to a for loop + if statement).


#5

It’s also not overly difficult to implement your own version of WalkDir.


#6

Thanks again for your help!

At first I couldn’t get your matches() function to work but that was because I was looking for an extension that included the leading dot:-(

Now, using your matches() functions (renamed has_suffix()), and using the walkdir crate, I’ve got the essence of it down to this:

fn acquire_files(root: &Path, suffix: &str) -> Vec<FileInfo> {
    WalkDir::new(root)
        .into_iter()
        .filter_map(|entry| entry.ok())
        .filter_map(|entry| fileinfo_for_entry(&entry, suffix))
        .collect()
}

fn fileinfo_for_entry(entry: &walkdir::DirEntry, suffix: &str) -> Option<FileInfo> {
    if !entry.file_type().is_file() || !has_suffix(entry.path(), suffix) {
        return None
    }
    let size = entry.metadata().and_then(|metadata| Ok(metadata.len())).unwrap_or(0);
    Some(FileInfo{path: entry.path().to_path_buf(), size: size})
}

Your hand-made DirWalker is v. clever; too advanced for me at the moment! (I wasn’t aware of the rust playground; and have already found it useful.)

I’m finding rust’s learning curve quite steep. I’m up to 15.5 in the 2nd ed. of the manual, and hope to reach the end by Xmas. I’ve got a bunch of egs I want to port as part of learning & then a ‘real’ (small) app that I want to compare performance-wise with Python 3 & Go. (The Go version is much slower than Python + multiprocessing on Windows, so I’m hoping that Rust will be faster than both.)


#7

Sorry if I over-complicated things!

Basically an iterator is anything implementing the Iterator trait (i.e. you have a next() method, like in Python). What I’ve done is convert the recursive algorithm we were previously using into its iterative equivalent, storing all intermediate values inside the DirWalker struct. So the algorithm looks something like this:

  • while stack is not empty:
    • pop the next item
    • use dir_entry.read_dir() to get any items it may contain
    • append all those sub-items to the stack
    • yield the original item

So keeping that general algorithm in mind, lets look at the next() method from the previous example and pick it apart.

The DirWalker::new() constructor is just in charge of seeding that stack with some initial values. Because read_dir() returns an iterator over the items in a directory, we use collect() to collect those items into our initial stack (items).

    pub fn new(root: &Path) -> Result<DirWalker, Error> {
        let items = root.read_dir()?;

        Ok(DirWalker {
            items: items.collect(),
        })
    }

On every iteration we’ll call next() (line 23) and immediately pop a value from our stack of items (line 25). We use pattern matching and an if let to do one thing if we got a value, otherwise we jump to the else branch (line 40) and return None (i.e. the iterator has been exhausted).

    fn next(&mut self) -> Option<Self::Item> {
        // pop the next entry off the top of our `items` stack
        if let Some(next_item) = self.items.pop() {
          ...
        } else {
            None
        }
    }

Next we try to get any sub-items that this current item may contain, but because we’re keeping errors around (items is a Vec of Result<DirEntry, Error>) we check if we got a DirEntry (as opposed to an Error) and bind that to the entry variable (line 33).

            if let Ok(ref entry) = next_item {

The ref keyword on line 33 just says we want to take entry by reference instead of taking ownership of it. If we didn’t have this the compiler would complain on line 39 because we’re trying to use a moved variable.

On line 34 we call entry.path().read_dir() to get the list of sub-items, but because this may fail we need to use another if let because we only care about the happy case. Then we can finally add the sub-items to our stack (line 35) so we can check them later.

                if let Ok(sub_entries) = entry.path().read_dir() {
                    self.items.extend(sub_entries);
                }

Finally, we return the original item (line 39).

            Some(next_item)

Hopefully that’s a little easier to wrap your head around. Let me know if I’ve been ambiguous or confusing at any point, the learning curve for Rust is a bit steeper than other languages because it forces you to think about things like error handling and memory management up front instead of glossing over them only to blow up at runtime. A nice side-effect of this is that typically if your program compiles it’ll be correct and relatively bug-free (modulo logical bugs) :slightly_smiling_face:


#8

Thanks, I think I get it now!

I’ve ‘finished’ reading the manual (well, I skimmed the advanced stuff). Next I’m going to work on an example that will help me learn some of the collection types. Then I’ll be back to the dir walking, this time to do some actual processing, hopefully using the rayon library you pointed me to:-)