Implementing GLOB in Rust

Recently I migrated to Rust from C. I still struggle to gain a good performance and my code looks ugly.

use std::{env::current_dir, fs::ReadDir, path::PathBuf};
struct Glob {
    parent: Option<PathBuf>,
    dir: Option<ReadDir>,
    before: String,
    after: String,
}
impl Glob {
    fn from(str: &str) -> Self {
        let mut parent = PathBuf::from(str);
        if let Some(file_name) = parent.file_name()
            && let file_name = file_name.display().to_string()
            && let Some((before, after)) = file_name.split_once('*')
        {
            parent.pop();
            let dir = if parent.has_root() {
                parent.read_dir()
            } else {
                current_dir().unwrap_or_default().join(&parent).read_dir()
            }
            .unwrap();
            let before = before.to_string();
            let after = after.to_string();
            Glob {
                parent: Some(parent),
                dir: Some(dir),
                before,
                after,
            }
        } else {
            Glob {
                parent: Some(parent),
                dir: None,
                before: String::new(),
                after: String::new(),
            }
        }
    }
}

impl Iterator for Glob {
    type Item = String;

    fn next(&mut self) -> Option<Self::Item> {
        if let Some(dir) = &mut self.dir {
            let pattern_len = self.before.len() + self.after.len();
            loop {
                match dir.next() {
                    None => break None,
                    Some(entry) => {
                        if let Ok(entry) = entry {
                            let file_name = entry.file_name().display().to_string();
                            if file_name.len() > pattern_len
                                && file_name.starts_with(&self.before)
                                && file_name.ends_with(&self.after)
                            {
                                if let Some(parent) = &self.parent {
                                    let mut parent = parent.clone();
                                    parent.push(entry.file_name());
                                    break Some(parent.display().to_string());
                                } else {
                                    break Some(file_name);
                                }
                            } else {
                                continue;
                            }
                        }
                    }
                }
            }
        } else if let Some(parent) = &self.parent {
            let res = parent.display().to_string();
            self.parent = None;
            Some(res)
        } else {
            None
        }
    }
}
fn main() {
    for item in Glob::from("*") {
        println!("{item}")
    }
}

What are basic code improvement tips you could give?

if I read your code correctly, this program only support single wildcard and it must be in the last component of the path (i.e. the file name part)?

in other words, patterns like these does not work properly, right?

  • ./session-20260401*/error.log
  • /home/*/.rustup/toolchains/*,
  • /tmp/systemd-*/

also, if the pattern does not contain a wildcard, your program simply returns the pattern verbatim without touching the filesystem, no matter if the path actually exists or not, am I understanding this behavior right?

I don't know how you compare the performance, this program is very likely io bound. I think most of the time is spent in the readdir libc function or system call, the user space code should not be affect the overall run time very much.

some remarks:

  • you are constantly doing str and OsStr conversions (because Path is wrapper over OsStr) and utf8 validations. although I don't think it has a huge performance penalty, still, it is not zero.
  • I think the loop inside the Iterator::next() can be replaced with Iterator::find() or Iterator::find_map().
  • actually, large part of the Iterator::next() logic is just the Iterator::filter_map() abstraction

so here's my attempt to replicate the functionality of the original code. I don't know how the performance compares, but in terms of complexity, it is way simpler, at least to me subjectively. but hey, I'm biased because I wrote this code. playground

use std::fs::read_dir;
use std::path::Path;
use std::path::PathBuf;

/// hidden type for the returned `impl Iterator`
/// need this enum for the special case of unsupported patterns
enum Either<A, B> {
	A(A),
	B(B),
}

/// boilterplates
impl<A, B, T> Iterator for Either<A, B>
where
	A: Iterator<Item = T>,
	B: Iterator<Item = T>,
{
	type Item = T;

	fn next(&mut self) -> Option<Self::Item> {
		match self {
			Either::A(inner) => inner.next(),
			Either::B(inner) => inner.next(),
		}
	}
}

fn glob(pattern: &str) -> impl Iterator<Item = PathBuf> {
	// if no wildcard, no need to touch the filesystem
	let Some((before, after)) = pattern.split_once('*') else {
		return Either::A(std::iter::once(PathBuf::from(pattern)));
	};

	// wildcard not in the last segment, not supported, bail
	if after.contains('*') {
		return Either::A(std::iter::once(PathBuf::from(pattern)));
	}

	// use `split_at()` so that `parent` has a trailing `/`.
	// alternatively, `rsplit_once('/')` can be used,
	// but a pattern starting with `/*` become a special case
	let (read_dir, parent, before) = match before.rfind('/').map(|i| before.split_at(i + 1)) {
		Some((parent, before)) => (read_dir(parent), Path::new(parent), before),
		None => (read_dir("."), Path::new(""), before),
	};

	// failed to read parent directory, bail
	let Ok(read_dir) = read_dir else {
		return Either::A(std::iter::once(PathBuf::from(pattern)));
	};

	let min_len = before.len() + after.len();

	Either::B(read_dir.filter_map(move |entry| {
		entry.ok().and_then(|entry| {
			let file_name = entry.file_name();
			if file_name.len() >= min_len
				&& file_name.as_encoded_bytes().starts_with(before.as_bytes())
				&& file_name.as_encoded_bytes().ends_with(after.as_bytes())
			{
				Some(parent.join(file_name))
			} else {
				None
			}
		})
	}))
}

major differences compared to the original:

  • the iterator yields PathBuf instead of String, the cost of utf8 conversion shifts to the caller. don't pay for what you don't use.
  • the prefix/suffix/length check is done in raw bytes (OsStr::as_encoded_bytes()), no utf8 validation is performed for intermediate values
  • use an opaque impl Iterator instead of a named type, it's impossible to name the real type because closures are used (for filter_map(), flat_map(), etc).
  • hardcoded unix path separator /
    • original code uses standard library path operations like PathBuf::pop(), Path::has_root(), current_dir(), etc, which I assume is meant to support windows paths too, but I didn't bother

Right. It would be a cool feature, but not for my multi purpose adapter.

Correct, this adapter makes only a guess that it should be a real file path, but it could be something else.

You are right, so I changed to OsString to eliminate exceeding type conversions.

It isn't a big deal, although you are right, the code targets mostly Windows platform, because Unix already has the code as a part of shell implementation. Anyway, you can release the hard coding and use MAIN_SEPARATOR instead.

Thank you for the valuable input and your version of the implementation, it's sleek and cool.

I too will assume you want zero deps. But instead of presenting a packaged result, I'll walk through making some changes, to illustrate the process of spotting and making iterative improvements. I mostly[1] try to preserve current behavior -- so the result isn't necessarily the "best solution" so much as just cleaner code (IMO).


If we look at the constructor and iterator implementation, we can see that either

  • The constructor sets parent and dir to Some
    • In which case the iterator never sets either to None[2]
  • Or the constructor sets parent to Some and dir to None
    • In which case the iterator returns one item[3]

So an easy first change in next is to get rid of the if-else chain and early return if dir is None, getting rid of one nesting level.

    fn next(&mut self) -> Option<Self::Item> {
        let Some(dir) = &mut self.dir else {
            return self.parent.take().map(|p| p.display().to_string());
        };
    
        let pattern_len = self.before.len() + self.after.len();
        // ...

Next, this is basically a for loop with a filter:

        loop {
            match dir.next() {
                None => break None,
                Some(entry) => {
                    if let Ok(entry) = entry { ... } // no `else`

So we can use for to remove multiple levels of nesting.

        for entry in dir.filter_map(|e| e.ok()) {
            let file_name = entry.file_name().display().to_string();
            // ... no `continue`,  `break` becomes `return`, `None` after `for`

And get rid of one more if-else by utilizing the logical invariant that parent is Some if we can reach the for loop.

        let Some(parent) = &self.parent else { unreachable!() };
        for entry in dir.filter_map(|e| e.ok()) {

We've only really looked at flow and I think this is already a significant improvement.


Now thinking about the logic that's left in next, this is basically another map (we only use entry.file_name()) and filter. It's easy to only call file_name() once, which is an improvement. But (given the restrictions of the constructor) we can also avoid converting to String to do the check: the only way the file name can start with begin is if its encoded bytes starts with the bytes of begin, and similarly for end.[4]

Let's make this a standalone function as well.

fn matches(before: &str, after: &str, file_name: &OsStr) -> bool {
    let bytes = file_name.as_encoded_bytes();
    // You had `>` but I think that was just a bug
    bytes.len() >= before.len() + after.len()
        && bytes.starts_with(before.as_bytes())
        && bytes.ends_with(after.as_bytes())
}

At this point, I think we might as well return PathBuf too, as we no longer need to convert to String or str for the comparison.

impl Iterator for Glob {
    type Item = PathBuf;
    fn next(&mut self) -> Option<Self::Item> {
        let Some(dir) = &mut self.dir else {
            return self.parent.take();
        };

        let Some(parent) = &self.parent else { unreachable!() };
        dir.filter_map(|e| e.map(|e| e.file_name()).ok())
            .filter(|file| matches(&self.before, &self.after, &file))
            .map(|file_name| parent.join(file_name))
            .next()
    }
}

We haven't try to improve the constructor yet, so let's do that now. For starters, we know all the components are valid UTF-8 as our input was a &str, and we can work with &Path instead of PathBuf and String in many cases.

        let parent: &Path = str.as_ref();
        if let Some(file_name) = parent.file_name()
            && let file_name = file_name.to_str().unwrap()
            && let Some((before, after)) = file_name.split_once('*')
        {
            let parent = parent.parent().unwrap_or(parent);

I know I said I would try to preserve behavior, but if we think about how you call read_dir...

Sidebar about unwrap_or_default and join
            let dir = if parent.has_root() {
                parent.read_dir()
            } else {
                current_dir().unwrap_or_default().join(parent).read_dir()
            }
            .unwrap();

If current_dir() fails, IMO you should also just fail -- something's probably pretty wrong for that to happen (and trying to read the current directory will probably fail too). There is also a subtle inconsistency here: if parent is empty, pushing it onto the default will leave you empty (""). But probably you wanted to fall back to ".". Or you could just use "." to begin with (which is load bearing if parent is empty).

So I think these are better.

            let dir = if parent.has_root() {
                parent.read_dir()
            } else {
                current_dir().unwrap().join(parent).read_dir()
            }
            .unwrap();
            let dir = if parent.has_root() {
                parent.read_dir()
            } else {
                Path::new(".").join(parent).read_dir()
            }
            .unwrap();
            let dir = match parent.as_os_str().is_empty() {
                true => Path::new(".").read_dir(),
                false => parent.read_dir(),
            }.unwrap();

Also while we're here, we can clean up a little more by making our two use cases explicitly separated (so we don't need unreachable!() in next for example).

struct GlobDir {
    dir: ReadDir,
    parent: PathBuf,
    before: String,
    after: String,
}

impl Iterator for GlobDir { ... }

enum GlobInner {
    Parent(Option<PathBuf>),
    GlobDir(GlobDir),
}

pub struct Glob {
    glob: GlobInner,
}

impl Iterator for Glob {
    type Item = PathBuf;
    fn next(&mut self) -> Option<Self::Item> {
        match &mut self.glob {
            GlobInner::Parent(parent) => parent.take(),
            GlobInner::GlobDir(gd) => gd.next(),
        }
    }
}

That's another good stopping point. Finally, note how we call to_owned() on everything but dir in the constructors in from. All those borrows came from the input &str. So if you don't mind the input &str remaining borrowed by the returned iterator, you can also get rid of those allocations. (With more boilerplate you could support both use cases.)


In the collapsed section, I talked about some of the subtleties around working with the std Path[Buf] types/methods. There's a lot more that could be said about this, and corner cases when working with parent, pop, file_name, and friends -- what they do does not always line up one-to-one. For instance a push followed by a pop can result in a different path (semantically, not just textually).

Not wanting to get into those weeds was part of why I (mostly) ignored potentially unwanted behavior.

One odd case you have that could be easily handled is an empty input &str.

Another alternative is to make from fallible (return Result).


  1. but not entirely ↩︎

  2. and break Some(file_name) is logically dead code ↩︎

  3. parent.display().to_string() ↩︎

  4. Technically this is a change in behavior -- for the better, as we're now comparing to the actual file name, and not its display() representation. ↩︎

Rust programmers are like a magician who pulled some magic grinder from the sleeve, which transforms mountains of useless C style code in a couple line of functional Rust blend.