Regex, RegexSet or ends_with

I have a list of file endings and want to check if a filename given as &str
matches any of those file endings.

I see three possibilities. (Please note that in reality I have around 30 such file endings.)

  1. Loop over the file endings and check until one matches
pub fn is_file_ending(entry: &str) -> bool {
    // Each element is a plain &str, so no tuple destructuring is needed.
    for e in vec![".x", ".y"] {
        if entry.ends_with(e) {
            return true;
        }
    }

    false
}
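  2. Create a Regex and check whether it matches
use lazy_static::lazy_static;
use regex::Regex;

pub fn is_file_ending(entry: &str) -> bool {
    lazy_static! {
        // Compiled once on first use; `$` anchors the match at the end of the name.
        static ref RE: Regex = Regex::new(r"(\.x|\.y)$").unwrap();
    }

    RE.is_match(entry)
}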
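  3. Create a RegexSet and check whether any pattern matches
use lazy_static::lazy_static;
use regex::RegexSet;

pub fn is_file_ending(entry: &str) -> bool {
    lazy_static! {
        // One pattern per file ending, each anchored at the end of the name.
        static ref RE: RegexSet = RegexSet::new(&[
            r"\.x$",
            r"\.y$",
        ]).unwrap();
    }

    RE.is_match(entry)
}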

Which one should I use for best performance? I would assume that 3. is the worst method here,
but I am not sure whether 1. or 2. is better.

I'd recommend writing a small program that runs each option in a tight loop 10, 100, 1000, and 10000 times, times each run, and prints the results. Compile it in both debug and release mode, with and without ThinLTO and/or full LTO, and compare the results.

EDIT: If it matters enough to ask which is more efficient, it is probably worth taking the time to benchmark. If it isn't worth the time to benchmark, it probably isn't worth the time to ask the question. That's how I always treat these things, anyway.
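
For illustration, the tight-loop timing suggested above could be sketched roughly like this, using only the standard library. The names `ENDINGS`, `time_it`, and `run_sweep` are made up for this sketch, the two-element suffix list is a stand-in for the real one, and only option 1 is wired in; the Regex and RegexSet variants would be passed to `time_it` the same way. It is a rough harness, not a substitute for a proper benchmarking crate.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical suffix list; the real one would hold all the file endings.
const ENDINGS: &[&str] = &[".x", ".y"];

/// Option 1 from the question: linear scan with `ends_with`.
pub fn is_file_ending(entry: &str) -> bool {
    ENDINGS.iter().any(|e| entry.ends_with(*e))
}

/// Runs `f` `iters` times and returns the total elapsed wall-clock time.
/// `black_box` discourages the optimizer from removing the call.
pub fn time_it<F: FnMut() -> bool>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f());
    }
    start.elapsed()
}

/// Times the candidate at each loop count and returns (iterations, total time).
pub fn run_sweep() -> Vec<(u32, Duration)> {
    [10u32, 100, 1_000, 10_000]
        .iter()
        .map(|&n| (n, time_it(n, || is_file_ending(black_box("archive.y")))))
        .collect()
}
```

Calling `run_sweep()` from `main` and printing each total divided by its iteration count gives a per-call estimate. Expect the numbers to be noisy, especially at the small loop counts, and to differ a lot between debug and release builds.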

Yes, benchmark it. In particular, this is going to be strongly dependent on the number of suffixes you're checking and how big your haystacks are. In the specific example you have here, I'd actually expect (1) to be faster than either (2) or (3), since your data sizes are so small. But at some point, (2)/(3) should do better than (1). Where that crossover point is can only be found by benchmarking.

I would generally expect (2) and (3) to perform the same, but I've been hilariously wrong about such things before!

As I said in my first post, I have around 30 suffixes. But on double-checking, I see there are actually 86.

I did a benchmark using the bencher crate.

The trivial case with 1 file-ending pattern:

test bench_endswith ... bench:          42 ns/iter (+/- 6)
test bench_regex    ... bench:         234 ns/iter (+/- 18)
test bench_regexset ... bench:         233 ns/iter (+/- 20)

8 patterns:

test bench_endswith ... bench:         488 ns/iter (+/- 69)
test bench_regex    ... bench:         562 ns/iter (+/- 27)
test bench_regexset ... bench:       2,243 ns/iter (+/- 189)

9 patterns:

test bench_endswith ... bench:         527 ns/iter (+/- 64)
test bench_regex    ... bench:         244 ns/iter (+/- 17)
test bench_regexset ... bench:       2,335 ns/iter (+/- 232)

Around 85 file ending patterns:

test bench_endswith ... bench:       3,760 ns/iter (+/- 458)
test bench_regex    ... bench:         252 ns/iter (+/- 27)
test bench_regexset ... bench:      14,946 ns/iter (+/- 2,187)

It is very interesting to see that the Regex time stays roughly flat as the number of patterns grows, while ends_with and RegexSet both get slower.

I hope that (as a Rust beginner) I didn't do it too badly. Here is the link: Manfred Lotz / benchsearch · GitLab

What about extracting the extension and looking up in a set?

Won't work since some "extensions" actually match part of the base.
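
To make that objection concrete, here is a small sketch comparing a set lookup on the extension with a plain suffix check. The suffixes `_bak.txt` and `~` are invented for illustration (the real list is not shown in this thread), and the function names are hypothetical.

```rust
use std::collections::HashSet;
use std::path::Path;

/// Set lookup on the extension as reported by `Path::extension`,
/// i.e. the part after the last dot.
pub fn matches_by_extension(name: &str, exts: &HashSet<&str>) -> bool {
    Path::new(name)
        .extension()
        .and_then(|e| e.to_str())
        .map_or(false, |e| exts.contains(e))
}

/// Suffix check, as in option 1 of the original question.
pub fn matches_by_suffix(name: &str, suffixes: &[&str]) -> bool {
    suffixes.iter().any(|s| name.ends_with(*s))
}
```

With the suffix `_bak.txt`, the name `notes_bak.txt` has the extension `txt`, so an extension set would have to contain `txt` and would then also accept `notes.txt`, which the suffix check rejects. A suffix like `~` has no dot at all, so it cannot be expressed as an extension either.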

Hmm @BurntSushi, any idea why RegexSet is off by that much here?

Nope. Needs investigation. It might not be handling the is_match special case correctly? Dunno.