Is my benchmark valid? (Testing if matching an enum with 10 variants is faster than matching an enum with 5 variants that contain an enum with two variants)

Hi,

So I'm wondering whether it makes more sense to create one big enum with all variants at one level, or an outer enum for the categories of the data, with "sub-enums" as the values of its variants.

So basically:

enum C10 {A, B, C, D, E, F, G, H, I, J}

vs.


enum C5x2 {
    A(C5x2A),
    B(C5x2B),
    C(C5x2C),
    D(C5x2D),
    E(C5x2E), 
}

enum C5x2A {A, B}
enum C5x2B {A, B}
enum C5x2C {A, B}
enum C5x2D {A, B}
enum C5x2E {A, B}

I expected the second approach to be faster, but in my benchmark there basically is no difference:

match/combined_10       time:   [1.3216 ns 1.3698 ns 1.4248 ns]                               
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
match/splitted_5x2      time:   [1.3283 ns 1.3436 ns 1.3608 ns]                                
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

I'm now wondering if my benchmark code is correct (which you can see here).

Is there maybe only a difference with a higher number of variants?

In my actual code, I will have more than 150 variants, when combined in one enum. Should I create a benchmark with more variants?

Also: I have to convert a text representation of these variants to the corresponding variant. Can I be sure that the second approach would be faster here (because the compiler can't optimize string matching as much as enum variant matching), or should I create a benchmark for this as well?

Your benchmark isn't valid, because your code doesn't have any side effects. LLVM aggressively removes code that does useless work, so all the match statements are likely completely removed from your benchmark.

You need to make your functions compute something which depends on actually using these enums, and which the compiler can't do at compile time.
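For instance, a kernel along these lines (a sketch, not the original benchmark code) makes every arm contribute a distinct value, so the result depends on the whole input and the matches can't be deleted as dead code:

```rust
enum C10 { A, B, C, D, E, F, G, H, I, J }

// Each arm returns a different integer; summing them makes the result
// depend on every input element, so LLVM can't remove the matches.
fn sum_combined(inputs: &[C10]) -> u32 {
    inputs
        .iter()
        .map(|v| match v {
            C10::A => 0, C10::B => 1, C10::C => 2, C10::D => 3, C10::E => 4,
            C10::F => 5, C10::G => 6, C10::H => 7, C10::I => 8, C10::J => 9,
        })
        .sum()
}

fn main() {
    assert_eq!(sum_combined(&[C10::A, C10::E, C10::J]), 13);
}
```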


@kornel: Thank you. Would implementing Debug and then returning a string representation from the match arms achieve that?

In general (and setting aside the separate complexities of benchmarking) I would expect the two-level implementation to be faster if and only if there are functions in the program that take an argument of a type like C5x2A. With the first implementation, the program would have to check at run time that the input is valid for that function; with the second, that check is already done at compile time.
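For example (a sketch using the names from the original post), a function that only makes sense for the "A" category can take the sub-enum directly, so invalid inputs are ruled out by the type system rather than by a runtime check:

```rust
enum C5x2A { A, B }

// Callers can only ever pass the two C5x2A variants; the validity check
// that would otherwise be a runtime match guard happens at compile time.
fn handle_a(v: C5x2A) -> u32 {
    match v {
        C5x2A::A => 1,
        C5x2A::B => 2,
    }
}

fn main() {
    assert_eq!(handle_a(C5x2A::A), 1);
    assert_eq!(handle_a(C5x2A::B), 2);
}
```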

A Debug impl may do a lot of work and be much more expensive than the matching you're testing, so it'd be hard to get a meaningful number out of the measurement noise. You could do light work instead, e.g. make each arm return a different integer and sum those integers up.

Make the enums come from outside of the benchmarked function. If they're in a const array, the compiler can evaluate the match and the loop at compile time instead. Rust does that a lot. For example:

return (0..1000).map(|x| x * 2).sum();

has no iterators, no closures, no math. It compiles to

return 999000;

Use black_box both on the input and the output. That way LLVM can't remove all the code, but you don't end up measuring the work done by Debug, which I suspect is orders of magnitude slower than the matching, which LLVM can be very good at.
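A minimal sketch of that idea using std::hint::black_box (stable since Rust 1.66; Criterion provides its own black_box for the same purpose) with a placeholder computation:

```rust
use std::hint::black_box;

// Stand-in for the work being benchmarked.
fn work(x: u32) -> u32 {
    x.wrapping_mul(3).wrapping_add(1)
}

fn main() {
    // black_box on the input prevents constant folding; black_box on the
    // output prevents the whole computation from being eliminated as dead.
    let out = black_box(work(black_box(7)));
    assert_eq!(out, 22);
}
```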


Wow, I made the changes (hopefully correctly... :wink:) and the second approach is much slower:

     Running target/release/deps/match-382f75f363db390f
Gnuplot not found, using plotters backend
match/combined/1        time:   [4.4023 ns 4.4088 ns 4.4162 ns]                              
                        change: [-14.377% -12.631% -10.894%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) high mild
  5 (5.00%) high severe
match/splitted/1        time:   [14.320 ns 14.353 ns 14.388 ns]                              
                        change: [-1.5573% -0.9014% -0.2515%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low severe
  6 (6.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe
match/combined/10       time:   [31.935 ns 32.017 ns 32.128 ns]                               
                        change: [-0.0272% +0.5210% +1.0540%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) high mild
  10 (10.00%) high severe
match/splitted/10       time:   [136.43 ns 136.81 ns 137.22 ns]                              
                        change: [-1.1140% -0.6430% -0.2014%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) low mild
  6 (6.00%) high mild
  4 (4.00%) high severe
match/combined/100      time:   [359.81 ns 361.02 ns 362.51 ns]                               
                        change: [-0.1965% +0.2309% +0.6455%] (p = 0.32 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe
match/splitted/100      time:   [1.3026 us 1.3066 us 1.3102 us]                                
                        change: [-2.2253% -1.5760% -1.0149%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe


This was completely unexpected to me... :slight_smile:

Can I expect a similar improvement when I, for example, combine two levels of matching with a tuple?

Something like:

match foo {
    Foo::A(x) => match x {
        X::A => todo!(),
        X::B => todo!(),
    },
    Foo::B(x) => match x {
        X::A => todo!(),
        X::B => todo!(),
    },
}

vs.

match (foo, x) {
    (Foo::A, X::A) => todo!(),
    (Foo::A, X::B) => todo!(),
    (Foo::B, X::A) => todo!(),
    (Foo::B, X::B) => todo!(),
}

Or do I always have to benchmark?
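As a sketch (placeholder return values, and assuming the payload-carrying Foo from the first snippet): the flat form can also be written with nested patterns instead of a tuple, which avoids binding x first and re-matching it:

```rust
enum X { A, B }
enum Foo { A(X), B(X) }

// Flat match with nested patterns; semantically the same as the
// two-level match, expressed as a single decision.
fn classify(foo: Foo) -> u32 {
    match foo {
        Foo::A(X::A) => 0,
        Foo::A(X::B) => 1,
        Foo::B(X::A) => 2,
        Foo::B(X::B) => 3,
    }
}

fn main() {
    assert_eq!(classify(Foo::B(X::A)), 2);
}
```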

@aDotInTheVoid: Thanks, I have also used black_box above, and it definitely changed something (the benchmarks got slower).

In general (and setting aside the separate complexities of benchmarking) I would expect the two-level implementation to be faster if and only if there are functions in the program that take an argument of a type like C5x2A. With the first implementation, the program would have to check at run time that the input is valid for that function; with the second, that check is already done at compile time.

Do you mean, if I pass the value within the match arm to a function?

In this case (in my actual code), I use the enum to wrap several other types, to be able to pass a value (which contains these other types) to a function (similarly to trait objects, but with access to the underlying data without downcasting).

Yes. Or maybe more generally: if there are parts of your code where a value can only be a subset of your enum values, it may make sense (both for correctness and for optimization) to put that subset in a separate enum and wrap that in one variant of the original enum.

One thing to note is that C5x2 is bigger than C10: 2 bytes instead of just 1. If you ever have data fields in the enums, that difference may grow, as the tag gets padded out to the payload's alignment.
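That can be checked directly with std::mem::size_of (exact layouts aren't guaranteed by the language, but this is what current rustc produces; C5x2A is reused for all payloads here for brevity):

```rust
use std::mem::size_of;

#[allow(dead_code)]
enum C10 { A, B, C, D, E, F, G, H, I, J }
#[allow(dead_code)]
enum C5x2A { A, B }
#[allow(dead_code)]
enum C5x2 { A(C5x2A), B(C5x2A), C(C5x2A), D(C5x2A), E(C5x2A) }

fn main() {
    assert_eq!(size_of::<C10>(), 1);  // one tag byte
    assert_eq!(size_of::<C5x2>(), 2); // tag byte + one payload byte
}
```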

@kaj: Okay, thanks. I believe that this isn't the case in my specific use case, but I will keep this in mind for the future.

@cuviper: Right, I didn't think about memory. (On the other hand, this particular code is quite memory-inefficient anyway: I store 150+ different types directly in this enum, so every variant uses as much memory as the biggest type, if I understand correctly. The data should be mostly short-lived, though, so it shouldn't matter much.)

To answer my own question:

I have to convert a text representation of these variants to the corresponding variant. Can I be sure that the second approach would be faster here (because the compiler can't optimize string matching as much as enum variant matching), or should I create a benchmark for this as well?

...in case someone is interested:

I created a benchmark with the real-world data that I will use.

Basically, I have a string of the format "$domain.$event" and I thought it would be faster to split the string on '.' and then first match on the domain, and then within the domain match arm, match on the event.

So:

        result += match x {
            &"Debugger.breakpointResolved" => 0,
            // many more...
        }

vs:

        let mut splitted = x.split('.');
        let domain = splitted.next().unwrap(); // Can never fail
        let event = splitted.next().unwrap(); // Can fail with external input
        result += match domain {
            "Debugger" => match event {
                "breakpointResolved" => 0,
                // more...
            },
            // many more...
        }
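For reference, a self-contained miniature of the two approaches (only the first event name is from the real data; the other names and the return values are illustrative placeholders):

```rust
// Approach 1: one flat match over the full "$domain.$event" string.
fn combined(x: &str) -> i32 {
    match x {
        "Debugger.breakpointResolved" => 0,
        "Debugger.paused" => 1,
        "Network.loadingFailed" => 2,
        _ => -1,
    }
}

// Approach 2: split on '.', match the domain, then the event.
fn split_then_match(x: &str) -> i32 {
    let mut parts = x.split('.');
    let domain = parts.next().unwrap();     // split always yields one item
    let event = parts.next().unwrap_or(""); // may be absent in bad input
    match domain {
        "Debugger" => match event {
            "breakpointResolved" => 0,
            "paused" => 1,
            _ => -1,
        },
        "Network" => match event {
            "loadingFailed" => 2,
            _ => -1,
        },
        _ => -1,
    }
}

fn main() {
    for s in ["Debugger.breakpointResolved", "Debugger.paused", "nope"] {
        assert_eq!(combined(s), split_then_match(s));
    }
}
```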

I was sure that the second approach would be much faster than the first one.

...but I was completely wrong... :laughing:

The first approach is 7 times faster:

string match/combined/10                                                                             
                        time:   [8.1844 us 8.2020 us 8.2235 us]
                        change: [-2.5612% -1.7779% -1.0984%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe
string match/splitted/10                                                                            
                        time:   [57.714 us 57.815 us 57.925 us]
                        change: [+0.9717% +1.6919% +2.3162%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Which is amazing, because the second approach was more difficult to implement.

If someone has an idea about how to make the code faster, I'd be interested in benchmarking that version.

The benchmark code can be found here, if someone is interested to have a look.

A perfect hash map might be faster, but it's always worth benchmarking. Also, just a tip: run Clippy. It'll catch smells like &Vec<T>.

Thanks for the suggestions, @aDotInTheVoid!

Unfortunately, it seems PHF can only be used with &'static str as keys (my strings are coming from JSON data):

error[E0621]: explicit lifetime required in the type of `input`
   --> benches/string_match.rs:493:26
    |
490 | fn combined_phf(input: &Vec<&str>){
    |                        ---------- help: add explicit lifetime `'static` to the type of `input`: `&std::vec::Vec<&'static str>`
...
493 |         result += EVENTS.get(x).unwrap();
    |                          ^^^ lifetime `'static` required

error: aborting due to previous error

Regarding Clippy: When I run cargo clippy, nothing is shown:

[dh@arch rust]$ cargo clippy
    Finished dev [unoptimized + debuginfo] target(s) in 0.04s
[dh@arch rust]$ 

I think there is a problem with my clippy installation (which was installed via 'rustup').

Actually, the problem was a bug in my code :laughing:

rust-phf is a bit faster than my second approach, but still much slower than using one flat match statement:

string match/combined/10                                                                             
                        time:   [8.9594 us 9.0066 us 9.0581 us]
                        change: [+3.3960% +4.2953% +5.1529%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
string match/combined_phf/10                                                                             
                        time:   [45.844 us 46.109 us 46.397 us]
                        change: [-1.5426% -0.7794% +0.0650%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
string match/splitted/10                                                                            
                        time:   [57.938 us 58.290 us 58.624 us]
                        change: [-0.0180% +0.7528% +1.4636%] (p = 0.05 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Anyway, still thanks for your suggestion! :slight_smile: