Missing Enum optimization?

I have a simple enum that is frequently tested for equality. My expectation was that Rust could optimize this to a cheap 16-bit comparison (cmpw). However, looking at the asm generated on Compiler Explorer, I see that it is actually doing a much more expensive operation and using a lookup table. Is there a reason Rust is not optimizing this? Is there a way to get the compiler to optimize this effectively?
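
Roughly this shape, with made-up names for illustration (each variant carries a u8 payload, and equality is derived):

// Illustrative stand-in for the real enum: a handful of variants,
// each carrying a u8 payload, with PartialEq derived.
#[derive(PartialEq, Eq, Clone, Copy)]
pub enum Flag {
    A(u8),
    B(u8),
    C(u8),
    D(u8),
}

pub fn compare_flags(x1: Flag, x2: Flag) -> bool {
    x1 == x2
}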

2 Likes

I'm on mobile right now and can't test it myself:
Out of curiosity, what happens if the enum has 255 variants, so that there is no niche left for the compiler to worry about?

Otherwise, maybe it would make sense to create a bug ticket for rustc? (But maybe this is more of an LLVM problem, I don't know.)

2 Likes

I'm not entirely sure, but the most logical fix would be that, rather than using a u8 to represent the variants, the compiler expands the discriminant to a larger unsigned integer type.

It does appear to be using a u8 for the discriminant.
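
One quick way to double-check that from the Rust side, using the illustrative Flag enum sketched above (a sanity check, not a layout guarantee):

// Layout check for the illustrative Flag enum: one byte of discriminant
// plus one byte of payload, so the whole value fits in two bytes.
#[test]
fn flag_is_two_bytes() {
    assert_eq!(std::mem::size_of::<Flag>(), 2);
    assert_eq!(std::mem::align_of::<Flag>(), 1);
}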

Interestingly, these two functions also compile to something vastly different:

pub fn compare_u16(x1: u16, x2: u16) -> bool {
    x1 == x2
}

pub fn compare_u8_tuple(x1: (u8,u8), x2: (u8,u8)) -> bool {
    x1 == x2
}

As expected, the version that compares tuples containing two u8 values is nearly 35%(!) slower than the one for u16:

cmp_2xu8                time:   [1.5509 ns 1.5516 ns 1.5522 ns]                      
                        change: [-0.2705% -0.1227% +0.0179%] (p = 0.10 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  6 (6.00%) low mild
  6 (6.00%) high mild
  4 (4.00%) high severe

cmp_u16                 time:   [1.1459 ns 1.1460 ns 1.1462 ns]                     
                        change: [+0.0027% +0.0625% +0.1142%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 17 outliers among 100 measurements (17.00%)
  3 (3.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  8 (8.00%) high severe

Benchmarked using the following Criterion code:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

pub fn compare_u16(x1: u16, x2: u16) -> bool {
    x1 == x2
}

pub fn compare_u8_tuple(x1: (u8, u8), x2: (u8, u8)) -> bool {
    x1 == x2
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("cmp_2xu8", |b| {
        b.iter(|| {
            black_box(compare_u8_tuple(
                (black_box(20), black_box(20)),
                (black_box(19), black_box(19)),
            ))
        })
    });
    c.bench_function("cmp_u16", |b| {
        b.iter(|| black_box(compare_u16(black_box(65000), black_box(20))))
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
2 Likes

For the record: expanding the enum to have 254/255 variants did not help.

2 Likes

The (u8, u8) version is interesting. It looks like the u8s get passed into the function via separate registers. Rust's ABI is unstable and this can probably be done better.

But I really can't think of any good explanation for not optimizing away the lookup table on the enum.

LLVM is generally unwilling to introduce underaligned reads, so it's pretty common for it to not optimize adjacent u8s into a single 2-byte read.

Maybe with safe transmute we'll be able to specialize on the cases that can just use raw_eq from std::intrinsics, like we now do for arrays in a few places.

5 Likes

That's certainly part of it, but not the whole story: aligning the tuple changes the resulting code a bit, but still doesn't get it down to the single cmp of the u16 case.

#[repr(align(2))]
#[derive(PartialEq, Eq)]
pub struct MyTuple(pub u8, pub u8);

pub fn compare_aligned_u8_tuple(x1: MyTuple, x2: MyTuple) -> bool {
    x1 == x2
}
The resulting assembly:

example::compare_aligned_u8_tuple:
        xor     edi, esi
        mov     eax, edi
        shr     eax, 8
        or      al, dil
        sete    al
        ret
3 Likes

Ha, beat me 🙂
I basically tested exactly the same thing:

#[repr(align(2))]
#[derive(PartialEq, Eq)]
pub struct AlignedU8Tuple {
    tuple: (u8, u8),
}

pub fn compare_u8_tuple_aligned(x1: AlignedU8Tuple, x2: AlignedU8Tuple) -> bool {
    x1 == x2
}

(Produces the exact same code on godbolt.)
The benchmark result is basically the same performance as for the u16 variant:

cmp_aligned_2xu8 #2             time:   [1.1416 ns 1.1417 ns 1.1417 ns]                         
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

Setting the alignment to 2 sadly does not help the enum version.

The issue with the tuple can easily be fixed if you implement the comparison manually:
Compiler Explorer
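
For example, something along these lines (a sketch of the idea; the exact code behind the link may differ): pack each (u8, u8) into a single u16 so that LLVM can emit one 16-bit compare.

// Manual comparison that packs the two bytes into a u16 first.
// Sketch only; the code behind the Compiler Explorer link may differ.
pub fn compare_u8_tuple_manual(x1: (u8, u8), x2: (u8, u8)) -> bool {
    u16::from_ne_bytes([x1.0, x1.1]) == u16::from_ne_bytes([x2.0, x2.1])
}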

But the issue with the enum is not helped that way. LLVM in general is notoriously weak at handling large switches and likes to use lookup tables for them, and that is exactly how the autogenerated eq is constructed.

Again on mobile, so I can't test it:
Would it help the original question to retrieve the variant index and the associated u8 and then compare them as if they were a u8 tuple?
(Obviously not a nice solution, but it might work and if the enum is in fact compared very frequently...)

Yes, this produces much better code: Compiler Explorer
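
Roughly along these lines, reusing the illustrative Flag enum from above (a sketch of the idea; the actual code is behind the Compiler Explorer link):

// Map each value to a (discriminant index, payload) pair and compare those,
// so the comparison looks like the u8-tuple case instead of a big match.
impl Flag {
    #[inline]
    fn index_and_payload(self) -> (u8, u8) {
        match self {
            Flag::A(v) => (0, v),
            Flag::B(v) => (1, v),
            Flag::C(v) => (2, v),
            Flag::D(v) => (3, v),
        }
    }
}

pub fn compare_flags_via_tuple(x1: Flag, x2: Flag) -> bool {
    x1.index_and_payload() == x2.index_and_payload()
}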

That is much better! That would make a good default in this case. Though you lose that optimization once you start to introduce other types (even if all the types are the same size and use the same comparison instruction), or if you introduce a unit variant.

If you have different types, then you can still get a decent result: Compiler Explorer (although the way it compares the two bytes is funny).
But if you start putting in differently-sized payloads, then this optimization just stops being valid, since the Rust compiler is not obliged to initialize padding.
This is a conscious decision, because it makes it easier to generate optimized code for cases where the chosen variant is known, but it makes comparison harder.
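
For illustration (a hypothetical enum, not from the thread): as soon as one variant's payload is smaller than another's, the smaller variant leaves padding bytes that the compiler may leave uninitialized, so a raw byte-for-byte comparison of the whole value is no longer valid.

// Hypothetical example of the differently-sized-payload case.
pub enum Mixed {
    A(u8),  // payload uses 1 byte; the rest of the payload area is padding
    B(u32), // forces the payload area to 4 bytes (and alignment 4)
}
// For Mixed::A, the compiler is free to leave the remaining payload bytes
// uninitialized, so comparing the raw bytes of two Mixed values can read
// uninitialized memory and is therefore not a valid implementation of eq.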
