Matching against enums, discrepancy between assembly and performance

I wrote some types that wrap an i32 or a [u8;4] inside an enum, then benchmarked them with Criterion (execution times are written as comments below). I also inspected the generated assembly with godbolt. What I see leaves me with a lot of questions.

First things first, the code:

#[derive(Copy, Clone)]
pub enum Wrapper1 {
    A(i32),
}

impl Wrapper1 {
    #[inline(always)]
    pub fn get_inner(self) -> i32 {
        match self {
            Wrapper1::A(x) => x,
        }
    }
}

// get_inner for Wrapper1 took ~= 610 ps

#[derive(Copy, Clone)]
pub enum Wrapper2 {
    A([u8;4]),
}

impl Wrapper2 {
    #[inline(always)]
    pub fn get_inner(self) -> [u8;4] {
        match self {
            Wrapper2::A(x) => x,
        }
    }
}

// get_inner for Wrapper2 took ~= 610 ps

#[derive(Copy, Clone)]
pub enum Wrapper3 {
    A([u8;4]),
    B([u8;4]),
}

impl Wrapper3 {
    #[inline(always)]
    pub fn get_inner(self) -> [u8;4] {
        match self {
            Wrapper3::A(x) => x,
            Wrapper3::B(x) => x,
        }
    }
}

// get_inner for Wrapper3 took ~= 6.15 ns

#[derive(Copy, Clone)]
pub enum Wrapper4 {
    A([u8;4]),
    B([u8;4]),
    C([u8;4]),
    D([u8;4]),
    E([u8;4]),
}
impl Wrapper4 {
    #[inline(always)]
    pub fn get_inner(self) -> [u8;4] {
        match self {
            Wrapper4::A(x) => x,
            Wrapper4::B(x) => x,
            Wrapper4::C(x) => x,
            Wrapper4::D(x) => x,
            Wrapper4::E(x) => x,
        }
    }
}

// get_inner for Wrapper4 took ~= 6.15 ns

#[derive(Copy, Clone)]
pub enum Wrapper5 {
    A(i32),
    B(i32),
}

impl Wrapper5 {
    #[inline(always)]
    pub fn get_inner(self) -> i32 {
        match self {
            Wrapper5::A(x) => x,
            Wrapper5::B(x) => x,
        }
    }
}

// get_inner for Wrapper5 took ~= 900 ps

As you can see above, accessing the wrapped value for enums with just one variant took the same time for i32 and [u8;4] (~= 610 ps).
But accessing the wrapped value for enums with two (or more) variants takes much longer for enums that wrap a [u8;4] (~= 6.15 ns) than for the ones that wrap an i32 (~= 900 ps).
It's also interesting to note that accessing an i32 value inside an enum took a broadly similar time whether the enum has one variant (~= 610 ps) or two (~= 900 ps). That's not true for enums that wrap a [u8;4].

Now I'm very curious because:

  1. I was expecting the compiler to optimize away all the match branches and just return the inner value.
  2. I can't explain the big difference in performance between such similar types.

So I inspected the assembly generated by the above code with godbolt, and I get this for the i32 wrapper and that for the [u8;4] wrapper.

What I notice is that the assembly generated for matching against the variants of enums that wrap an i32 is exactly the same for enums with one or more variants. By contrast, the assembly generated for the enums that wrap a [u8;4] is very similar but not identical: the enums with more than one variant have one extra instruction.

Now more questions arise:
3. If the generated assembly is the same in the i32 case, why do enums that wrap i32 values perform differently when accessing the inner value?
4. In the [u8;4] case, why are enums with more than one variant so much slower at accessing the inner value than the enums in the i32 case, when they have just one extra instruction?

If someone wants to run the benchmarks, they are here.

Maybe it's because it is using word-aligned loads for i32, but it can't for [u8; 4]. The only real difference between the two types is their alignment.
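To make the alignment difference concrete, here is a minimal sketch that prints the size and alignment of the payload types and of the two-variant wrappers from the question. The helper `layout_of` is a name made up for this illustration, and the enum layouts are whatever the compiler picks for the default representation, so the exact numbers may vary by target and compiler version. On a mainstream target such as x86-64 one would typically expect alignment 1 for the [u8; 4] types and alignment 4 for the i32 types.

```rust
use std::mem::{align_of, size_of};

// Same definitions as in the question.
#[allow(dead_code)]
#[derive(Copy, Clone)]
pub enum Wrapper3 {
    A([u8; 4]),
    B([u8; 4]),
}

#[allow(dead_code)]
#[derive(Copy, Clone)]
pub enum Wrapper5 {
    A(i32),
    B(i32),
}

// Hypothetical helper (not from the original post):
// returns (size in bytes, alignment in bytes) of T.
pub fn layout_of<T>() -> (usize, usize) {
    (size_of::<T>(), align_of::<T>())
}

fn main() {
    println!("[u8; 4]  (size, align) = {:?}", layout_of::<[u8; 4]>());
    println!("i32      (size, align) = {:?}", layout_of::<i32>());
    println!("Wrapper3 (size, align) = {:?}", layout_of::<Wrapper3>());
    println!("Wrapper5 (size, align) = {:?}", layout_of::<Wrapper5>());
}
```

If the [u8; 4] enum reports alignment 1, a value of that type may sit at any byte offset, which would be consistent with the explanation above.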


Right, if everything is aligned to 4 bytes I get consistent times. Thanks!
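The thread does not show how the 4-byte alignment was achieved. One possible way, sketched here under that assumption, is to wrap the byte array in a newtype marked `#[repr(align(4))]`. The names `Aligned4` and `AlignedWrapper` are hypothetical and are not taken from the original benchmark code.

```rust
use std::mem::align_of;

// Hypothetical newtype: the same 4 bytes, but forced to 4-byte alignment.
#[derive(Copy, Clone)]
#[repr(align(4))]
pub struct Aligned4(pub [u8; 4]);

// Two-variant enum analogous to Wrapper3, wrapping the aligned payload.
#[derive(Copy, Clone)]
pub enum AlignedWrapper {
    A(Aligned4),
    B(Aligned4),
}

impl AlignedWrapper {
    #[inline(always)]
    pub fn get_inner(self) -> [u8; 4] {
        match self {
            AlignedWrapper::A(x) => x.0,
            AlignedWrapper::B(x) => x.0,
        }
    }
}

fn main() {
    let w = AlignedWrapper::B(Aligned4([1, 2, 3, 4]));
    // The payload comes back unchanged.
    assert_eq!(w.get_inner(), [1, 2, 3, 4]);
    println!("align_of::<Aligned4>() = {}", align_of::<Aligned4>());
    println!("align_of::<AlignedWrapper>() = {}", align_of::<AlignedWrapper>());
}
```

Whether this reproduces the consistent timings reported above would need to be checked by re-running the benchmark; the sketch only shows how the alignment can be raised.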
