Why doesn't the compiler inline this enum matching function?

Out of curiosity, rather than because of any actual perf issue, I'm looking at the code generated by matching against two functionally equivalent ways of using enums.

Specifically, this nested version:

#[derive(Clone, Copy)]
pub enum InputType {
    Mouse,
    Keyboard
}

#[derive(Clone, Copy)]
pub enum ButtonState {
    Usual,
    Hover(InputType),
    Pressed(InputType),
}

... and this version that separates out the variants:

#[derive(Clone, Copy)]
pub enum ButtonStateSep {
    Usual,
    HoverM,
    HoverK,
    PressedM,
    PressedK,
}

As a simplified version of what I want to happen in my actual program, I'm returning u8s instead of more complicated things.

pub fn input(s: InputType) -> u8 {
    use InputType::*;
    match s {
        Mouse => 7,
        Keyboard => 3,
    }
}

These three functionally equivalent functions each generate different machine code:

pub fn sep(s: ButtonStateSep) -> (u8, u8) {
    use ButtonStateSep::*;
    match s {
        Usual => (0, 0),
        HoverM => (1, input(InputType::Mouse)),
        HoverK => (1, input(InputType::Keyboard)),
        PressedM => (5, input(InputType::Mouse)),
        PressedK => (5, input(InputType::Keyboard)),
    }
}

pub fn nested(s: ButtonState) -> (u8, u8) {
    use ButtonState::*;
    match s {
        Usual => (0, 0),
        Hover(i) => (1, input(i)),
        Pressed(i) => (5, input(i)),
    }
}


macro_rules! input {
    ($input_type:expr) => {
        // The variant paths must be fully qualified here: without
        // `use InputType::*` in scope, a bare `Mouse` would be an
        // irrefutable binding pattern that matches everything.
        match $input_type {
            InputType::Mouse => 7,
            InputType::Keyboard => 3,
        }
    }
}

pub fn nested_inlined(s: ButtonState) -> (u8, u8) {
    use ButtonState::*;
    match s {
        Usual => (0, 0),
        Hover(i) => (1, input!(i)),
        Pressed(i) => (5, input!(i)),
    }
}
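As a sanity check that the versions really are functionally equivalent, here's a small harness (my own addition, assuming the obvious correspondence between ButtonState and ButtonStateSep variants):

```rust
#[derive(Clone, Copy)]
pub enum InputType { Mouse, Keyboard }

#[derive(Clone, Copy)]
pub enum ButtonState { Usual, Hover(InputType), Pressed(InputType) }

#[derive(Clone, Copy)]
pub enum ButtonStateSep { Usual, HoverM, HoverK, PressedM, PressedK }

pub fn input(s: InputType) -> u8 {
    match s { InputType::Mouse => 7, InputType::Keyboard => 3 }
}

pub fn sep(s: ButtonStateSep) -> (u8, u8) {
    use ButtonStateSep::*;
    match s {
        Usual => (0, 0),
        HoverM => (1, input(InputType::Mouse)),
        HoverK => (1, input(InputType::Keyboard)),
        PressedM => (5, input(InputType::Mouse)),
        PressedK => (5, input(InputType::Keyboard)),
    }
}

pub fn nested(s: ButtonState) -> (u8, u8) {
    use ButtonState::*;
    match s {
        Usual => (0, 0),
        Hover(i) => (1, input(i)),
        Pressed(i) => (5, input(i)),
    }
}

fn main() {
    // Pair each nested state with its separated counterpart and
    // check that both functions agree on every case.
    let cases = [
        (ButtonState::Usual, ButtonStateSep::Usual),
        (ButtonState::Hover(InputType::Mouse), ButtonStateSep::HoverM),
        (ButtonState::Hover(InputType::Keyboard), ButtonStateSep::HoverK),
        (ButtonState::Pressed(InputType::Mouse), ButtonStateSep::PressedM),
        (ButtonState::Pressed(InputType::Keyboard), ButtonStateSep::PressedK),
    ];
    for (n, s) in cases {
        assert_eq!(nested(n), sep(s));
    }
}
```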

sep generates a short sequence of instructions involving only copies and shifts, whereas nested generates longer code containing two branches. Finally, nested_inlined generates instructions similar, but not identical, to those generated by sep.

Compiler Explorer Link

I don't really care about the difference between the machine instructions for sep and nested per se, but I'm curious why the compiler didn't just do the same inlining I did, since even without converting to the bit-shifting version that would still be less code than what was actually produced, AFAICT.


So that you can get references to the InputType (inside ButtonState). In the inlined version, there is no way to get a reference to an InputType.

i.e. so that this works:

match &button_state {
    ButtonState::Hover(input_type) => (),
    ButtonState::Pressed(input_type) => (),
    ButtonState::Usual => (),
}
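To make that concrete: matching by reference hands out a real &InputType that points into the ButtonState's storage, so the payload has to exist at a stable location in memory. A minimal sketch (payload is a hypothetical helper, not anything from the thread's code):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum InputType { Mouse, Keyboard }

pub enum ButtonState { Usual, Hover(InputType), Pressed(InputType) }

// Hypothetical helper: borrows the InputType stored inside the enum.
// The returned reference points at the payload byte, which therefore
// must be laid out in memory exactly as a standalone InputType would be.
fn payload(s: &ButtonState) -> Option<&InputType> {
    match s {
        ButtonState::Hover(i) | ButtonState::Pressed(i) => Some(i),
        ButtonState::Usual => None,
    }
}

fn main() {
    let state = ButtonState::Hover(InputType::Keyboard);
    assert_eq!(payload(&state), Some(&InputType::Keyboard));
    assert!(payload(&ButtonState::Usual).is_none());
}
```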

Okay, I see how being able to take references restricts what the representation of the enum can be (two bytes instead of one, I suppose?). But given that the compiler was able to special-case that part once I inlined the match with a macro, I don't see why it couldn't notice that the input function was only a match statement and then do the same optimization itself.

If the input function were in another crate or something, then I could see why the inlining couldn't happen, but that's not the case here.

The layout of the enum determines how the match works.

For example, ButtonState and InputType would be laid out something like this:

InputType
[tag (1 byte)]

ButtonState
[tag (1 byte), data (1 byte)]

ButtonState::Hover and ButtonState::Pressed
[tag (1 byte), data: InputType (1 byte)]

ButtonState::Usual
[tag (1 byte), <uninit> (1 byte)]

So, in order to match on ButtonState, it first has to check the tag of ButtonState; only then can it know whether the InputType is even initialized. If the InputType is initialized, it copies it out and passes it to input to be handled.
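The sizes in that sketch can be checked directly with std::mem::size_of. The exact layout of a repr(Rust) enum is unspecified, but on current rustc these assertions hold (my own check, not from the thread):

```rust
use std::mem::size_of;

#[derive(Clone, Copy)]
pub enum InputType { Mouse, Keyboard }

#[derive(Clone, Copy)]
pub enum ButtonState { Usual, Hover(InputType), Pressed(InputType) }

#[derive(Clone, Copy)]
pub enum ButtonStateSep { Usual, HoverM, HoverK, PressedM, PressedK }

fn main() {
    // Fieldless enums with few variants fit their tag in one byte.
    assert_eq!(size_of::<InputType>(), 1);
    assert_eq!(size_of::<ButtonStateSep>(), 1);
    // ButtonState needs a tag byte *plus* a payload byte for the InputType.
    assert_eq!(size_of::<ButtonState>(), 2);
}
```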

If that is true, does it imply that if the second byte were always initialized, the match could compile down to fewer instructions (at the cost of needing to initialize the second byte during construction)?

If I add an unused input state or a Padding enum variant, then both nested and nested_inlined are longer and contain jumps. Maybe that's just because the compiler assumes that things are constructed more often than any one function is called on them?

If I change the semantics and make all the cases pass through input, then all three versions are short and do not include jumps. I'm not sure what that suggests, except that some matches are easier to optimize than others.

Maybe; I'm not sure whether those optimizations will happen, but in theory they could. Optimizations are finicky. Your last case seems to support this theory.

One other thing: if you look at the LLVM IR, you'll see that your InputType becomes an i1 (like a bool). In my experience, LLVM tends to handle two-valued things differently from three-or-more-valued things. (This causes problems in other places too: for example, it loses track of the fact that a Result discriminant is only 0 or 1, as it tends to replace the matches with != 0.)

I don't know for sure if that's actually impacting this, but it might be.