Potential missed optimization regarding multiple match expressions

I was looking at the source code for char::is_ascii_hexdigit(), and found, to my surprise, that it used this code:

pub const fn is_ascii_hexdigit(&self) -> bool {
    matches!(*self, '0'..='9') | matches!(*self, 'A'..='F') | matches!(*self, 'a'..='f')
    
    // EXPANDS TO (not part of the function):
    (match var {
        '0'..='9' => true,
        _ => false,
    }) | match var {
        'A'..='F' => true,
        _ => false,
    } | match var {
        'a'..='f' => true,
        _ => false,
    }
    
}

Instead of what would be the neater version:

pub const fn is_ascii_hexdigit(&self) -> bool {
    matches!(*self, '0'..='9' | 'A'..='F' | 'a'..='f')
    
    // EXPANDS TO (not part of the function):
    match var {
        '0'..='9' | 'A'..='F' | 'a'..='f' => true,
        _ => false,
    }
}

To my even greater surprise, God said the standard library variant gets optimized better (or, I assume it's better, since it avoids a branch and has less instructions), while I expected them both to be optimized down to the same thing. I'm not sure what's going on there. Going from either to the other seems like a trivial transformation to me.

Some things I found:

  • If you use the expanded form of the standard library variant and edit it to use the boolean comparison || instead of the BitOr | that it uses, the binary changes into the variant with the branch, and both functions become a call to that same code segment. I feel like some optimization pass could check if | and || are equivalent in a given context, and use the better one, if that makes a difference.
  • I dug into which optimizations could be performed prior to handing it off to the codegen backend, and learned about MIR transformations. I found this MIR transformation for optimizing match branches. I wasn't able to make much sense of the logic there, so I don't know whether this pass could be expanded upon or tuned to expand match expressions like these into the faster variant.
3 Likes

You'll probably be interested in the various investigations that happened in Make is_ascii_hexdigit branchless by GKFX · Pull Request #103024 · rust-lang/rust · GitHub

Going from || to | is hard in general for LLVM, because of the short-circuiting. LLVM doesn't want to add new "unnecessary" load instructions that came from the not-always-executed part, for example.

Sometimes it'll notice that it's cheaper just to evaluate it anyway, but you'll often see a big difference between things taking T and things taking &T for that. Note that is_ascii_hexdigit is &self -- for historical reasons only, as it would be better as self -- which IIRC also makes things more awkward here.

But it's often easier for LLVM to replace | with || by proving that it's ok to not do something, rather that proving that it's faster to do something "unnecessary".

It'd be great if LLVM figured this out perfectly more often, but it's hard. And for the is_ascii_hexdigit case specifically, it's actually extra difficult because it ends up triggering different clever tricks to make it branchless, in certain scenarios, which end up hiding the optimization that you actually expected.

9 Likes

This shouldn't be an issue here (in theory) because the reference is unconditionally read from and the noalias annotation permits consolidating the reads. If LLVM doesn't particularly like doing so, the source could be rewritten to load a local char value first.

But in general, indirections are a common cause of missed optimization.

The underlying issue with this is that knowing which is better is hard. Generally, doing less computation is preferable, and switching to branchless is usually only beneficial if it can be further optimized. So this transform would likely need to be fused with the pass that optimizes the bitwise version to avoid pessimizing more cases than it improves.

What would likely be the most beneficial would be if LLVM had a range-based switch instruction that could be used to more directly encode the pattern match condition instead of it being a series of comparisons. If we use a match that lists each possibly individually instead of with ranges, we get a different branchless codegen.

pub const fn case3(&c: &char) -> bool {
    matches!(c,
        | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
        | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
        | 'a' | 'b' | 'c' | 'd' | 'e' | 'f'
    )
}
2 Likes

This is part of why this specific example is hard for the optimizer. Because it needs to not go "oh, I see, you're matching a bunch of stuff, let me make a clever encoding of a lookup table", even though for many small perturbations that's the best answer.

To get the codegen people are expecting, it needs to realize the "well because of the exact properties of ascii we can just mask out a certain bit, and in fact none of the inputs we need to care about that, so we can do it always, and thus we have this better approach than what we'd normally do".

And, well, that's just not that obvious.

3 Likes

Thank you for the insightful comments, @scottmcm and @CAD97! And thanks for linking the pull request, I read the discussions on it and it cleared up a lot. It's not often I get nerd sniped about something very specific that I found by accident and there's a record of people that have been discussing it in detail, so that was convenient too.

2 Likes

I wonder why it doesn't have it while LLVM's main language does have it.

It seems like it taking a &char instead of a char results in one extra instruction:

mov     eax, dword ptr [rdi]

I guess they can't change this function now because of the backwards compatibility guarantees? Kind of a shame. Since char and u8 are both Copy I assume there's only downsides to taking a reference. You get an extra instruction, and you're sending more data: a pointer, whose size depends on the platform, but probably 32 or 64 bits, whereas a copied u8 or char is just 8 bits. Not quite familiar enough with all of this to be certain this is correct, but yeah.

Nit: char is 32 bits.

2 Likes

Realistically, the function is always inlined so it doesn't actually end up having that load if it's not needed.

But yeah, Copy stuff the size of a pointer or smaller should almost always be self instead of &self (and somewhat bigger than that too, but it's most obvious for ≤pointer-sized).

Ah, yeah that makes sense.
Any idea why it was decided to make these take a reference rather than a copy? Is/was there a good reason for it?

Historical reasons: It used to be https://doc.rust-lang.org/1.1.0/std/ascii/trait.AsciiExt.html, which needed to work on slices too, which are of course not Copy.

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.