Odd 2x performance gap with structs of different sizes

I've got a more detailed writeup and benchmark code, but here's a summary of the issue I'm having.

The workload in question (a simplified version of real code) is quite small: read a few bytes from a Read, then return a struct. See the link for the source; there are various flavors of this workload, but they're all the same idea.
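For readers who don't want to click through, each flavor is roughly of this shape. This is just a minimal sketch with made-up names (Record, FancyError, read_record), not the actual benchmark source:

```rust
use std::io::Read;

// Hypothetical error type: the "fancy" error wraps the I/O failure,
// in one flavor carrying the std::io::ErrorKind along.
#[derive(Debug)]
enum FancyError {
    Io(std::io::ErrorKind),
}

// Hypothetical record with one Option<u16> and two Option<u8> fields.
#[derive(Debug)]
struct Record {
    a: Option<u16>,
    b: Option<u8>,
    c: Option<u8>,
}

// Read a few bytes and build the struct, mapping the I/O error.
fn read_record<R: Read>(r: &mut R) -> Result<Record, FancyError> {
    let mut buf = [0u8; 4];
    r.read_exact(&mut buf).map_err(|e| FancyError::Io(e.kind()))?;
    Ok(Record {
        a: Some(u16::from_le_bytes([buf[0], buf[1]])),
        b: Some(buf[2]),
        c: Some(buf[3]),
    })
}
```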

I'm getting counterintuitive relative performance across the different struct sizes. A struct with three Option<u8> fields and another with one Option<u16> and five Option<u8> fields are both fast, but a struct with one Option<u16> and two Option<u8> fields is half as fast. In other words, both the smaller and the larger struct are 2x as fast as the one in between. Moreover, the gap only shows up when the Err values (from the I/O) are mapped in a particular way, and only for that struct size.
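Concretely, the three shapes look like this (field names are my own; only the mix of Options matches the benchmark), and printing their sizes shows how differently they lay out:

```rust
use std::mem::size_of;

// Hypothetical definitions matching the three shapes described above.
struct ThreeU8 {
    a: Option<u8>,
    b: Option<u8>,
    c: Option<u8>,
}

struct OneU16TwoU8 {
    a: Option<u16>,
    b: Option<u8>,
    c: Option<u8>,
}

struct OneU16FiveU8 {
    a: Option<u16>,
    b: Option<u8>,
    c: Option<u8>,
    d: Option<u8>,
    e: Option<u8>,
    f: Option<u8>,
}

fn main() {
    // u8 and u16 have no niche (every bit pattern is a valid value),
    // so each Option needs its own discriminant byte plus padding;
    // the struct sizes therefore grow faster than the raw field widths.
    println!("3x Option<u8>:                  {} bytes", size_of::<ThreeU8>());
    println!("1x Option<u16> + 2x Option<u8>: {} bytes", size_of::<OneU16TwoU8>());
    println!("1x Option<u16> + 5x Option<u8>: {} bytes", size_of::<OneU16FiveU8>());
}
```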

What could be causing this?

Total amateur here, but here's what I notice on x86_64:

I added #[no_mangle] tags to the functions in benchers and generated disassembly for them. Very little is visibly different between the read_... functions from benchers themselves, but it appears that they have also been inlined into a function in test.
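For reference, the marking step is just this (a hypothetical stand-in function, not one of the actual benchers; the function has to be non-generic for #[no_mangle] to apply):

```rust
use std::io::{Read, Result};

// With #[no_mangle] the symbol keeps its source name instead of a
// hashed mangled one, so it's easy to find in `objdump -d` output or
// in perf's annotated view. The reader type is concrete here because
// #[no_mangle] can't be used on generic functions (&[u8] implements Read).
#[no_mangle]
pub fn read_three_bytes(input: &mut &[u8]) -> Result<[u8; 3]> {
    let mut buf = [0u8; 3];
    input.read_exact(&mut buf)?;
    Ok(buf)
}
```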

Let's focus on these three:

       #[bench]ed function                               func where inlining occurred
(slow) read_1xu16_2xu8_fancy_error_result_error_kind <=> _ZN4test13ns_iter_inner17h7138bd88f56fd1e8E
(fast) read_1xu16_2xu8_fancy_error_bare_error_kind   <=> _ZN4test13ns_iter_inner17h54822161ff6ede4eE
(fast) read_1xu16_5xu8_fancy_error_result_error_kind <=> _ZN4test13ns_iter_inner17h37bb7a4c059ff3a8E

Here I've chosen one slow function and two fast ones that differ from it in different ways.

Comparing the ns_iter_inners for the two fast ones, I see no big difference. So far so good. (This is the control!)

Comparing the slow one with a fast one... indeed, they're quite different. And not just by a couple of instructions; it looks like a large piece of control flow has shifted between them. (Explaining why is far out of my league!)

Here are disassembled sources for a slow one and a fast one; look at them in your favorite diff tool.

I agree, the way LLVM has decided to handle the control flow is very different. When I run the different cases under perf or oprofile (which of course show the same disassembly), I see quite different hot spots, which isn't surprising.

I guess my real question is: why does LLVM/rustc treat the slow cases differently?

In the real code I simply stopped including ErrorKind in the relevant enum variant, but that's not very satisfying.
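Roughly speaking, the change was of this shape (hypothetical names; the real enum has more variants):

```rust
use std::io::ErrorKind;

// Before the workaround: the variant carried the ErrorKind from the
// underlying I/O error.
enum ErrorBefore {
    Io(ErrorKind),
    UnexpectedEof,
}

// After: dropping the ErrorKind payload sidesteps the slow codegen
// path, but loses information about why the read failed.
enum ErrorAfter {
    Io,
    UnexpectedEof,
}
```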
