Unexpected behavior of closure inlining in a generic lib function

I am currently experimenting with a generic, functional approach to deserializing arbitrary structs: I pass an array of functions to a generic deserialize function, and each function in the array deserializes one struct member.

I have been comparing the optimized assembly of the generic function with a hard-coded variant (using cargo asm) and could not see any difference within the lib itself. In my criterion benchmarks, however, the functional variant was always ~110% slower.

So I investigated further and found that it makes a difference whether the generic function is called from inside the lib or from main.rs. I wrote an example that calls both from main: once calling the generic function directly, and once calling a non-generic lib function that wraps the generic lib function. Then I disassembled the binary, and it showed (like the benchmarks) that the version instantiated in the lib is far better optimized.

I would have put this down to missing inlining, but it looks like the closures are inlined as well, just not optimized as a whole the way they are in the lib version.

In general, I would like to know how a generic function from a lib is compiled. Since it can only be compiled once it is instantiated in main, I would have expected both functions to be optimized the same way.

If you want to try it out yourself, you can find a repo with the code and description to get the asm here:

https://github.com/tjensen42/rust-lib-closure-test

// lib.rs
use bytes::Buf;

#[derive(Debug, Default)]
pub struct Color {
    pub r: u8,
    pub g: u8,
    pub b: u8,
}

// Deserialize some struct with 3 Fields
#[inline(never)]
pub fn deserialize_struct<S, F>(reader: &mut &[u8], data: &mut S, func: &[F; 3])
where
    F: Fn(&mut &[u8], &mut S),
{
    if reader.remaining() >= 3 {
        func[0](reader, data);
        func[1](reader, data);
        func[2](reader, data);
    }
}

// Instantiate the generic function inside the library
pub fn deserialize_color_generic(reader: &mut &[u8], color: &mut Color) {
    deserialize_struct(reader, color, &DESER_COLOR_CLOSURE_LIB)
}

// Closure array to deserialize a Color struct
pub const DESER_COLOR_CLOSURE_LIB: [fn(&mut &[u8], &mut Color); 3] = [
    |r, s| s.r = r.get_u8(),
    |r, s| s.g = r.get_u8(),
    |r, s| s.b = r.get_u8(),
];

// main.rs
use std::hint::black_box;
use bytes::Buf;
use rust_lib_closure_test::{deserialize_color_generic, deserialize_struct, Color};

pub fn main() {
    let buf: Vec<u8> = Vec::from([0x01, 0x02, 0x03]);

    // Call the generic function indirectly (through the wrapper in the lib)
    let mut color = Color::default();
    let cursor = &mut buf.as_slice();
    deserialize_color_generic(black_box(cursor), &mut color);
    println!("color: {:?}", color);

    // Call the generic function directly
    let mut color = Color::default();
    let cursor = &mut buf.as_slice();
    deserialize_struct(black_box(cursor), &mut color, &DESER_COLOR_CLOSURE_BIN);
    println!("color: {:?}", color);
}

const DESER_COLOR_CLOSURE_BIN: [fn(&mut &[u8], &mut Color); 3] = [
    |r, s| s.r = r.get_u8(),
    |r, s| s.g = r.get_u8(),
    |r, s| s.b = r.get_u8(),
];

; ASM of deserialize_struct called from inside lib.rs
0000000000008c90 <_ZN21rust_lib_closure_test18deserialize_struct17h7cac8bcc2fad5118E>:
    8c90:   48 8b 47 08             mov    0x8(%rdi),%rax
    8c94:   48 83 f8 02             cmp    $0x2,%rax
    8c98:   76 25                   jbe    8cbf <_ZN21rust_lib_closure_test18deserialize_struct17h7cac8bcc2fad5118E+0x2f>
    8c9a:   48 8b 0f                mov    (%rdi),%rcx
    8c9d:   0f b6 11                movzbl (%rcx),%edx
    8ca0:   88 16                   mov    %dl,(%rsi)
    8ca2:   0f b6 51 01             movzbl 0x1(%rcx),%edx
    8ca6:   88 56 01                mov    %dl,0x1(%rsi)
    8ca9:   0f b6 51 02             movzbl 0x2(%rcx),%edx
    8cad:   48 83 c1 03             add    $0x3,%rcx
    8cb1:   48 83 c0 fd             add    $0xfffffffffffffffd,%rax
    8cb5:   48 89 0f                mov    %rcx,(%rdi)
    8cb8:   48 89 47 08             mov    %rax,0x8(%rdi)
    8cbc:   88 56 02                mov    %dl,0x2(%rsi)
    8cbf:   c3                      ret

; ASM of deserialize_struct called from main.rs
0000000000008b90 <_ZN21rust_lib_closure_test18deserialize_struct17h78300619e9e5bdc1E>:
    8b90:   41 57                   push   %r15
    8b92:   41 56                   push   %r14
    8b94:   53                      push   %rbx
    8b95:   48 83 7f 08 02          cmpq   $0x2,0x8(%rdi)
    8b9a:   76 26                   jbe    8bc2 <_ZN21rust_lib_closure_test18deserialize_struct17h78300619e9e5bdc1E+0x32>
    8b9c:   49 89 d7                mov    %rdx,%r15
    8b9f:   49 89 f6                mov    %rsi,%r14
    8ba2:   48 89 fb                mov    %rdi,%rbx
    8ba5:   ff 12                   call   *(%rdx)
    8ba7:   48 89 df                mov    %rbx,%rdi
    8baa:   4c 89 f6                mov    %r14,%rsi
    8bad:   41 ff 57 08             call   *0x8(%r15)
    8bb1:   49 8b 47 10             mov    0x10(%r15),%rax
    8bb5:   48 89 df                mov    %rbx,%rdi
    8bb8:   4c 89 f6                mov    %r14,%rsi
    8bbb:   5b                      pop    %rbx
    8bbc:   41 5e                   pop    %r14
    8bbe:   41 5f                   pop    %r15
    8bc0:   ff e0                   jmp    *%rax
    8bc2:   5b                      pop    %rbx
    8bc3:   41 5e                   pop    %r14
    8bc5:   41 5f                   pop    %r15
    8bc7:   c3                      ret
    8bc8:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
    8bcf:   00

Full source code: https://github.com/tjensen42/rust-lib-closure-test

I also tried using DESER_COLOR_CLOSURE_LIB in main(), but it makes no difference.

You might want to try adjusting profile.release.lto in Cargo.toml to see if it makes a difference.
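
For reference, a minimal sketch of what that setting could look like in Cargo.toml. The values shown are just one option to try, not a recommendation for this repo:

```toml
# Cargo.toml (sketch): enable link-time optimization for release builds
[profile.release]
lto = "fat"        # "thin" is a cheaper alternative; false is the default
codegen-units = 1  # optional: fewer codegen units can give the optimizer a wider view
```

With LTO, the optimizer can see the code of the lib and the binary together at link time, which is why it can change inlining decisions across the crate boundary.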

I think the difference is that DESER_COLOR_CLOSURE_LIB contains non-generic functions, which don't get inlined into different crates (the code to do that isn't available). Try adding #[inline] to each closure to make it available for inlining. (I'm not sure whether them being inside a const with function pointer types will interfere, though.)
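
To illustrate the point about function pointer types, here is a minimal sketch using only std. The names FieldFn, size_of_callable and callable_sizes are made up for this illustration and are not part of the repo. A non-capturing closure has its own zero-sized type, so a call through it is resolved at compile time. Once it is coerced to fn(...), as happens inside the const array, it becomes an ordinary pointer-sized value, and a generic function instantiated with that pointer type has to make an indirect call unless the optimizer can see which function the pointer refers to.

```rust
use std::mem::size_of_val;

/// Hypothetical alias: a field reader stored as a plain function pointer.
type FieldFn = fn(&mut &[u8], &mut u8);

/// Size in bytes of whatever callable value is passed in.
fn size_of_callable<F: Fn(&mut &[u8], &mut u8)>(f: &F) -> usize {
    size_of_val(f)
}

/// Returns (size of the closure's own type, size after coercion to a fn pointer).
pub fn callable_sizes() -> (usize, usize) {
    // Non-capturing closure: it has a unique, anonymous, zero-sized type.
    let closure = |r: &mut &[u8], out: &mut u8| {
        *out = r[0];
        *r = &r[1..];
    };
    let as_closure = size_of_callable(&closure);

    // Coercion to a function pointer: the concrete closure type is erased.
    let ptr: FieldFn = closure;
    let as_pointer = size_of_callable(&ptr);

    (as_closure, as_pointer)
}
```

On a typical target, callable_sizes() reports 0 bytes for the closure and one pointer width for the fn pointer.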

Thank you for the quick reply. In fact, the assembly is then the same. I don't have any benchmarks for the simplified version yet, but they should be equally fast in this case.

Nevertheless, for more complex problems I unfortunately have not achieved the same benchmark results with lto = "fat"; in more complex cases the code is probably inlined differently again.

And overall, my benchmarks with LTO are unfortunately always slower than without LTO.

The example seems to show that the difference does come from the crate and the binary being compiled separately. I still wonder how that is possible for a generic function.

00000000000106c0 <_ZN21rust_lib_closure_test18deserialize_struct17h4cacb9a8beb21afbE>:
   106c0:	48 8b 47 08          	mov    0x8(%rdi),%rax
   106c4:	48 83 f8 02          	cmp    $0x2,%rax
   106c8:	76 47                	jbe    10711 <_ZN21rust_lib_closure_test18deserialize_struct17h4cacb9a8beb21afbE+0x51>
   106ca:	48 8b 0f             	mov    (%rdi),%rcx
   106cd:	44 0f b6 01          	movzbl (%rcx),%r8d
   106d1:	48 8d 51 01          	lea    0x1(%rcx),%rdx
   106d5:	4c 8d 48 ff          	lea    -0x1(%rax),%r9
   106d9:	48 89 17             	mov    %rdx,(%rdi)
   106dc:	4c 89 4f 08          	mov    %r9,0x8(%rdi)
   106e0:	44 88 06             	mov    %r8b,(%rsi)
   106e3:	44 0f b6 41 01       	movzbl 0x1(%rcx),%r8d
   106e8:	48 8d 51 02          	lea    0x2(%rcx),%rdx
   106ec:	4c 8d 48 fe          	lea    -0x2(%rax),%r9
   106f0:	48 89 17             	mov    %rdx,(%rdi)
   106f3:	4c 89 4f 08          	mov    %r9,0x8(%rdi)
   106f7:	44 88 46 01          	mov    %r8b,0x1(%rsi)
   106fb:	0f b6 51 02          	movzbl 0x2(%rcx),%edx
   106ff:	48 83 c1 03          	add    $0x3,%rcx
   10703:	48 83 c0 fd          	add    $0xfffffffffffffffd,%rax
   10707:	48 89 0f             	mov    %rcx,(%rdi)
   1070a:	48 89 47 08          	mov    %rax,0x8(%rdi)
   1070e:	88 56 02             	mov    %dl,0x2(%rsi)
   10711:	c3                   	ret
   10712:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
   10719:	00 00 00
   1071c:	0f 1f 40 00          	nopl   0x0(%rax)

The r.get_u8() calls come from the bytes crate; I import the trait with use bytes::Buf;. So in both cases they come from a different crate.

Putting #[inline] on the closures does not make any difference. (But nice to know; I did not know that was possible.)
I tried it like this:

const DESER_COLOR_CLOSURE_BIN: [fn(&mut &[u8], &mut Color); 3] = [
    #[inline]
    |r, s| s.r = r.get_u8(),
    #[inline]
    |r, s| s.g = r.get_u8(),
    #[inline]
    |r, s| s.b = r.get_u8(),
];

So I found out that it's probably just because of the #[inline(never)] on the generic lib function...

When I call the generic deserialize_struct through a wrapper in main.rs, as I did in lib.rs, the generic function is inlined and optimized just like the lib version.

So this code results in the same optimized assembly. I compared deserialize_color_generic and deserialize_color_generic_main:

pub fn main() {
    let buf: Vec<u8> = Vec::from([0x01, 0x02, 0x03]);

    // Call the generic function indirectly (through the wrapper in the lib)
    let mut color = Color::default();
    let cursor = &mut buf.as_slice();
    deserialize_color_generic(black_box(cursor), &mut color);
    println!("color: {:?}", color);

    // Call the generic function through a wrapper defined in main.rs
    let mut color = Color::default();
    let cursor = &mut buf.as_slice();
    deserialize_color_generic_main(cursor, &mut color);
    println!("color: {:?}", color);
}

const DESER_COLOR_CLOSURE_BIN: [fn(&mut &[u8], &mut Color); 3] = [
    |r, s| s.r = r.get_u8(),
    |r, s| s.g = r.get_u8(),
    |r, s| s.b = r.get_u8(),
];

// Instantiate the generic function in main.rs
#[inline(never)]
fn deserialize_color_generic_main(reader: &mut &[u8], color: &mut Color) {
    deserialize_struct(reader, color, &DESER_COLOR_CLOSURE_BIN)
}

With this, the original question is more or less answered. However, the behavior with #[inline(never)] on the generic function is still not clear to me. If anyone has an explanation, I would be glad to hear it.
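
One possible explanation, offered as an assumption rather than something verified against the repo: because the array element type is fn(...), deserialize_struct is instantiated with F being a function pointer. The instance then only sees three pointers loaded from the array, and it can resolve them to the closure bodies only if the optimizer knows the array contents, for example after inlining into a caller that passes a constant. The sketch below shows a variant that does not depend on that. It gives each field reader its own type parameter, so every closure keeps its concrete type. deserialize_struct3, deserialize_color and take_u8 are hypothetical names, and take_u8 stands in for bytes::Buf::get_u8 so the example needs no external crate.

```rust
/// Stand-in for the thread's `Color` struct.
#[derive(Debug, Default, PartialEq)]
pub struct Color {
    pub r: u8,
    pub g: u8,
    pub b: u8,
}

/// Hypothetical helper: pops one byte off the front of the slice
/// (plays the role of `bytes::Buf::get_u8` in this sketch).
fn take_u8(reader: &mut &[u8]) -> u8 {
    let byte = reader[0];
    *reader = &reader[1..];
    byte
}

/// Variant of `deserialize_struct` with one type parameter per field reader,
/// so each closure is passed with its own concrete (zero-sized) type.
#[inline(never)]
pub fn deserialize_struct3<S, F0, F1, F2>(reader: &mut &[u8], data: &mut S, funcs: (F0, F1, F2))
where
    F0: Fn(&mut &[u8], &mut S),
    F1: Fn(&mut &[u8], &mut S),
    F2: Fn(&mut &[u8], &mut S),
{
    if reader.len() >= 3 {
        (funcs.0)(reader, data);
        (funcs.1)(reader, data);
        (funcs.2)(reader, data);
    }
}

/// Non-generic wrapper that instantiates the generic function for `Color`.
pub fn deserialize_color(reader: &mut &[u8], color: &mut Color) {
    deserialize_struct3(
        reader,
        color,
        (
            |r: &mut &[u8], s: &mut Color| s.r = take_u8(r),
            |r: &mut &[u8], s: &mut Color| s.g = take_u8(r),
            |r: &mut &[u8], s: &mut Color| s.b = take_u8(r),
        ),
    );
}
```

Since the closure types are part of the instantiation here, I would expect the three calls to be resolved statically inside deserialize_struct3 even with #[inline(never)] on it, but I have not benchmarked this or checked the assembly. The trade-off is that the number of fields is baked into the signature instead of being an array length.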
