How do I inspect the binary output of cargo bench?

I'm writing a project to do some tests related to prefetch intrinsics.

As part of that, I have some benchmarks I'm struggling to make sense of. I'd like to inspect the binaries to see whether auto-vectorization or some other difference in the generated code explains the oddities I'm noticing.

So far I've tried:

objdump -D target/release/deps/libtest_prefetch-ee840e119cfd63ca.rlib | grep my_test_function

(test_prefetch is the name of the project)

But I can't seem to find the dump of my benchmarks. I don't have much experience with objdump, and I don't really understand how cargo lays out its output, so I'm not sure where to look.

When reading blogs about Rust perf, a lot of people post their objdump output, so it should be feasible; I feel like I'm missing something obvious.

Often when you run cargo test or cargo bench it'll print the path to the executable containing the tests/benchmarks.

$ cargo test
   Compiling scad-compiler v0.1.0 (~/Documents/scad-rs/crates/compiler)
    Finished test [unoptimized + debuginfo] target(s) in 2.46s
     Running unittests src/ (~/Documents/scad-rs/target/debug/deps/scad_runtime-b1e7fa9756c915ec)

running 1 test
test value::tests::builtin_function_partialeq_only_works_on_identity ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

In this case, the executable would be ~/Documents/scad-rs/target/debug/deps/scad_runtime-b1e7fa9756c915ec (the bit after "Running unittests").

Now let's say I wanted to see the machine code that was generated for my value::tests::builtin_function_partialeq_only_works_on_identity function.

From there, I can pass it to objdump and ask it to --disassemble my code and --demangle symbol names. The demangling is important because then I can search for Rust names.

$ objdump --disassemble --demangle ~/Documents/scad-rs/target/debug/deps/scad_runtime-b1e7fa9756c915ec | less

Using less's built-in search to find tests::builtin_function_, I can skip past a couple of call instructions that mention the name before reaching the block containing the function's machine code.

0000000000015690 <scad_runtime::value::tests::builtin_function_partialeq_only_works_on_identity>:
   15690:       48 81 ec 48 01 00 00    sub    $0x148,%rsp
   15697:       e8 94 fe ff ff          call   15530 <scad_runtime::value::BuiltinFunction::new>
   1569c:       48 89 54 24 60          mov    %rdx,0x60(%rsp)
   156a1:       48 89 44 24 58          mov    %rax,0x58(%rsp)
   156a6:       48 8d 7c 24 58          lea    0x58(%rsp),%rdi
   156ab:       e8 50 06 00 00          call   15d00 <<scad_runtime::value::BuiltinFunction as core::clone::Clone>::clone>

I think the key difference is that I'm checking an executable I know will contain my function, whereas you are checking the rlib, which may not contain machine code for your function at all (e.g. the definitions of generic functions are embedded as metadata so they can be compiled to machine code by later crates).
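
To make the generic case concrete, here's a minimal sketch. The function names (largest, largest_u32) are made up for illustration and have nothing to do with your project.

```rust
// Generic: the compiler can only emit machine code for this once a caller
// picks a concrete `T`, which may happen in a downstream crate.
pub fn largest<T: PartialOrd + Copy>(items: &[T]) -> Option<T> {
    let mut best = *items.first()?;
    for &item in items {
        if item > best {
            best = item;
        }
    }
    Some(best)
}

// Not generic: machine code for this function (and for the `u32` instance
// of `largest` that it uses) is generated when this crate is compiled.
pub fn largest_u32(items: &[u32]) -> Option<u32> {
    largest(items)
}

fn main() {
    println!("{:?}", largest_u32(&[3, 9, 4]));
}
```

If a library crate only contained largest and never called it with a concrete type, I wouldn't expect to find a largest symbol when disassembling its rlib.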

Thanks! That was tremendously helpful.

Another thing that helped, in my case, was using #[inline(never)] on the functions I wanted to examine.
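
Roughly what that looks like, as a minimal sketch (sum_every_other is a placeholder name and body, not one of my actual benchmark functions):

```rust
use std::hint::black_box;

// Placeholder function under inspection. #[inline(never)] asks the compiler
// not to inline it into its callers, so it keeps its own symbol and its own
// block in the objdump output.
#[inline(never)]
pub fn sum_every_other(data: &[u64]) -> u64 {
    let mut total = 0u64;
    let mut i = 0;
    while i < data.len() {
        total = total.wrapping_add(data[i]);
        i += 2;
    }
    total
}

fn main() {
    let data: Vec<u64> = (0..16).collect();
    // black_box discourages the optimizer from precomputing the result.
    let result = sum_every_other(black_box(&data));
    println!("{result}");
}
```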

Ah yeah that's a good point.

It wasn't much of a concern in my example because Rust's test harness stores a list of function pointers, which means the compiler needs to generate at least one copy of builtin_function_partialeq_only_works_on_identity so we can get a pointer to it.
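
As a rough illustration of why a table of function pointers has that effect, here's a simplified stand-in. This is not the real test harness, which uses its own types; TESTS and the two functions are invented for the example.

```rust
// Two ordinary functions standing in for test functions.
fn adds_up() {
    assert_eq!(1 + 1, 2);
}

fn strings_concat() {
    assert_eq!(["a", "b"].concat(), "ab");
}

// Storing `fn()` pointers means every entry needs a real address, so the
// compiler has to emit a standalone copy of each function.
static TESTS: &[(&str, fn())] = &[
    ("adds_up", adds_up),
    ("strings_concat", strings_concat),
];

fn main() {
    // Call each function through its pointer, the way a runner would.
    for (name, test) in TESTS {
        test();
        println!("test {name} ... ok");
    }
}
```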

Another thing that can sometimes help is adding #[no_mangle] so the symbol name is more predictable, or writing dummy functions that just call your function and are annotated with #[inline(never)]. With the dummy-function approach, you can't forget to remove #[inline(never)] from your performance-sensitive function and accidentally commit it to master.
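
A minimal sketch of the wrapper idea, with made-up names (dot and inspect_dot are placeholders, not anything from your code):

```rust
// The real, performance-sensitive function carries no inlining attributes.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Inspection-only wrapper: it asks not to be inlined and is exported under
// the fixed symbol name `inspect_dot`, which is easy to search for.
// (In the 2024 edition the attribute is spelled #[unsafe(no_mangle)].)
#[no_mangle]
#[inline(never)]
pub fn inspect_dot(a: &[f32], b: &[f32]) -> f32 {
    dot(a, b)
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0];
    let b = [4.0_f32, 5.0, 6.0];
    println!("{}", inspect_dot(&a, &b));
}
```

Searching the disassembly for inspect_dot should then land on the wrapper, where dot will usually either be inlined into it or show up as a call you can follow.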

Right, but in these cases I'm only writing benchmarks anyway.

(And I'm already getting useful results. E.g. the function that I thought should be much faster actually has a bunch more jumps internally, for some reason.)

An additional question, for extra credit:

Once I've dumped my assembly, I can roughly see which instructions are executed and where the jumps are (thanks to --visualize-jumps in particular). What's the next step? Are there more tools that help me make sense of the generated assembly?

In particular, I'm wondering how the dumped assembly maps to the original code. I know that the Godbolt compiler explorer has an interface that shows those mappings with color-coding. Is there some equivalent in the console?

Also, is there a flag you can pass to objdump to only display sections under symbols matching a pattern?

(E.g. if I want to display the code of functions foo_1, foo_2 and foo_bar, I'd like to pass something like --filter "foo*". Is there a way to do that?)

If you want to do a more detailed analysis, it might be time to crack out a fully functional disassembler/reverse engineering tool.

One command-line tool that I used a while back is radare, which is kinda like the vim of reverse engineering. It's super powerful and you can be really productive once you are fluent with it, but there is a non-trivial learning curve associated with it.

Otherwise, you can always use gdb to trace execution and see jumps.

I don't know of a Godbolt-like CLI tool off the top of my head, but in theory it's possible because Godbolt just uses debug information to associate instructions with places in the source code. I wouldn't be surprised if Radare had a similar function, otherwise some debuggers will show disassembled machine code with the corresponding source code as a comment next to it.

You can probably use the --section flag if you know which section your function will be in, but that might be tricky to find out in practice. Normally I just pipe into less and use search to find the things I care about.
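
If searching in less gets tedious, a small filter over objdump's output is another option. Here's a minimal sketch in Rust, assuming the usual layout where each function starts with a header line like `0000000000015690 <symbol>:` and blocks are separated by blank lines. The names header_symbol and filter_blocks are my own, and it does plain substring matching rather than real glob patterns.

```rust
use std::io::{self, Read};

/// Returns the symbol name if `line` looks like an objdump block header,
/// e.g. `0000000000015690 <my_crate::foo>:`.
fn header_symbol(line: &str) -> Option<&str> {
    let (addr, rest) = line.split_once(" <")?;
    if addr.is_empty() || !addr.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    rest.strip_suffix(">:")
}

/// Keeps only the blocks whose header symbol contains `needle`.
fn filter_blocks(disassembly: &str, needle: &str) -> String {
    let mut out = String::new();
    let mut keep = false;
    for line in disassembly.lines() {
        if let Some(symbol) = header_symbol(line) {
            // A new block starts here; decide whether to print it.
            keep = symbol.contains(needle);
        } else if line.trim().is_empty() {
            // A blank line ends the current block.
            keep = false;
        }
        if keep {
            out.push_str(line);
            out.push('\n');
        }
    }
    out
}

fn main() -> io::Result<()> {
    // First argument is the substring to look for; empty matches everything.
    let needle = std::env::args().nth(1).unwrap_or_default();
    let mut input = String::new();
    io::stdin().read_to_string(&mut input)?;
    print!("{}", filter_blocks(&input, &needle));
    Ok(())
}
```

Built as a binary called, say, asm-filter (a made-up name), you would pipe the objdump --disassemble --demangle output for your executable into `asm-filter foo_` to see only the blocks for foo_1, foo_2 and foo_bar.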