A weird benchmark result: identical code performs differently depending on where it is defined

test gbbq::gbbq_str                           ... bench:  62,806,212 ns/iter (+/- 4,198,622)
test gbbq::gbbq_str_defined_in_crate          ... bench:  26,883,857 ns/iter (+/- 1,706,824)
test gbbq::gbbq_string                        ... bench:  65,836,727 ns/iter (+/- 3,355,627)
test gbbq::gbbq_string_defined_in_crate       ... bench:  65,671,881 ns/iter (+/- 4,196,998)

test gbbq_async::gbbq_str                     ... bench:  63,689,579 ns/iter (+/- 3,565,807)
test gbbq_async::gbbq_str_defined_in_crate    ... bench:  27,713,700 ns/iter (+/- 2,540,403)
test gbbq_async::gbbq_string                  ... bench:  67,637,128 ns/iter (+/- 3,493,143)
test gbbq_async::gbbq_string_defined_in_crate ... bench:  28,840,113 ns/iter (+/- 2,735,867)

Let me explain some context.

  1. There are two mods in the bench file, gbbq and gbbq_async. The read fn in the gbbq mod uses std::fs::read, while the one in the gbbq_async mod uses tokio::fs::read.
  2. I wrote exactly the same code in my crate and in the benches/xx.rs file, as you can tell from the suffixes: the fns with the defined_in_crate suffix use two iterators defined in my crate, and those without the suffix use the copies in the bench file. The iterators yield identical items except for one field of the yielded struct, because I want to benchmark a &str field against a String field. The yielded structs look like this:
pub struct GbbqStr<'a> {
    s: &'a str,
    ..
}

pub struct GbbqString {
    s: String,
    ..
}
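To make the comparison concrete, here is a minimal standalone sketch of the difference between the two variants (the names RecStr/RecString and the parse functions are hypothetical stand-ins, not the real GbbqStr/GbbqString): the borrowed field is just a pointer and length into the input buffer, while the owned field requires a heap allocation and a copy per record.

```rust
// Hypothetical minimal versions of the two record types.
struct RecStr<'a> {
    s: &'a str, // zero-copy: a pointer + length into the input buffer
}

struct RecString {
    s: String, // owned: each record performs a heap allocation + memcpy
}

fn parse_borrowed(buf: &str) -> RecStr<'_> {
    RecStr { s: buf }
}

fn parse_owned(buf: &str) -> RecString {
    RecString { s: buf.to_owned() }
}

fn main() {
    let buf = String::from("some record bytes");
    let b = parse_borrowed(&buf);
    let o = parse_owned(&buf);
    // Both variants see the same data; only ownership differs.
    assert_eq!(b.s, o.s.as_str());
    println!("borrowed and owned agree: {}", b.s);
}
```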

Here is the bench code if you want to have a look.

mod gbbq {
    use super::*;

    #[cfg(feature = "bench-test")]
    #[bench]
    fn gbbq_string_defined_in_crate(b: &mut Bencher) {
        b.iter(|| {
             data()[4..].chunks_exact(29)
                        .map(parse)
                        .map(|b| GbbqStringTDX::from_bytes(&b))
                        .last()
         })
    }

    #[bench]
    fn gbbq_string(b: &mut Bencher) {
        b.iter(|| {
             data()[4..].chunks_exact(29)
                        .map(parse)
                        .map(|b| GbbqString::from_bytes(&b))
                        .last()
         })
    }

    #[bench]
    fn gbbq_str(b: &mut Bencher) {
        b.iter(|| {
             GbbqStr::iter(&mut data()[4..]).last();
         })
    }

    #[bench]
    fn gbbq_str_defined_in_crate(b: &mut Bencher) {
        b.iter(|| {
             Gbbq::iter(&mut data()[4..]).last();
         })
    }
}

#[cfg(feature = "tokio")]
#[cfg(test)]
mod gbbq_async {
    use super::*;

    #[cfg(feature = "bench-test")]
    #[bench]
    fn gbbq_string_defined_in_crate(b: &mut Bencher) {
        b.iter(|| {
             rt().block_on(async { GbbqsStringTDX::from_file("assets/gbbq").await.unwrap().last() })
         })
    }

    #[bench]
    fn gbbq_string(b: &mut Bencher) {
        b.iter(|| {
             rt().block_on(async { GbbqsString::from_file("assets/gbbq").await.unwrap().last() })
         })
    }

    #[bench]
    fn gbbq_str(b: &mut Bencher) {
        b.iter(|| {
             rt().block_on(async {
                     let mut vec = GbbqStr::read_from_file("assets/gbbq").await.unwrap();
                     GbbqStr::iter(&mut vec[4..]).last();
                 })
         })
    }

    #[bench]
    fn gbbq_str_defined_in_crate(b: &mut Bencher) {
        b.iter(|| {
             rt().block_on(async {
                     let mut vec = Gbbq::read_from_file("assets/gbbq").await.unwrap();
                     Gbbq::iter(&mut vec[4..]).last();
                 })
         })
    }
}

The point is that exactly the same code (same iterator, same yielded item) performs very differently depending on where it is defined. Below is a concise table derived from the results, in which you can see three pairs of counterparts.

[Image: table comparing the paired benchmark results]

I didn't change the way the benchmark works. It seems weird that Rust treats a bench file differently from the lib, since both should be optimized by default. But it now looks like Rust fully optimized the code in the lib and left the code in the benches dir less optimized.

As I stated earlier, I meant to benchmark the performance of a &str field vs a String field. So maybe it's not a good idea to put my old data structs in a bench file. Instead, I could put them in my crate in a dedicated mod behind #[cfg(feature = "bench-test")] or #[cfg(feature = "bench-old")], and invoke them from the bench code with feature flags.
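That feature-gating idea could look roughly like this in the library crate (the feature name and struct are hypothetical placeholders):

```rust
// In the library crate: keep the old data structs behind a feature gate
// so they are only compiled when benchmarking against them.
#[cfg(feature = "bench-test")]
pub mod bench_old {
    // Hypothetical old struct kept around only for comparison benches.
    pub struct GbbqStringOld {
        pub s: String,
    }
}

// cfg! lets us check at runtime which way the gate resolved.
fn main() {
    if cfg!(feature = "bench-test") {
        println!("old structs compiled in (cargo bench --features bench-test)");
    } else {
        println!("old structs compiled out");
    }
}
```

The bench file would then enable the feature via `cargo bench --features bench-test`, so the old structs never bloat a normal build.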

The benches/xx.rs file is compiled as a separate binary crate that links to your library crate. Because Rust crates are compiled separately, this can inhibit certain optimizations. Two ways to regain the lost optimization potential are:

  • Use the #[inline] attribute to enable cross-crate inlining. In this case, I would try adding the attribute to the GbbqString::from_bytes, parse, and/or data functions.
  • Enable link-time optimization (LTO) for the release and bench profiles in Cargo.toml. This enables whole-program optimization, but it can also make builds take a very long time for large projects.
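The cross-crate inlining suggestion can be sketched like this (the function below is an illustrative stand-in, not the real GbbqString::from_bytes):

```rust
// In the library crate: without #[inline], a non-generic function's body is
// not made available to downstream crates (such as the benches/xx.rs binary)
// for inlining unless LTO is enabled. Marking it #[inline] ships the body in
// the crate metadata so the bench crate's codegen can inline it.
#[inline]
pub fn parse_u32_le(bytes: &[u8]) -> u32 {
    u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]])
}

fn main() {
    let data = [0x2A, 0, 0, 0];
    assert_eq!(parse_u32_le(&data), 42);
    println!("parsed: {}", parse_u32_le(&data));
}
```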

Cool!

I turned LTO on, and the results look acceptable.

test gbbq::gbbq_str                           ... bench:  26,985,034 ns/iter (+/- 25,430,328)
test gbbq::gbbq_str_defined_in_crate          ... bench:  25,645,987 ns/iter (+/- 2,551,203)
test gbbq::gbbq_string                        ... bench:  28,644,978 ns/iter (+/- 1,935,411)
test gbbq::gbbq_string_defined_in_crate       ... bench:  28,553,573 ns/iter (+/- 2,462,146)

test gbbq_async::gbbq_str                     ... bench:  26,516,911 ns/iter (+/- 2,379,532)
test gbbq_async::gbbq_str_defined_in_crate    ... bench:  25,701,251 ns/iter (+/- 4,469,176)
test gbbq_async::gbbq_string                  ... bench:  29,531,931 ns/iter (+/- 1,735,944)
test gbbq_async::gbbq_string_defined_in_crate ... bench:  27,554,222 ns/iter (+/- 2,112,521)

Then I marked the parse fn #[inline], because it's CPU-intensive. What surprises me is that the code defined in the bench file can even be slightly faster than the code defined in the lib.

test gbbq::gbbq_str                           ... bench:  25,346,134 ns/iter (+/- 3,117,684)
test gbbq::gbbq_str_defined_in_crate          ... bench:  25,371,048 ns/iter (+/- 2,040,108)
test gbbq::gbbq_string                        ... bench:  28,372,116 ns/iter (+/- 1,456,295)
test gbbq::gbbq_string_defined_in_crate       ... bench:  28,447,700 ns/iter (+/- 9,565,707)

test gbbq_async::gbbq_str                     ... bench:  25,504,166 ns/iter (+/- 827,544)
test gbbq_async::gbbq_str_defined_in_crate    ... bench:  25,752,713 ns/iter (+/- 1,465,699)
test gbbq_async::gbbq_string                  ... bench:  28,876,222 ns/iter (+/- 1,363,418)
test gbbq_async::gbbq_string_defined_in_crate ... bench:  27,699,906 ns/iter (+/- 1,861,708)

Thanks for the advice. I appreciate it ~


One small puzzle remains: what's the difference between opt-level and lto? Forgive my unfamiliarity with C and with low-level details.

"opt-level" changes what optimization passes are enabled. At lower opt-levels, some optimizations are disabled because they take a long time, or because they interfere with debuggers.

"lto" changes when optimization happens. Normally, each crate is compiled and optimized individually, and then they are linked together. But with LTO turned on, the optimization happens at the end, during linking, so it can run on all the crates combined.
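Both knobs live in the Cargo profile settings; a sketch of how they might be set for benchmarks (note that cargo bench uses the bench profile, which inherits from release on recent Cargo versions):

```toml
[profile.bench]
opt-level = 3   # which optimization passes run inside each crate
lto = true      # optimize across crate boundaries at link time
```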


This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.