A weird benchmark result: identical code performs differently depending on where it is defined

test gbbq::gbbq_str                           ... bench:  62,806,212 ns/iter (+/- 4,198,622)
test gbbq::gbbq_str_defined_in_crate          ... bench:  26,883,857 ns/iter (+/- 1,706,824)
test gbbq::gbbq_string                        ... bench:  65,836,727 ns/iter (+/- 3,355,627)
test gbbq::gbbq_string_defined_in_crate       ... bench:  65,671,881 ns/iter (+/- 4,196,998)

test gbbq_async::gbbq_str                     ... bench:  63,689,579 ns/iter (+/- 3,565,807)
test gbbq_async::gbbq_str_defined_in_crate    ... bench:  27,713,700 ns/iter (+/- 2,540,403)
test gbbq_async::gbbq_string                  ... bench:  67,637,128 ns/iter (+/- 3,493,143)
test gbbq_async::gbbq_string_defined_in_crate ... bench:  28,840,113 ns/iter (+/- 2,735,867)

Let me explain some context.

  1. There are two mods in the bench file, gbbq and gbbq_async. The read fn in the gbbq mod uses std::fs::read, while the one in the gbbq_async mod uses tokio::fs::read.
  2. I wrote exactly the same code in my crate and in the benches/xx.rs file, as you can tell from the suffixes: the fns with the defined_in_crate suffix use two iterators defined in my crate, and those without the suffix use the copies in the bench file. The iterators yield identical items except for one field of the yielded struct, because I want to benchmark a &str field against a String field. The yielded structs look like this:
pub struct GbbqStr<'a> {
    s: &'a str,
    ..
}

pub struct GbbqString {
    s: String,
    ..
}
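To make the comparison concrete, here is a minimal standalone sketch of the difference between the two variants (the names RecStr/RecString and the parse functions are hypothetical stand-ins, not the real GbbqStr/GbbqString): the borrowed field is just a pointer and length into the input buffer, while the owned field requires a heap allocation and a copy per record.

```rust
// Hypothetical minimal versions of the two record types.
struct RecStr<'a> {
    s: &'a str, // zero-copy: a pointer + length into the input buffer
}

struct RecString {
    s: String, // owned: each record performs a heap allocation + memcpy
}

fn parse_borrowed(buf: &str) -> RecStr<'_> {
    RecStr { s: buf }
}

fn parse_owned(buf: &str) -> RecString {
    RecString { s: buf.to_owned() }
}

fn main() {
    let buf = String::from("some record bytes");
    let b = parse_borrowed(&buf);
    let o = parse_owned(&buf);
    // Both variants see the same data; only ownership differs.
    assert_eq!(b.s, o.s.as_str());
    println!("borrowed and owned agree: {}", b.s);
}
```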

Here is the bench code if you want to have a look.

mod gbbq {
    use super::*;

    #[cfg(feature = "bench-test")]
    #[bench]
    fn gbbq_string_defined_in_crate(b: &mut Bencher) {
        b.iter(|| {
             data()[4..].chunks_exact(29)
                        .map(parse)
                        .map(|b| GbbqStringTDX::from_bytes(&b))
                        .last()
         })
    }

    #[bench]
    fn gbbq_string(b: &mut Bencher) {
        b.iter(|| {
             data()[4..].chunks_exact(29)
                        .map(parse)
                        .map(|b| GbbqString::from_bytes(&b))
                        .last()
         })
    }

    #[bench]
    fn gbbq_str(b: &mut Bencher) {
        b.iter(|| {
             GbbqStr::iter(&mut data()[4..]).last();
         })
    }

    #[bench]
    fn gbbq_str_defined_in_crate(b: &mut Bencher) {
        b.iter(|| {
             Gbbq::iter(&mut data()[4..]).last();
         })
    }
}

#[cfg(feature = "tokio")]
#[cfg(test)]
mod gbbq_async {
    use super::*;

    #[cfg(feature = "bench-test")]
    #[bench]
    fn gbbq_string_defined_in_crate(b: &mut Bencher) {
        b.iter(|| {
             rt().block_on(async { GbbqsStringTDX::from_file("assets/gbbq").await.unwrap().last() })
         })
    }

    #[bench]
    fn gbbq_string(b: &mut Bencher) {
        b.iter(|| {
             rt().block_on(async { GbbqsString::from_file("assets/gbbq").await.unwrap().last() })
         })
    }

    #[bench]
    fn gbbq_str(b: &mut Bencher) {
        b.iter(|| {
             rt().block_on(async {
                     let mut vec = GbbqStr::read_from_file("assets/gbbq").await.unwrap();
                     GbbqStr::iter(&mut vec[4..]).last();
                 })
         })
    }

    #[bench]
    fn gbbq_str_defined_in_crate(b: &mut Bencher) {
        b.iter(|| {
             rt().block_on(async {
                     let mut vec = Gbbq::read_from_file("assets/gbbq").await.unwrap();
                     Gbbq::iter(&mut vec[4..]).last();
                 })
         })
    }
}

The point is that exactly the same code (same iterator, same yielded item) performs very differently depending on where it is defined. Below is a concise table derived from the results, in which you can see three pairs of counterparts.

[Image: table comparing the paired benchmark results]

I didn't change the way the benchmark works. It seems weird that Rust treats a bench file differently from the lib, since both should be optimized by default. But it now looks like Rust fully optimized the code in the lib and left the code in the benches dir less optimized.

As I stated earlier, I meant to benchmark the performance of a &str field vs a String field. So maybe it's not a good idea to put my old data structs in a bench file. Instead, I could put them in my crate in a dedicated mod behind #[cfg(feature = "bench-test")] or #[cfg(feature = "bench-old")], and invoke them from the bench code with feature flags.
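That feature-gating idea could look roughly like this in the library crate (the feature name and struct are hypothetical placeholders):

```rust
// In the library crate: keep the old data structs behind a feature gate
// so they are only compiled when benchmarking against them.
#[cfg(feature = "bench-test")]
pub mod bench_old {
    // Hypothetical old struct kept around only for comparison benches.
    pub struct GbbqStringOld {
        pub s: String,
    }
}

// cfg! lets us check at runtime which way the gate resolved.
fn main() {
    if cfg!(feature = "bench-test") {
        println!("old structs compiled in (cargo bench --features bench-test)");
    } else {
        println!("old structs compiled out");
    }
}
```

The bench file would then enable the feature via `cargo bench --features bench-test`, so the old structs never bloat a normal build.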

The benches/xx.rs file is compiled as a separate binary crate that links to your library crate. Because Rust crates are compiled separately, this can inhibit certain optimizations. Two ways to regain the lost optimization potential are:

  • Use the #[inline] attribute to enable cross-crate inlining. In this case, I would try adding the attribute to the GbbqString::from_bytes, parse, and/or data functions.
  • Enable link-time optimization (LTO) for the release and bench profiles in Cargo.toml. This enables whole-program optimization, but it can also make builds take a very long time for large projects.
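The cross-crate inlining suggestion can be sketched like this (the function below is an illustrative stand-in, not the real GbbqString::from_bytes):

```rust
// In the library crate: without #[inline], a non-generic function's body is
// not made available to downstream crates (such as the benches/xx.rs binary)
// for inlining unless LTO is enabled. Marking it #[inline] ships the body in
// the crate metadata so the bench crate's codegen can inline it.
#[inline]
pub fn parse_u32_le(bytes: &[u8]) -> u32 {
    u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]])
}

fn main() {
    let data = [0x2A, 0, 0, 0];
    assert_eq!(parse_u32_le(&data), 42);
    println!("parsed: {}", parse_u32_le(&data));
}
```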

Cool!

I turned LTO on, and the results look acceptable.

test gbbq::gbbq_str                           ... bench:  26,985,034 ns/iter (+/- 25,430,328)
test gbbq::gbbq_str_defined_in_crate          ... bench:  25,645,987 ns/iter (+/- 2,551,203)
test gbbq::gbbq_string                        ... bench:  28,644,978 ns/iter (+/- 1,935,411)
test gbbq::gbbq_string_defined_in_crate       ... bench:  28,553,573 ns/iter (+/- 2,462,146)

test gbbq_async::gbbq_str                     ... bench:  26,516,911 ns/iter (+/- 2,379,532)
test gbbq_async::gbbq_str_defined_in_crate    ... bench:  25,701,251 ns/iter (+/- 4,469,176)
test gbbq_async::gbbq_string                  ... bench:  29,531,931 ns/iter (+/- 1,735,944)
test gbbq_async::gbbq_string_defined_in_crate ... bench:  27,554,222 ns/iter (+/- 2,112,521)

Then I marked the parse fn #[inline], because it's CPU-intensive. What surprises me is that the code defined in the bench file can even be slightly faster than the code defined in the lib.

test gbbq::gbbq_str                           ... bench:  25,346,134 ns/iter (+/- 3,117,684)
test gbbq::gbbq_str_defined_in_crate          ... bench:  25,371,048 ns/iter (+/- 2,040,108)
test gbbq::gbbq_string                        ... bench:  28,372,116 ns/iter (+/- 1,456,295)
test gbbq::gbbq_string_defined_in_crate       ... bench:  28,447,700 ns/iter (+/- 9,565,707)

test gbbq_async::gbbq_str                     ... bench:  25,504,166 ns/iter (+/- 827,544)
test gbbq_async::gbbq_str_defined_in_crate    ... bench:  25,752,713 ns/iter (+/- 1,465,699)
test gbbq_async::gbbq_string                  ... bench:  28,876,222 ns/iter (+/- 1,363,418)
test gbbq_async::gbbq_string_defined_in_crate ... bench:  27,699,906 ns/iter (+/- 1,861,708)

Thanks for the advice. I appreciate it ~


One small puzzle remains: what's the difference between opt-level and lto? Forgive my unfamiliarity with C and with low-level details.

"opt-level" changes what optimization passes are enabled. At lower opt-levels, some optimizations are disabled because they take a long time, or because they interfere with debuggers.

"lto" changes when optimization happens. Normally, each crate is compiled and optimized individually, and then they are linked together. But with LTO turned on, the optimization happens at the end, during linking, so it can run on all the crates combined.
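Both knobs live in the Cargo profile settings; a sketch of how they might be set for benchmarks (note that cargo bench uses the bench profile, which inherits from release on recent Cargo versions):

```toml
[profile.bench]
opt-level = 3   # which optimization passes run inside each crate
lto = true      # optimize across crate boundaries at link time
```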


This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.