One useless line increasing throughput by 50%!?

I am developing a fairly standard buffered binary scanner for a custom format.

During development, I experimented with different approaches to optimize for throughput while skipping chunks.

I found one particular implementation was beating the others by a large margin.

So I polished it a bit. But then the performance regressed when I removed a code path that was not even being executed in the tests.

I was able to pin it down to a line that was allocating a Vec through the vec![] macro.

This line was originally in fn skip(), which is called from fn next(). But I found that moving it anywhere within the Scanner impl was enough to keep the throughput gain; it now sits in the constructor, where it does nothing and is probably optimized away.

impl<R: BufRead> Scanner<R> {
    pub fn new(reader: R) -> Self {
        // Remove this line and throughput tanks! 
        let _ = vec![0;0];

        Scanner {
            reader,
            buffer: vec![0; BUFSIZE],
            bufpos: 0,
            buflen: 0,
            header: None,
            hasval: false,
        }
    }
}
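For context, here is a minimal self-contained sketch of the struct the constructor above implies. The field types, the header payload, and the BUFSIZE value are my guesses, not the original definitions:

```rust
use std::io::{BufRead, Cursor};

const BUFSIZE: usize = 64 * 1024; // assumed; the real value isn't shown

// Field types inferred from the constructor; `header`'s payload is a guess.
struct Scanner<R: BufRead> {
    reader: R,
    buffer: Vec<u8>,
    bufpos: usize,
    buflen: usize,
    header: Option<Vec<u8>>,
    hasval: bool,
}

impl<R: BufRead> Scanner<R> {
    pub fn new(reader: R) -> Self {
        Scanner {
            reader,
            buffer: vec![0; BUFSIZE],
            bufpos: 0,
            buflen: 0,
            header: None,
            hasval: false,
        }
    }
}

fn main() {
    // Any BufRead works; Cursor over an empty buffer is the simplest.
    let s = Scanner::new(Cursor::new(Vec::<u8>::new()));
    assert_eq!(s.buffer.len(), BUFSIZE);
}
```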

Despite the line clearly doing nothing, removing it takes us from…

jon@jonmbpp fcodec λ cargo run --release --bin bench_bin_scan /tmp/fifo.fifob
    Finished `release` profile [optimized] target(s) in 0.00s
     Running `target/release/bench_bin_scan /tmp/fifo.fifob`
file: /tmp/fifo.fifob
bytes: 429496729
rows: 26843546
rows/s: 354419760.06
time: 0.076s
throughput: 5.28 GiB/s (5.67 GB/s)

to…

jon@jonmbpp fcodec λ cargo run --release --bin bench_bin_scan /tmp/fifo.fifob
    Finished `release` profile [optimized] target(s) in 0.00s
     Running `target/release/bench_bin_scan /tmp/fifo.fifob`
file: /tmp/fifo.fifob
bytes: 429496729
rows: 26843546
rows/s: 225853748.59
time: 0.119s
throughput: 3.37 GiB/s (3.61 GB/s)

I know how sensitive high-performance code is, and that there's a myriad of things that could impact the bottom line, but this is illogical to me.

1 Like

I'm as puzzled as you, that seems odd!

Out of curiosity, does the behavior change between using any of the 3 branches of the vec! macro to build the empty vector?
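For reference, these are the three expansion arms of the vec! macro I mean:

```rust
fn main() {
    let a: Vec<u8> = vec![];    // arm 1: no arguments -> Vec::new()
    let b = vec![0u8; 0];       // arm 2: [elem; n] -> from_elem(elem, n)
    let c = vec![1u8, 2, 3];    // arm 3: element list -> boxed slice -> Vec
    assert!(a.is_empty());
    assert!(b.is_empty());
    assert_eq!(c, [1, 2, 3]);
}
```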

1 Like

Can you provide us with an MCVE?

Reminder that a zero-size vector doesn't even allocate, so that line is doing nothing at all: https://rust.godbolt.org/z/TPd3Wv3eP
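You can also check this at runtime: a zero-length vec! reports zero capacity, meaning no heap allocation took place:

```rust
fn main() {
    let v: Vec<u8> = vec![0; 0];
    // from_elem with n == 0 never touches the allocator
    assert_eq!(v.capacity(), 0);
}
```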

What you might actually be doing is just affecting the inlining heuristics, so you might want instead to try #[inline] or #[inline(never)] on that function. And LTO if you're not already.
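A sketch of the attribute syntax, using a stand-in type rather than your Scanner:

```rust
// Stand-in type just to show where the attribute goes; not the original code.
struct Counter {
    n: u64,
}

impl Counter {
    #[inline(always)] // swap for #[inline(never)] to test the other direction
    fn next(&mut self) -> u64 {
        self.n += 1;
        self.n
    }
}

fn main() {
    let mut c = Counter { n: 0 };
    assert_eq!(c.next(), 1);
    assert_eq!(c.next(), 2);
}
```

LTO goes in Cargo.toml, under `[profile.release]` with `lto = true` (or `lto = "fat"`).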

But also it's entirely possible for exactly the same code to have a 50% performance difference depending whether ASLR puts it on an extra-aligned address or not, so you might also just be testing something too small to be meaningful.

11 Likes

Tested again hundreds of times, just to make sure I'm not crazy. The same happens on both the nightly and stable toolchains.

let _: Vec<u8> = vec![]; // same as nothing 
let _ = vec![0;0]; // changes performance 
let _ = vec![0u8;0]; // same as nothing 
let _ = vec![0u8, 0u8, 0u8]; // same as nothing 

What about i32/i64/u32/u64?

When something is illogical, that means nothing you tell us about the results will be useful. We need the exact code that reproduces it for you, or we can't help.

u8 is the slow one, anything beyond it yields faster throughput, interesting!

I'm trying to provide that. It's not easy to isolate.

How many Scanners are created during a test run?

Can you run it under perf record, then annotate the hotspots in perf report? There might be some interesting difference in the assembly between the two versions.
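Roughly like this, using the invocation from your benchmark output above (paths are yours, adjust as needed):

```shell
# Record with call-graph info while running the benchmark binary.
perf record -g -- ./target/release/bench_bin_scan /tmp/fifo.fifob
# Browse hotspots, then drill into the per-instruction view of a symbol.
perf report
perf annotate
```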

A single Scanner, but I guess this does not matter anymore, see below.

During the process of synthesizing an example, I ran the benchmark as I was stripping the surrounding code to the bare essentials. What I found is that by changing other files within the program I was able to cancel the throughput gain "afforded" by the let _ = vec![0;0] line.

So, looking in the profiler as @cuviper suggested, I noticed that the fast version had inlined fn next() into the runner, recording only fn skip(), while the slow version had not inlined fn next() and recorded both functions. Adding the dummy Vec was what nudged the compiler into inlining that function automatically. Forcing #[inline(never)] on fn next() tanks the performance, and forcing #[inline(always)] recovers most of it regardless of any other code. The generated assembly still varies because the layout is not exactly the same, but that's another problem.

7 Likes

This may be an edge case caused by the layout or alignment of code within the executable. Caches and branch predictors key on absolute addresses, and sometimes shifting code by a few bytes produces a different caching pattern.

14 Likes