How to optimize this function

I was hoping someone who understands assembly could help me understand the performance of the following code:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=5362f98d5eba07a9a2a9c9b6f144a6a1

I'm trying to fiddle with and/or optimize the function foo. When running, 99.9% of the time foo is called with N=4 and is fed a single byte argument (the second byte is unused and can be anything). The other 0.1% of the time it is called with N=8 and both byte arguments are used.

Running this as is, I get output of approximately:

N=4    699.654084ms
N=8    1.385663344s

However, if I comment out the println statement on line 42, I get:

N=4    31.783319ms
N=8    1.334048695s

I don't understand how N=8 is unchanged, but N=4 gets a 20x speed boost even though the function in question is never called during runtime.

In my actual code, both the alert and get_value functions are more complex and have side effects.

Benchmarking with a simple for loop and a timer is very prone to giving nonsense results.

The optimizer is able to hoist things out of loops and reorder code. Speed is not a side effect that the optimizer has to preserve, so it's free to move slow code before or after your Instant::now()! It's also allowed to merge multiple loops into one, etc.

You should use Bencher or criterion. They insert appropriate "black boxes" to stop the optimizer from deleting "useless" code and moving things outside of the benchmark loops. Proper benchmark frameworks also run tests for long enough for the results to matter. One-shot runs may just suffer from a cold cache, or end up measuring how long your CPU takes to ramp up from a lower power state, or how long it can sustain a turbo boost.
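To illustrate the "black box" idea on stable Rust, here is a minimal sketch using std::hint::black_box (stable since Rust 1.66). The checksum function and the time_it helper are hypothetical stand-ins, not the code from the playground link. black_box is only a best-effort hint to the optimizer, so treat this as a rough sanity check rather than a replacement for a proper benchmark framework.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the function being measured
// (not the `foo` from the playground link).
fn checksum(bytes: &[u8]) -> u32 {
    bytes
        .iter()
        .fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(u32::from(b)))
}

// Hypothetical helper: calls `f` `iters` times and returns the elapsed time.
// The input and the result both pass through `black_box`, which asks the
// optimizer not to assume anything about them, so the call is less likely
// to be hoisted out of the loop or deleted as unused.
fn time_it<F: Fn(&[u8]) -> u32>(f: F, input: &[u8], iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f(black_box(input)));
    }
    start.elapsed()
}

fn main() {
    let input = [1u8, 2, 3, 4];
    let elapsed = time_it(checksum, &input, 1_000_000);
    println!("1_000_000 iterations: {:?}", elapsed);
}
```

Even with black_box in place, a single timed loop still has the warm-up and power-state problems described above, which is what the repeated sampling in a framework like criterion is meant to address.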

I get that for loops aren't very reliable. Bencher requires nightly, correct? I'm only running stable at the moment. I'll check out criterion.

I've gotten very consistent results with many runs across multiple machines, although I take the point of the optimizer being able to reorder things outside the loop making it all suspect.

I could just split this into two functions instead of using const generics, but I hate repeating myself, so I was looking for a way to keep a single definition, but handle both paths efficiently and ergonomically.
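For reference, this is roughly the shape I mean: a simplified sketch, not my real code. The body of foo here is made up for illustration. Because N is a const parameter, the `if N == 8` test is a compile-time constant in each instantiation, so each of foo::<4> and foo::<8> should normally end up containing only its own path after optimization.

```rust
// Simplified, hypothetical version of a single const-generic definition.
// The real function does more work; only the branching shape matters here.
fn foo<const N: usize>(a: u8, b: u8) -> u32 {
    if N == 8 {
        // Rare path: both byte arguments are used.
        (u32::from(a) << 8) | u32::from(b)
    } else {
        // Common path: the second byte is ignored and can be anything.
        u32::from(a)
    }
}

fn main() {
    // Common case (N = 4): only the first byte matters.
    println!("{}", foo::<4>(0x12, 0xFF));
    // Rare case (N = 8): both bytes are combined.
    println!("{}", foo::<8>(0x12, 0x34));
}
```

This keeps one definition for both callers, although whether the two instantiations get inlined and optimized the same way is exactly what I can't tell from the timings.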
