Hi everyone,
today I did some benchmark testing to see how some high performance optimizations I learned in C work in Rust. Here I will give you a survey of my results.
General testing
Since my operations are very short, I run the operations 1000 times in each Bencher.iter(). This means that the resulting times are much higher than the true execution times. Further on I had to do some collectioning of results to prevent the compiler from optimizing the benches away. All in all the timings shouldn't be take as absolute but they are valid for comparing different implementations.
Halve of an i32
A common task is to calculate the half of an integer. I tested three ways of doing this:
- Division by 2
pub fn halve_int_by_div(i: i32) -> i32 { i / 2 }
- Multiply with 0.5 (requires cast to f64)
pub fn halve_int_by_mult(i: i32) -> i32 { ((i as f64) * 0.5) as i32 }
- Shift the bits by 1
pub fn halve_int_by_shift(i: i32) -> i32 { i >> 1 }
The timings are nearly as I espected:
test test::halve_int_div ... bench: 216 ns/iter (+/- 11)
test test::halve_int_mult ... bench: 1,000 ns/iter (+/- 125)
test test::halve_int_shift ... bench: 123 ns/iter (+/- 6)
The shift is the fastest way but one should be aware that this hack only works for powers of two! The casting between binary and IEEE-format-double requires additional calculation which I guess is the reason for the heavy cost of multiplication.
Conclution: Personally I had thought that the compiler notices a division by a power of two and optimizes the / 2 to >> 1 but obviously it doesn't. I guess this is a matter of future compiler optimizations.
Division of f64
The next test I run was about dividing floating point numbers:
- Direct division
pub fn halve_dbl_by_div(d: f64) -> f64 { d / 2.0 }
- Division by reciprocal
pub fn halve_dbl_by_mult(d: f64) -> f64 { d * 0.5 }
We all know that dividing is a more expensive task than multiplying. The bench proved me wrong:
test test::halve_dbl_div ... bench: 232 ns/iter (+/- 10)
test test::halve_dbl_mult ... bench: 232 ns/iter (+/- 13)
I guess here the compiler did some optimization. I made the same operation for calculating a hundreth of a number:
test test::hundreth_dbl_div ... bench: 2,658 ns/iter (+/- 150)
test test::hundreth_dbl_mult ... bench: 232 ns/iter (+/- 16)
Ah that's what I wanted to see.
Conclution: Further test have shown that divition by a power of 2 is always as fast as the multiplication with its reciprocal. If the divisor is not a power of two multiplication with the reciprocal is about 10 times faster than the division. But be aware that the calculation of reciprocals takes time too.
Test if i32 is odd
For testing if an integer is odd I know two methods:
- Modulo 2
pub fn is_odd_by_modulo(i: i32) -> bool { i % 2 != 0 }
- Masking the last bit
pub fn is_odd_by_mask(i: i32) -> bool { i & 0x1 != 0 }
The compiler does in this case a good job:
test test::is_odd_mask ... bench: 223 ns/iter (+/- 12)
test test::is_odd_modulo ... bench: 223 ns/iter (+/- 10)
Both functions show the same execution time. But I think the first one is more readable
Ring counter
The last test I made is implementing a ring counter. This is a counter that counts up to a maximum and then restarts at 0.
My implementation counts from 0 up to 7:
pub const MAX_CNT: u32 = 0b0111; // 7
pub const MAX_CNT_INC: u32 = 0b1000; // 8
I tested three implementations:
- Checking the upper value with an if
pub fn cnt_step_by_if(cnt: u32) -> u32 { if cnt >= MAX_CNT { 0 } else { cnt + 1 } }
- Masking the lower bytes (works only for MAX_CNT = 2^n - 1)
pub fn cnt_step_by_mask(cnt: u32) -> u32 { (cnt + 1) & MAX_CNT }
- Using modulo MAX_CNT + 1
pub fn cnt_step_by_modulo(cnt: u32) -> u32 { (cnt + 1) % MAX_CNT_INC }
Here the timings:
test test::ring_cnt_if ... bench: 469 ns/iter (+/- 25)
test test::ring_cnt_mask ... bench: 56 ns/iter (+/- 3)
test test::ring_cnt_modulo ... bench: 56 ns/iter (+/- 3)
Masking and modulo are 9 times faster than if. But be aware this is only the case if MAX_CNT is (2^n -1). Following the results with MAX_CNT = 10;
test test::ring_cnt_if ... bench: 487 ns/iter (+/- 38)
test test::ring_cnt_modulo ... bench: 3,041 ns/iter (+/- 139)
As you can see the execution time of if stayed nearly the same but modulo without a power of two taskes 6 times longer than the if!
Conclution: If it is possible use a MAX_CNT of (2^n - 1) and use the mask-method. If this can't be done or the MAX_CNT is unknown at compile time use the if-method.
Summary
My tests have shown that a lot of execution time can be saved by using the binary-system. The Rust compiler does mostly a good job in optimizing.
Have you any further hacks like these? Or have you solutions that are faster than mine? Don't hesitate and post them!