High performance operations

Hi everyone,

Today I ran some benchmarks to see how a few high-performance optimizations I learned in C fare in Rust. Here is a survey of my results.

General testing

Since my operations are very short, I run each operation 1000 times per Bencher.iter(). This means the reported times are much higher than the true execution times. I also had to collect the results somewhere to prevent the compiler from optimizing the benches away. All in all, the timings shouldn't be taken as absolute, but they are valid for comparing different implementations.
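As a sketch of that setup (my reconstruction, using std::hint::black_box in place of my result collecting; the name bench_body stands for one Bencher.iter() closure):

```rust
use std::hint::black_box;

pub fn halve_int_by_div(i: i32) -> i32 { i / 2 }

// One iteration body: run the short operation 1000 times and fold
// the results into an accumulator so the optimizer cannot delete
// the whole loop as dead code.
pub fn bench_body() -> i32 {
    let mut acc: i32 = 0;
    for i in 0..1000 {
        acc = acc.wrapping_add(halve_int_by_div(black_box(i)));
    }
    black_box(acc)
}
```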

Halving an i32

A common task is to halve an integer. I tested three ways of doing this:

  1. Division by 2
    pub fn halve_int_by_div(i: i32) -> i32 { i / 2 }
    
  2. Multiply by 0.5 (requires a cast to f64)
    pub fn halve_int_by_mult(i: i32) -> i32 { ((i as f64) * 0.5) as i32 }
    
  3. Shift the bits by 1
    pub fn halve_int_by_shift(i: i32) -> i32 { i >> 1 }
    

The timings are nearly as I expected:

test test::halve_int_div     ... bench:         216 ns/iter (+/- 11)
test test::halve_int_mult    ... bench:       1,000 ns/iter (+/- 125)
test test::halve_int_shift   ... bench:         123 ns/iter (+/- 6)

The shift is the fastest, but be aware that this hack only works for division by powers of two! The conversion between two's-complement integers and the IEEE 754 double format requires extra computation, which I guess is the reason for the heavy cost of the multiplication version.

Conclusion: Personally I had expected the compiler to notice a division by a power of two and optimize the / 2 into >> 1, but obviously it doesn't. I guess this is a matter for future compiler optimizations.

Division of f64

The next test I ran was about dividing floating point numbers:

  1. Direct division
    pub fn halve_dbl_by_div(d: f64) -> f64 { d / 2.0 }
    
  2. Multiplication by the reciprocal
    pub fn halve_dbl_by_mult(d: f64) -> f64 { d * 0.5 }
    

We all know that division is a more expensive operation than multiplication. The bench proved me wrong:

test test::halve_dbl_div     ... bench:         232 ns/iter (+/- 10)
test test::halve_dbl_mult    ... bench:         232 ns/iter (+/- 13)

I guess the compiler did some optimization here. I ran the same comparison for calculating a hundredth of a number:

test test::hundreth_dbl_div  ... bench:       2,658 ns/iter (+/- 150)
test test::hundreth_dbl_mult ... bench:         232 ns/iter (+/- 16)

Ah that's what I wanted to see.

Conclusion: Further tests have shown that division by a power of two is always as fast as multiplication by its reciprocal. If the divisor is not a power of two, multiplication by the reciprocal is about 10 times faster than the division. But be aware that computing the reciprocal takes time too.
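One more caveat (my own check, not part of the benches above): unlike 0.5, the reciprocal 0.01 has no exact binary representation, so d * 0.01 and d / 100.0 can disagree in the last bit for some inputs, while halving by multiplication is always exact:

```rust
// Halving by multiplication is exact, because 0.5 is a power of two
// and therefore an exact f64.
pub fn halve_dbl_by_div(d: f64) -> f64 { d / 2.0 }
pub fn halve_dbl_by_mult(d: f64) -> f64 { d * 0.5 }

// 1/100 is rounded when stored as an f64, so multiplying by it can
// give a result one ulp away from the correctly rounded division.
// Count how many integer inputs in 1..=upto disagree.
pub fn hundredth_mismatches(upto: u32) -> usize {
    (1..=upto)
        .map(|i| i as f64)
        .filter(|&d| d / 100.0 != d * 0.01)
        .count()
}
```

So the reciprocal trick trades a tiny amount of accuracy for speed whenever the divisor isn't a power of two.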

Test if i32 is odd

To test whether an integer is odd, I know two methods:

  1. Modulo 2
    pub fn is_odd_by_modulo(i: i32) -> bool { i % 2 != 0 }
    
  2. Masking the last bit
    pub fn is_odd_by_mask(i: i32) -> bool { i & 0x1 != 0 }
    

In this case the compiler does a good job:

test test::is_odd_mask       ... bench:         223 ns/iter (+/- 12)
test test::is_odd_modulo     ... bench:         223 ns/iter (+/- 10)

Both functions show the same execution time. But I think the first one is more readable.
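A quick correctness check (my addition): both variants also agree for negative inputs, because Rust's % is a remainder that takes the sign of the dividend, and & on two's-complement values keeps the low bit:

```rust
pub fn is_odd_by_modulo(i: i32) -> bool { i % 2 != 0 }
pub fn is_odd_by_mask(i: i32) -> bool { i & 0x1 != 0 }

// Note: for a negative odd i, `i % 2` evaluates to -1 (not 1),
// which is why the comparison must be `!= 0` rather than `== 1`.
```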

Ring counter

The last test I made was implementing a ring counter: a counter that counts up to a maximum and then restarts at 0.
My implementation counts from 0 up to 7:

pub const MAX_CNT: u32     = 0b0111; // 7
pub const MAX_CNT_INC: u32 = 0b1000; // 8

I tested three implementations:

  1. Checking the upper value with an if
    pub fn cnt_step_by_if(cnt: u32) -> u32 {
       if cnt >= MAX_CNT { 0 } 
       else { cnt + 1 }
    }
    
  2. Masking the lower bits (works only for MAX_CNT = 2^n - 1)
    pub fn cnt_step_by_mask(cnt: u32) -> u32 { (cnt + 1) & MAX_CNT }
    
  3. Using modulo MAX_CNT + 1
    pub fn cnt_step_by_modulo(cnt: u32) -> u32 { (cnt + 1) % MAX_CNT_INC }
    

Here are the timings:

test test::ring_cnt_if       ... bench:         469 ns/iter (+/- 25)
test test::ring_cnt_mask     ... bench:          56 ns/iter (+/- 3)
test test::ring_cnt_modulo   ... bench:          56 ns/iter (+/- 3)

Masking and modulo are 9 times faster than the if. But be aware that this is only the case if MAX_CNT is (2^n - 1). Here are the results with MAX_CNT = 10:

test test::ring_cnt_if       ... bench:         487 ns/iter (+/- 38)
test test::ring_cnt_modulo   ... bench:       3,041 ns/iter (+/- 139)

As you can see, the execution time of the if stayed nearly the same, but modulo with a non-power-of-two divisor takes 6 times longer than the if!

Conclusion: If possible, use a MAX_CNT of (2^n - 1) and the mask method. If that can't be done, or MAX_CNT is unknown at compile time, use the if method.
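For completeness, a small check (my addition) that all three variants agree as long as the counter stays in the range 0..=MAX_CNT:

```rust
pub const MAX_CNT: u32     = 0b0111; // 7
pub const MAX_CNT_INC: u32 = 0b1000; // 8

pub fn cnt_step_by_if(cnt: u32) -> u32 {
    if cnt >= MAX_CNT { 0 } else { cnt + 1 }
}
pub fn cnt_step_by_mask(cnt: u32) -> u32 { (cnt + 1) & MAX_CNT }
pub fn cnt_step_by_modulo(cnt: u32) -> u32 { (cnt + 1) % MAX_CNT_INC }

// Within 0..=MAX_CNT all three count 0, 1, ..., 7 and wrap back to 0.
```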

Summary

My tests have shown that a lot of execution time can be saved by exploiting the binary representation of numbers. The Rust compiler mostly does a good job at optimizing.

Do you have any further hacks like these? Or solutions that are faster than mine? Don't hesitate to post them!


Take a look at the assembly:

example::halve_int_by_div:
        mov     eax, edi
        shr     eax, 31
        lea     eax, [rax + rdi]
        sar     eax
        ret

example::halve_int_by_shift:
        sar     edi
        mov     eax, edi
        ret

For such tiny functions I think your benchmarking strategy is underpowered (Intel VTune seems OK for your purposes).


If you want to study some more tiny optimizations, take a look here; this comes from Rust's signed and unsigned integer semantics:

Rust code:

pub fn foo1(x: u32) -> u32 {
    (x * 2) / 2
}

pub fn foo2(x: u64) -> u64 {
    (x * 2) / 2
}

pub fn foo3(x: i32) -> i32 {
    (x * 2) / 2
}

pub fn foo4(x: i64) -> i64 {
    (x * 2) / 2
}

pub fn foo5(x: u32) -> u32 {
    (x * 3) / 3
}

pub fn foo6(x: u64) -> u64 {
    (x * 3) / 3
}

pub fn foo7(x: i32) -> i32 {
    (x * 3) / 3
}

pub fn foo8(x: i64) -> i64 {
    (x * 3) / 3
}

pub fn foo9(x: u32) -> u32 {
    unsafe { assume(x < 100_000); }
    (x * 3) / 3
}

pub fn foo10(x: u64) -> u64 {
    unsafe { assume(x < 100_000); }
    (x * 3) / 3
}

pub fn foo11(x: i32) -> i32 {
    unsafe { assume(x < 100_000 && x > -100_000); }
    (x * 3) / 3
}

pub fn foo12(x: i64) -> i64 {
    unsafe { assume(x < 100_000 && x > -100_000); }
    (x * 3) / 3
}


Rust asm:

example::foo1:
        and     edi, 2147483647
        mov     eax, edi
        ret

example::foo2:
        mov     al, 63
        bzhi    rax, rdi, rax
        ret

example::foo3:
        lea     eax, [rdi + rdi]
        sar     eax
        ret

example::foo4:
        lea     rax, [rdi + rdi]
        sar     rax
        ret

example::foo5:
        lea     ecx, [rdi + 2*rdi]
        mov     eax, 2863311531
        imul    rax, rcx
        shr     rax, 33
        ret

example::foo6:
        lea     rdx, [rdi + 2*rdi]
        movabs  rax, -6148914691236517205
        mulx    rax, rcx, rax
        shr     rax
        ret

example::foo7:
        lea     eax, [rdi + 2*rdi]
        cdqe
        imul    rax, rax, 1431655766
        mov     rcx, rax
        shr     rcx, 63
        shr     rax, 32
        add     eax, ecx
        ret

example::foo8:
        lea     rax, [rdi + 2*rdi]
        movabs  rcx, 6148914691236517206
        imul    rcx
        mov     rax, rdx
        shr     rax, 63
        lea     rax, [rax + rdx]
        ret

example::foo9:
        mov     eax, edi
        ret

example::foo10:
        mov     rax, rdi
        ret

example::foo11:
        lea     eax, [rdi + 2*rdi]
        cdqe
        imul    rax, rax, 1431655766
        mov     rcx, rax
        shr     rcx, 63
        shr     rax, 32
        add     eax, ecx
        ret

example::foo12:
        lea     rax, [rdi + 2*rdi]
        movabs  rcx, 6148914691236517206
        imul    rcx
        mov     rax, rdx
        shr     rax, 63
        lea     rax, [rax + rdx]
        ret

C code:

#include <stdint.h>
uint32_t foo1(const uint32_t x) {

    return (x * 2) / 2;
}

uint64_t foo2(const uint64_t x) {
    return (x * 2) / 2;
}

int32_t foo3(const int32_t x) {
    return (x * 2) / 2;
}

int64_t foo4(const int64_t x) {
    return (x * 2) / 2;
}

uint32_t foo5(const uint32_t x) {
    return (x * 3) / 3;
}

uint64_t foo6(const uint64_t x) {
    return (x * 3) / 3;
}

int32_t foo7(const int32_t x) {
    return (x * 3) / 3;
}

int64_t foo8(const int64_t x) {
    return (x * 3) / 3;
}


C asm:

foo1(unsigned int):
        mov     eax, edi
        and     eax, 2147483647
        ret
foo2(unsigned long):
        mov     rax, rdi
        btr     rax, 63
        ret
foo3(int):
        mov     eax, edi
        ret
foo4(long):
        mov     rax, rdi
        ret
foo5(unsigned int):
        lea     eax, [rdi+rdi*2]
        mov     edx, 2863311531
        imul    rax, rdx
        shr     rax, 33
        ret
foo6(unsigned long):
        lea     rdx, [rdi+rdi*2]
        movabs  rcx, -6148914691236517205
        mov     rax, rdx
        mul     rcx
        mov     rax, rdx
        shr     rax
        ret
foo7(int):
        mov     eax, edi
        ret
foo8(long):
        mov     rax, rdi
        ret

I don't know how much performance Rust leaves on the table here. But to regain the C performance in Rust, we might need yet another stdlib wrapper type like std::num::Wrapping that mimics C's integer semantics.
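A sketch (my addition) of why Rust can't match the C codegen for the signed cases like foo3: with Rust's defined wrap-around semantics (as made explicit by wrapping_mul), (x * 2) / 2 is observably different from x, whereas C declares signed overflow undefined behaviour and may therefore assume the two are equal and emit a plain move:

```rust
// Rust's wrap-around semantics, spelled out explicitly. In a release
// build, plain `x * 2` behaves the same way; in debug it panics.
pub fn foo3_wrapping(x: i32) -> i32 {
    x.wrapping_mul(2) / 2
}

// For x = i32::MAX, the multiplication wraps to -2, so the result
// is -1 rather than x. A C compiler may ignore this case entirely.
```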


The problem is that the shift function is wrong for negative values. There are a lot of situations in which the optimizer is better than humans, and when we think we are doing it better, it is important to double (triple and quadruple :yum:) check for errors and UBs.
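Concretely (my addition): Rust's / truncates toward zero, while >> on a signed integer is an arithmetic shift, which rounds toward negative infinity, so the two disagree for negative odd inputs:

```rust
// Division truncates toward zero; an arithmetic right shift floors,
// so for negative odd inputs the results differ by one.
pub fn halve_int_by_div(i: i32) -> i32 { i / 2 }
pub fn halve_int_by_shift(i: i32) -> i32 { i >> 1 }
```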

If you want to perform some comparisons between C/C++ compiler optimizations and what we can get in Rust, IMHO the best thing to do is to compare the assembly. Indeed, in this case the codegen is the same for both Rust and C, using both gcc and clang for the latter.


You are absolutely right. Changing the signatures to u32 resulted in the same execution time for / 2 and >> 1 :+1:.

I think this is a good example of why signed integers should only be used when negative values are required.
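The u32 versions I benchmarked (same bodies as before, just unsigned types):

```rust
// With an unsigned type there is no sign to fix up, so the compiler
// is free to turn / 2 into a single logical shift; both functions
// compile to the same code.
pub fn halve_uint_by_div(i: u32) -> u32 { i / 2 }
pub fn halve_uint_by_shift(i: u32) -> u32 { i >> 1 }
```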

I'm also enjoying the generated code for the last two functions:

example::cnt_step_by_mask:
        lea     eax, [rdi + 1]
        and     eax, 7
        ret

example::cnt_step_by_modulo:
        jmp     example::cnt_step_by_mask@PLT

:slight_smile:


I wanted to play a bit more with your tests.

For the ring counter test, I think the major issue is that ring_cnt_if works differently from the other two (which are just aliased by the compiler because their codegen is the same). Indeed, the behaviour changes when you pass values greater than MAX_CNT as the parameter, as you can see here.
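To make the difference concrete (my addition, using a hypothetical out-of-range input of 9): the if version resets to 0 as soon as the counter is at or above MAX_CNT, while mask and modulo keep counting modulo 8:

```rust
pub const MAX_CNT: u32     = 0b0111; // 7
pub const MAX_CNT_INC: u32 = 0b1000; // 8

pub fn cnt_step_by_if(cnt: u32) -> u32 {
    if cnt >= MAX_CNT { 0 } else { cnt + 1 }
}
pub fn cnt_step_by_mask(cnt: u32) -> u32 { (cnt + 1) & MAX_CNT }
pub fn cnt_step_by_modulo(cnt: u32) -> u32 { (cnt + 1) % MAX_CNT_INC }

// For cnt = 9: the if resets to 0, but (9 + 1) & 7 == (9 + 1) % 8 == 2.
```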

The division test is interesting, because using division instead of multiplication leads to slightly different results. Honestly, in this case I am not sure when one approach should be avoided (other than for speed). Maybe someone more expert in this field could give some suggestions.


LLVM is incredibly good at tiny little things like this. As other posts have mentioned, if you're not seeing that kind of thing happen, it's probably because the different versions are actually doing different things.

It'll even do things you've probably never thought about, like replacing x / 5_u32 with (((x as u64) * 3435973837) >> 34) as u32: Compiler Explorer
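You can verify that magic number by hand (my own check): 3435973837 is ceil(2^34 / 5), and the widening multiply-shift reproduces floor division for every u32 value:

```rust
// LLVM's strength reduction for x / 5_u32: a widening multiply by
// ceil(2^34 / 5) followed by a shift, avoiding the slow div instruction.
pub fn div5_magic(x: u32) -> u32 {
    (((x as u64) * 3435973837) >> 34) as u32
}
```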

Edit: Here's another quick demo of 4 different things that are all turned into the most efficient version by LLVM: Compiler Explorer

And reciprocals don't give exactly the same answer, in general, as @dodomorandi demonstrated, which is why LLVM doesn't always do this for division by constants (unless you specify arcp on the LLVM instruction, see LLVM Language Reference Manual — LLVM 18.0.0git documentation).

You might be interested in
