Fast `powi` function

I think there is a lot of room to make the `powi` function quicker, at least for exponents that fit in a `u8` or `i8`. When I benchmark the code below, it is significantly faster than `powi`. I know `powi` takes an `i32`, but honestly, who is raising a number to a power greater than 127 or less than -128?

trait ExtendedMath {
    fn powu8(self, n: u8) -> Self;
}

impl ExtendedMath for f64 {
    fn powu8(mut self, n: u8) -> Self {
        // Exponentiation by squaring: handle bit 0 up front, then square
        // once per higher bit and multiply in whenever that bit is set.
        let mut prod = if n & 1 == 0 { 1.0 } else { self };
        for bit in [2, 4, 8, 16, 32, 64, 128] {
            self *= self;
            if n & bit != 0 {
                prod *= self;
            }
        }
        prod
    }
}

I think there is room to improve this function further. It is not too crazy to exhaustively match every `u8` exponent to its optimal sequence of multiplications; see the sample below. There is also a good example of this here, fast_ipow.rs. It would be long, but I think it should execute quickly.

The only thing I am not sure of is how quick a match statement would be in this case. Does the exponent need to be compared against each match arm, or can the compiler do some memory-address magic (a jump table) to jump right to the matching `u8`?

Alternately, I think I could put all the optimal solutions in a static array of closures and index into the needed solution. If this is a better way, then I think the array should be `static` or maybe `const`. I hear you should go with `const` if you can, but I also hear `const` items get inlined everywhere, and I am a little scared of accidentally inlining a big array in a bunch of places I didn't mean to.

fn powu8(self, n: u8) -> Self {
    match n {
        ...
        3 => self * self * self,
        4 => {
            let n_2 = self * self;
            n_2 * n_2
        }
        ...
    }
}
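To make the closure-array idea concrete, here is a small-scale sketch covering only exponents 0 through 4 (`POW_SMALL` is a made-up name, not from any crate); non-capturing closures coerce to plain `fn` pointers, so the array has a simple element type:

```rust
// Small-scale sketch of the fn-pointer-array idea: one entry per
// exponent, indexed directly by the exponent value.
static POW_SMALL: [fn(f64) -> f64; 5] = [
    |_x| 1.0,                        // x^0
    |x| x,                           // x^1
    |x| x * x,                       // x^2
    |x| x * x * x,                   // x^3
    |x| { let x2 = x * x; x2 * x2 }, // x^4
];

fn main() {
    let x = 2.0_f64;
    // Bounds-checked indexing here; a full 256-entry version indexed
    // by a u8 could skip the check entirely.
    assert_eq!(POW_SMALL[4](x), 16.0);
    println!("2^4 = {}", POW_SMALL[4](x));
}
```

The full version would have 256 entries so any `u8` indexes in bounds.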

Thoughts?

Please share the details of how you're running the benchmark.


I use Criterion and wrap the inputs in `black_box`. I make sure to test trivial cases like 0, 1, and 2, because the compiler can sometimes optimize those away. I don't have the benchmark code on my personal computer, but the results are about 2-3 ns for `powu8` and 13-17 ns for `powi`. I know these durations are so short that there is a lot of noise, but I have also seen similar improvements in larger integrated benchmarks. Those integrated benches on large multi-dimensional polynomials went from 25 µs to 2 µs.

Here are some more details on the benchmarking. The benchmarks are a little slower on my personal computer. Benchmark code:

use std::hint::black_box;
use criterion::{Criterion, criterion_group, criterion_main};

fn fast_powu8(mut base: f64, exp: u8) -> f64 {
    let mut prod = if exp & 1 == 0 { 1.0 } else { base };
    for bit in [2, 4, 8, 16, 32, 64, 128] {
        base *= base;
        if exp & bit != 0 {
            prod *= base
        }
    }
    prod
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("fast 63", |b| b.iter(|| fast_powu8(black_box(3.16), black_box(63))));
    c.bench_function("std 63", |b| b.iter(|| black_box(3.16f64).powi(black_box(63))));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
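For completeness, Criterion benches run outside the default test harness, so the bench target has to be registered in the manifest; something like this (the file name `powi.rs` and the version number are just assumptions):

```toml
# Cargo.toml -- assuming the benchmark above lives in benches/powi.rs
[dev-dependencies]
criterion = "0.5"

[[bench]]
name = "powi"
harness = false
```

Then `cargo bench` builds and runs it in release mode.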

Results:

Benchmarking fast 63: Collecting 100 samples in estimated 5.0000 s (1.6B iterations)
fast 63                 time:   [3.0420 ns 3.0482 ns 3.0570 ns]
Found 12 outliers among 100 measurements (12.00%)
  12 (12.00%) high severe

Benchmarking std 63: Collecting 100 samples in estimated 5.0001 s (260M iterations)
std 63                  time:   [18.588 ns 18.631 ns 18.679 ns]
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  11 (11.00%) high severe

Well, if you look at the LLVM IR https://rust.godbolt.org/z/jzM3qWPYz, you'll see Rust just emits

tail call float @llvm.powi.f32.i32

aka https://llvm.org/docs/LangRef.html#llvm-powi-intrinsic.

So if that's not the best option, it seems like it would be an LLVM bug instead, or maybe a compiler-builtins one? But the latter seems to be using the same algorithm you mention: compiler-builtins/compiler-builtins/src/float/pow.rs at eba1a3f3d2824739f07fe434624b0d2fee917d89 · rust-lang/compiler-builtins · GitHub
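For anyone following along, that `pow.rs` is essentially plain exponentiation by squaring over the full `i32` range, with a reciprocal at the end for negative exponents. A rough sketch of the idea (not the exact source):

```rust
// Exponentiation by squaring for an i32 exponent, in the same spirit
// as compiler-builtins' pow.rs (a sketch, not the exact source).
fn powi_sketch(mut base: f64, exp: i32) -> f64 {
    let recip = exp < 0;
    let mut acc = 1.0;
    // Work on the magnitude; unsigned_abs avoids overflow at i32::MIN.
    let mut n = exp.unsigned_abs();
    while n > 0 {
        if n & 1 == 1 {
            acc *= base;
        }
        base *= base;
        n >>= 1;
    }
    if recip { acc.recip() } else { acc }
}

fn main() {
    assert_eq!(powi_sketch(2.0, 10), 1024.0);
    assert_eq!(powi_sketch(2.0, -2), 0.25);
    println!("ok");
}
```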

(I wonder if what you're actually asking for here is better LTO with wherever it's implemented, or better PGO optimizing of what the exponent actually is.)


Thank you, this is really interesting information. I copied the compiler-builtins powi function ("copied std" below) and got significantly faster results than what `f64::powi` provides from std; see below. I also changed my benchmark to iterate through several values of base and exponent. It is not quite as fast as my version ("loop" below) that uses a `u8` exponent, but I guess that is the cost of working with a larger signed integer type.

Benchmarking loop: Collecting 100 samples in estimated 5.0001 s (
loop                    time:   [30.874 ns 30.966 ns 31.075 ns]  
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

Benchmarking copied std: Collecting 100 samples in estimated 5.00
copied std              time:   [48.974 ns 49.264 ns 49.621 ns]  
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Benchmarking std: Collecting 100 samples in estimated 5.0006 s (2
std                     time:   [182.53 ns 184.65 ns 187.00 ns]  
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

I don't know what "LTO" or "PGO optimizing" is, but it seems weird that a hand-written version of powi would be so much faster than what you get from the standard library.

The hand-written version here is compiled from source, but std ships pre-compiled, so some optimizations might be missed.

LTO and PGO stand for Link-Time Optimization and Profile-Guided Optimization, respectively.


Is there a way to fix the LTO or PGO? It sounds like a Rust std issue, but is there a practical solution for making it faster, either by making a change in Rust's std or on the user side?
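One user-side experiment worth trying (a sketch, not a guaranteed fix): enable fat LTO in the release profile so the linker can optimize across crate boundaries:

```toml
# Cargo.toml -- fat LTO plus a single codegen unit gives the
# optimizer the widest view across crates.
[profile.release]
lto = "fat"
codegen-units = 1
```

On nightly, `cargo +nightly build --release -Z build-std --target <your-target>` additionally recompiles std from source, so `powi`'s implementation can be optimized together with your code instead of being taken pre-compiled.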

To put a bow on my original post, I benchmarked the array and match statement concepts I had above. Here are the results.

match                   time:   [1.2629 µs 1.2761 µs 1.2948 µs]
Found 18 outliers among 100 measurements (18.00%)
  18 (18.00%) high severe

static                  time:   [660.88 ns 681.71 ns 705.91 ns]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

const                   time:   [612.97 ns 615.55 ns 618.87 ns]
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

loop                    time:   [1.1511 µs 1.1662 µs 1.1867 µs]
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

builtin                 time:   [2.4178 µs 2.4303 µs 2.4430 µs]
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) high mild
  10 (10.00%) high severe

std                     time:   [6.1450 µs 6.1663 µs 6.1949 µs]
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

All of these evaluate an f64 to the power of an i8. The match benchmark has a match statement that matches on every i8. The const and static benchmarks have fixed-length arrays of closures (no captured variables, so they coerce to fn pointers, not impl Fn) and are indexed (unchecked) by `exp as u8 as usize`. The only difference between them is that one uses a static array and the other a const array. The loop benchmark is the same as in the initial post. The builtin is the one linked by @scottmcm here. And std is the standard library powi function for an f64.

The method for the match, const, and static versions is to calculate the power using the "shortest addition chain", see here. This doesn't actually use addition to calculate the powers; the "addition" happens in the exponents, since multiplying two already-computed powers adds their exponents by the law of exponents. I used macros so all the methods share the same addition chains.
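As a concrete instance, x^15 has the addition chain 1, 2, 3, 6, 12, 15, which needs five multiplications, one fewer than binary square-and-multiply uses for 15 (a sketch, not one of the benchmarked functions):

```rust
// x^15 via the addition chain 1, 2, 3, 6, 12, 15:
// five multiplications instead of six for square-and-multiply.
fn pow15(x: f64) -> f64 {
    let x2 = x * x;    // exponent 2
    let x3 = x2 * x;   // exponent 3 = 2 + 1
    let x6 = x3 * x3;  // exponent 6 = 3 + 3
    let x12 = x6 * x6; // exponent 12 = 6 + 6
    x12 * x3           // exponent 15 = 12 + 3
}

fn main() {
    assert_eq!(pow15(2.0), 32768.0);
    println!("ok");
}
```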

const

pub fn const_array_powi8(base: f64, exp: i8) -> f64 {
    const POW: [fn(f64) -> f64; 256] = pow_array_i8!();
    unsafe { POW.get_unchecked(exp as u8 as usize)(base) }
}

match

pub fn match_powu8(n: f64, exp: u8) -> f64 {
    match exp {
        0 => pow!(n^0),
        1 => pow!(n^1),
        ...
        255 => pow!(n^255),
    }
}

The macros

macro_rules! pow {
    ($n:ident^0) => {1.0};
    ($n:ident^1) => {$n};
    ($n:ident^2) => {$n * $n};
    ($n:ident^3) => {$n * $n * $n};
    ($n:ident^4) => {{
        let n2 = $n * $n;
        n2 * n2
    }};
    ...
    ($n:ident^254) => {{
        let n127 = pow!($n^127);
        n127 * n127
    }};
    ($n:ident^255) => {{
        let n85 = pow!($n^85);
        n85 * n85 * n85
    }};
}
macro_rules! pow_array_i8 {
    ()=> {
        [
            |_n| pow!(n^0),
            |n| pow!(n^1),
            ...
            |n| pow!(n^2).recip(),
            |n| pow!(n^1).recip(),
        ]
    }
}