Help comparing Rust vs Julia speed

Hi all,

I got into a discussion with some actuaries about different implementations of a simple actuarial modeling problem. We each implemented a simple 10-period net present value calculation. Here is my implementation:

pub fn npv(mortality_rates: &Vec<f64>, lapse_rates: &Vec<f64>, interest_rate: f64, sum_assured: f64, premium: f64, init_pols: f64, term: Option<usize>) -> f64 {

    let term = term.unwrap_or_else(|| mortality_rates.len());
    let mut result = 0.0;
    let mut inforce = init_pols;
    let v: f64 = 1.0 / (1.0 + interest_rate);

    for (t, (q, w)) in mortality_rates.iter().zip(lapse_rates).enumerate() {
        let no_deaths = if t < term {inforce * q} else {0.0};
        let no_lapses = if t < term {inforce * w} else {0.0};
        let premiums = inforce * premium;
        let claims = no_deaths * sum_assured;
        let net_cashflow = premiums - claims;
        result += net_cashflow * v.powi(t as i32);
        inforce = inforce - no_deaths - no_lapses;
    }

    result
}

I benchmarked it and it ran in 364.35 ns.

Another person re-implemented my version in Julia:

function npv3(q,w,P,S,r,term=nothing)
	term = isnothing(term) ? length(q) + 1 : term + 1
	inforce = 1.0
	result = 0.0
	
	for (t,(q,w)) in enumerate(zip(q,w))
		deaths = t < term ? inforce * q : 0.0
		lapses = t < term ? inforce * w : 0.0
		premiums = inforce * P
		claims = deaths * S
		ncf = premiums - claims
		result += ncf * (1 / ( 1 + r)) ^ (t-1)
		inforce = inforce - deaths - lapses
	end
	
  	return result
end

This user benchmarked his version and found that it ran in 135 ns. The point isn't really to decide between Rust and Julia, but the difference is interesting. He is using an M1 Mac where Julia runs under Rosetta (I'm on Windows 10), so it's hard to compare these two implementations in a meaningful way.

My question is: is there anything obvious that I am missing in the Rust implementation that could improve it? If anyone has an M1 Mac and could benchmark it on their machine, that would be really interesting.

A small change that really ought not to matter, but sometimes does:

-pub fn npv(mortality_rates: &Vec<f64>, lapse_rates: &Vec<f64>, interest_rate: f64, sum_assured: f64, premium: f64, init_pols: f64, term: Option<usize>) -> f64 {
+pub fn npv(mortality_rates: &[f64], lapse_rates: &[f64], interest_rate: f64, sum_assured: f64, premium: f64, init_pols: f64, term: Option<usize>) -> f64 {

And even if it doesn't improve speed, it's better style -- it's more general as it allows passing subslices instead of requiring full vectors, without really any cost (not even syntax at the call site, as &Vec<T> will deref-coerce to &[T] automatically).
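A minimal sketch of the coercion (with a hypothetical `sum` function standing in for `npv`):

```rust
// A slice parameter accepts whole vectors and subslices alike.
fn sum(xs: &[f64]) -> f64 {
    xs.iter().sum()
}

fn main() {
    let v = vec![1.0, 2.0, 3.0, 4.0];
    assert_eq!(sum(&v), 10.0);     // &Vec<f64> deref-coerces to &[f64]
    assert_eq!(sum(&v[..2]), 3.0); // subslices need no extra ceremony
}
```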

Can you share how you got the measurement you did? Anything below about 16ms is extremely sensitive to external stuff, and thus rather hard to benchmark reliably. Obligatory mention of Criterion.rs.

6 Likes

Thanks @scottmcm.

I actually noticed the &[f64] change after I posted, but it didn't make any impact on performance. I used Criterion for the benchmark:

#[macro_use]
extern crate criterion;
use criterion::Criterion;

use npv::npv;


fn add_benchmark(c: &mut Criterion) {
    let mortality_rates = vec![0.001, 0.002, 0.003, 0.003, 0.004, 0.004, 0.005, 0.007, 0.009, 0.011];
    let lapse_rates = vec![0.05, 0.07, 0.08, 0.10, 0.14, 0.20, 0.20, 0.20, 0.10, 0.04];
    let premium = 100.0;
    let sum_assured = 25000.0;
    let interest_rate = 0.02;
    c.bench_function("npv", |b| b.iter(|| npv(&mortality_rates, &lapse_rates, interest_rate, sum_assured, premium, 1.0, None)));
}

criterion_group!(benches, add_benchmark);
criterion_main!(benches);
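One extra thing worth checking with a setup like this: the input vectors are constants from the optimizer's point of view, so in principle it could specialize the benchmarked call on them. Wrapping inputs in black_box (Criterion re-exports one; std::hint::black_box is the stable standard-library version) rules that out. A tiny illustration of what black_box does:

```rust
use std::hint::black_box;

fn main() {
    // black_box is an identity function the optimizer must treat as
    // opaque, so it cannot const-fold computations on these values.
    let x = black_box(2.0_f64);
    let y = black_box(3.0_f64);
    assert_eq!(x * y, 6.0);
}
```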

I got it down to the 277 ns range by switching to f32 throughout.

1 Like

You should probably run the two programs on the same machine first to get comparable benchmarks, preferably a machine without much else running at the same time. (edit: and of course, you didn't forget the --release flag?)

6 Likes

Seconding that--it took me an amusingly long time to realize that I had to use the --release flag to enable optimizations. You are doing that, right?

1 Like

Yes, I'm doing that. I'll have to install Julia and re-run his benchmarks to be able to compare I guess. I was just checking that I wasn't doing anything obviously wrong.

Thank you everyone

Yeah, seems like you've avoided a bunch of common mistakes, like including the vec construction in what's being benchmarked :slight_smile:

3 Likes

I just benched both (Dell XPS13 9350 running Arch Linux):

  • Rust: 76 ns
  • Julia: 184 ns
4 Likes

Ahh, interesting. Thank you.

If you know you'll always have sane (i.e. not NaN) inputs and are okay with slightly less accurate results, you can always use the Rust equivalent of the -ffast-math flag provided by most C compilers. These "fast" floating point operations are exposed using intrinsics like std::intrinsics::fadd_fast() and give the compiler more flexibility in reordering operations.

Crates like fast-floats have wrappers around f32/f64 which overload the normal arithmetic operators to use the fast float intrinsics.

3 Likes

In addition to fast math (to enable fused multiply-add and associativity, plus other transformations that may be unsafe/undesirable), I'd also recommend using native compilation flags, as in

RUSTFLAGS='-C opt-level=3 -C target-cpu=native' cargo build --release

Julia isn't making a redistributable binary so it'll JIT for the native architecture.
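One related piece is available on stable Rust without any flags or intrinsics: f64::mul_add computes a * b + c with a single rounding, and on targets with FMA hardware (which -C target-cpu=native enables) it lowers to one instruction. A sketch of the accumulation step using it (illustrative numbers, not the benchmark inputs):

```rust
fn main() {
    let v: f64 = 1.0 / 1.02; // discount factor at 2% interest
    let mut v_t = 1.0;
    let mut result = 0.0;
    for ncf in [100.0_f64, 98.0, 95.0] {
        // result += ncf * v_t, with the multiply and add fused
        result = ncf.mul_add(v_t, result);
        v_t *= v;
    }
    assert!(result > 280.0 && result < 290.0);
}
```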

4 Likes

I profiled the implementation and found that powi was taking most of the time. I changed to the following impl:

pub fn npv(mortality_rates: &[f64], lapse_rates: &[f64], interest_rate: f64, sum_assured: f64, premium: f64, init_pols: f64, term: Option<usize>) -> f64 {

    let term = term.unwrap_or_else(|| mortality_rates.len());
    let mut result = 0.0;
    let mut inforce = init_pols;
    let v = 1.0 / (1.0 + interest_rate);
    let mut v_t = v;

    for (t, (q, w)) in mortality_rates.iter().zip(lapse_rates).enumerate() {
        let no_deaths = if t < term {inforce * q} else {0.0};
        let no_lapses = if t < term {inforce * w} else {0.0};
        let premiums = inforce * premium;
        let claims = no_deaths * sum_assured;
        let net_cashflow = premiums - claims;
        result += net_cashflow * v_t;
        v_t *= v;
        inforce = inforce - no_deaths - no_lapses;
    }

    result
}

Now it's down to 31 ns :slight_smile:.

I know it no longer has to recompute the power on each iteration, since it carries the previous power forward from the previous loop, but could there also be a difference in the implementation of powi on Arch vs Windows?

I tried a few different setups in terms of build flags but it doesn't seem to make much difference. Interesting...

Hmm, I wonder if this is because of float non-associativity.

powi is an LLVM intrinsic (https://rust.godbolt.org/z/jsdMex) that calls a function from compiler-rt, which (if we're using the C implementations and not the Rust ones, which I don't remember) is implemented like this:
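The compiler-rt routine is essentially a square-and-multiply loop; roughly, in Rust (a sketch of the algorithm, not the actual library source):

```rust
// Square-and-multiply: the shape of compiler-rt's __powidf2.
fn powi(mut a: f64, mut b: i32) -> f64 {
    let recip = b < 0;
    let mut r = 1.0;
    loop {
        if b & 1 != 0 {
            r *= a; // fold in the current power when the bit is set
        }
        b /= 2;
        if b == 0 {
            break;
        }
        a *= a; // square for the next bit of the exponent
    }
    if recip { 1.0 / r } else { r }
}

fn main() {
    assert_eq!(powi(2.0, 10), 1024.0);
    assert_eq!(powi(2.0, -2), 0.25);
}
```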

That's really good for computing it once, but because of floating-point semantics the compiler technically isn't allowed to replace that with the running-product version in your alternative implementation, which can accumulate more error.
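The non-associativity in question is easy to demonstrate:

```rust
fn main() {
    let (a, b, c) = (0.1_f64, 0.2, 0.3);
    // Reassociating floating-point addition changes the result, so the
    // optimizer may not do it (without fast-math) even when it's faster.
    assert_ne!((a + b) + c, a + (b + c));
}
```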

That said, the intrinsic's definition does say

The order of evaluation of multiplications is not defined.

So maybe it's supposed to be able to do this anyway?

🤷

Also for me it went down to 31ns.

Interesting benchmark.

I tried rewriting it into a more functional approach with iterators and (unless I've made a mistake with the code) that improves the speed as well.

With the code from the post above (the running-power version), I get this result from criterion:

npv                     time:   [37.158 ns 37.368 ns 37.595 ns]                 
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild

pub fn npv(
    mortality_rates: &[f64],
    lapse_rates: &[f64],
    interest_rate: f64,
    sum_assured: f64,
    premium: f64,
    init_pols: f64,
    term: Option<usize>,
) -> f64 {
    let term = term.unwrap_or_else(|| mortality_rates.len());
    let v = 1.0 / (1.0 + interest_rate);

    mortality_rates
        .iter()
        .zip(lapse_rates)
        .enumerate()
        .fold((0.0, v, init_pols), |(accum, v_t, inforce), (t, (q, w))| {
            let no_deaths = if t < term { inforce * q } else { 0.0 };
            let no_lapses = if t < term { inforce * w } else { 0.0 };
            let premiums = inforce * premium;
            let claims = no_deaths * sum_assured;
            let net_cashflow = premiums - claims;
            (
                accum + net_cashflow * v_t,
                v_t * v,
                inforce - no_deaths - no_lapses,
            )
        })
        .1
}

and the result with criterion:

npv                     time:   [8.0005 ns 8.1359 ns 8.2846 ns]                 
                        change: [-77.409% -77.077% -76.736%] (p = 0.00 < 0.05)
                        Performance has improved.

Please tell me that I've made a mistake; I didn't expect that much difference from this rewrite.

1 Like

An easy way to check whether your refactoring changed the output would be to fuzz the old and new implementations against each other. (Though I think refactoring the code was a bit out of scope here, since the goal was to compare the exact same algorithm ^^)
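A minimal sketch of such a differential check (toy functions stand in for the two npv implementations):

```rust
// Run two supposedly-equivalent functions over a grid of inputs and
// require their results to agree within a small relative tolerance.
fn check_agree(f: impl Fn(f64) -> f64, g: impl Fn(f64) -> f64) {
    for i in 0..1000 {
        let x = i as f64 * 0.01;
        let (a, b) = (f(x), g(x));
        let tol = 1e-12 * a.abs().max(b.abs()).max(1.0);
        assert!((a - b).abs() <= tol, "mismatch at x = {x}: {a} vs {b}");
    }
}

fn main() {
    // Toy stand-ins: two ways of computing x^3 + x.
    check_agree(|x| x * x * x + x, |x| x.mul_add(x * x, x));
}
```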

Sorry, but you did: after the fold you select the second element of the tuple (v_t) with .1, when you actually need the first, .0. Since v_t is just repeatedly multiplied by a constant, the compiler can eliminate all the other work, which is of course much faster.
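The pitfall in miniature (a toy fold with a tuple accumulator, not the npv code itself):

```rust
fn main() {
    // The accumulator carries (sum, running_product); selecting the
    // wrong tuple field returns the product instead of the intended sum.
    let acc = (1..=4).fold((0, 1), |(s, p), x| (s + x, p * x));
    assert_eq!(acc.0, 10); // what we actually want
    assert_eq!(acc.1, 24); // picking .1 by mistake yields this instead
}
```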

Can I just point out that it's absolutely awesome that someone goes "hey, why is this code slow?", and the community spends over a week optimizing the living hell out of it just to show that it's possible?

6 Likes