Performance penalty on array newtype

chjordan · August 18, 2021, 1:23pm

Hi. I'm hoping someone can help me understand whether I'm confused or leaving some performance on the table.

In my work, I deal a lot with Jones matrices (2x2, complex valued, either 32- or 64-bit precision). A natural thing to do in Rust is:

use std::ops::Mul;

use num_complex::Complex;
use num_traits::{Float, Num};

#[derive(Clone, Default, PartialEq)]
pub struct Jones<F: Float + Num>([Complex<F>; 4]);

impl<F: Float> Mul<&Jones<F>> for Jones<F> {
    type Output = Self;

    fn mul(self, rhs: &Self) -> Self {
        Self::from([
            self[0] * rhs[0] + self[1] * rhs[2],
            self[0] * rhs[1] + self[1] * rhs[3],
            self[2] * rhs[0] + self[3] * rhs[2],
            self[2] * rhs[1] + self[3] * rhs[3],
        ])
    }
}

// other methods elided

(I've opted not to derive Copy, as potentially copying 8 double-precision floats around "felt" like a lot. I haven't benchmarked that.)
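For a sense of scale on the Copy question, here is a minimal sketch of how large such a value is. `C64` and `JonesF64` are hypothetical std-only stand-ins (not num_complex's `Complex<f64>` or the `Jones<F>` above), declared `#[repr(C)]` and `#[repr(transparent)]` so that the layout is guaranteed:

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a double-precision complex number.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct C64 {
    pub re: f64,
    pub im: f64,
}

/// Hypothetical stand-in for the newtype; `repr(transparent)` guarantees the
/// same layout as the wrapped array.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct JonesF64(pub [C64; 4]);

/// Size in bytes of the bare array and of the newtype wrapping it.
pub fn sizes() -> (usize, usize) {
    (size_of::<[C64; 4]>(), size_of::<JonesF64>())
}
```

Both sizes are 64 bytes (eight f64 values), so the wrapper itself adds no storage. Whether copying 64 bytes is cheaper than borrowing in a given loop is something only a benchmark can settle.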

I wanted to make sure that the multiplication of two Jones matrices is as good as it should be (i.e. the same as two arrays), but it appears that this is not the case:

use criterion::*;
use num_complex::Complex;

type c64 = Complex<f64>;

fn misc(c: &mut Criterion) {
    c.bench_function("multiply Jones<f64>", |b| {
        let i = c64::new(1.0, 2.0);
        let j1 = Jones::from([i, i + 1.0, i + 2.0, i + 3.0]);
        let j2 = Jones::from([i * 2.0, i * 3.0, i * 4.0, i * 5.0]);
        b.iter(|| {
            let _j3 = j1.clone() * &j2;
        })
    });

    c.bench_function("multiply [c64; 4]", |b| {
        let i = c64::new(1.0, 2.0);
        let j1 = [i, i + 1.0, i + 2.0, i + 3.0];
        let j2 = [i * 2.0, i * 3.0, i * 4.0, i * 5.0];
        let mul = |j1: [c64; 4], j2: &[c64; 4]| {
            black_box([
                j1[0] * j2[0] + j1[1] * j2[2],
                j1[0] * j2[1] + j1[1] * j2[3],
                j1[2] * j2[0] + j1[3] * j2[2],
                j1[2] * j2[1] + j1[3] * j2[3],
            ])
        };
        b.iter(|| {
            let _j3 = mul(j1.clone(), &j2);
        })
    });
}

criterion_group!(benches, misc);
criterion_main!(benches);

Output of cargo bench

multiply Jones<f64>     time:   [6.6112 ns 6.6144 ns 6.6178 ns]
                        change: [-6.7829% -6.6348% -6.4965%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

multiply [c64; 4]       time:   [5.5844 ns 5.5885 ns 5.5925 ns]
                        change: [+1.4145% +1.4973% +1.5759%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

(The variances look scary but they're reasonably consistent; the Jones newtype is consistently ~1ns slower than the direct multiplication).

I need to use criterion's black_box to prevent the array benchmark from appearing to have a runtime of 0s, but I don't know if this is causing other problems.

Is my benchmark flawed? Or is there some way to improve my newtype code? I'd like to improve this as much as I can -- I will be multiplying petabytes of these Jones matrices!

kpreid · August 18, 2021, 1:47pm

I think your benchmarks should pass _j3 to black_box to ensure that its calculation is not at all optimized away. (And remove it from the local mul function since that isn't realistic.)
Check the effect of adding #[inline] to fn mul and fn from to enable cross-crate inlining.
Try implementing Mul<Jones> instead of Mul<&Jones>; indirection has a cost. (But this will probably make no difference if the function is inlined.)
This is a tiny piece of arithmetic, and it may be that the performance difference you're measuring here is completely swamped by its post-optimization relationship to the code around it. Consider writing a bigger benchmark — one which actually uses this multiplication along with other code to compute something realistic.
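The first suggestion could look roughly like the sketch below. It is std-only by assumption: it uses `std::hint::black_box` (stable since Rust 1.66) and a hypothetical `C64` stand-in instead of criterion and num_complex, and the names `mul` and `run` are illustrative. The inputs and the result pass through `black_box`; the multiplication itself does not:

```rust
use std::hint::black_box;
use std::ops::{Add, Mul};

/// Hypothetical stand-in for `Complex<f64>`, so the sketch needs no crates.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct C64 {
    pub re: f64,
    pub im: f64,
}

impl Add for C64 {
    type Output = C64;
    #[inline]
    fn add(self, o: C64) -> C64 {
        C64 { re: self.re + o.re, im: self.im + o.im }
    }
}

impl Mul for C64 {
    type Output = C64;
    #[inline]
    fn mul(self, o: C64) -> C64 {
        C64 {
            re: self.re * o.re - self.im * o.im,
            im: self.re * o.im + self.im * o.re,
        }
    }
}

/// Row-major 2x2 product, same formula as in the post; no black_box inside.
#[inline]
pub fn mul(a: [C64; 4], b: &[C64; 4]) -> [C64; 4] {
    [
        a[0] * b[0] + a[1] * b[2],
        a[0] * b[1] + a[1] * b[3],
        a[2] * b[0] + a[3] * b[2],
        a[2] * b[1] + a[3] * b[3],
    ]
}

/// Stand-in for the closure passed to `b.iter`: repeat the product `iters` times.
pub fn run(iters: u32, a: [C64; 4], b: [C64; 4]) -> [C64; 4] {
    let mut last = a;
    for _ in 0..iters {
        // black_box on the inputs hides their constant values from the
        // optimizer; black_box on the output keeps the product from being
        // discarded as dead code.
        last = black_box(mul(black_box(a), black_box(&b)));
    }
    last
}
```

In the criterion benchmark from the question, the equivalent change would be to write `black_box(j1.clone() * &j2)` inside `b.iter` and to drop the `black_box` from the local `mul` closure.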

chjordan · August 19, 2021, 6:45am

Thanks for your input, @kpreid! After playing around with your suggestions, I've arrived at the following:

#[inline] on all Jones methods
Deriving Copy on Jones.
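A minimal sketch of those two changes, assuming `f64` only and a hypothetical `C64` stand-in for `Complex<f64>` so that it compiles without external crates. The `From` and `Index` impls are guesses at the elided methods, not the originals:

```rust
use std::ops::{Add, Index, Mul};

/// Hypothetical stand-in for `Complex<f64>`.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct C64 {
    pub re: f64,
    pub im: f64,
}

impl Add for C64 {
    type Output = C64;
    #[inline]
    fn add(self, o: C64) -> C64 {
        C64 { re: self.re + o.re, im: self.im + o.im }
    }
}

impl Mul for C64 {
    type Output = C64;
    #[inline]
    fn mul(self, o: C64) -> C64 {
        C64 {
            re: self.re * o.re - self.im * o.im,
            im: self.re * o.im + self.im * o.re,
        }
    }
}

/// Newtype over a row-major 2x2 matrix, now deriving `Copy` as well.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Jones([C64; 4]);

impl From<[C64; 4]> for Jones {
    #[inline]
    fn from(a: [C64; 4]) -> Self {
        Jones(a)
    }
}

impl Index<usize> for Jones {
    type Output = C64;
    #[inline]
    fn index(&self, i: usize) -> &C64 {
        &self.0[i]
    }
}

impl Mul<&Jones> for Jones {
    type Output = Jones;

    // #[inline] makes the body available for inlining across crates.
    #[inline]
    fn mul(self, rhs: &Jones) -> Jones {
        Jones::from([
            self[0] * rhs[0] + self[1] * rhs[2],
            self[0] * rhs[1] + self[1] * rhs[3],
            self[2] * rhs[0] + self[3] * rhs[2],
            self[2] * rhs[1] + self[3] * rhs[3],
        ])
    }
}
```

With `Copy` derived, `j1 * &j2` can be written repeatedly without the `.clone()` used in the benchmark, since the left operand is copied rather than moved.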

Inlining had the most dramatic effect:

multiply Jones<f64>     time:   [929.12 ps 929.86 ps 930.82 ps]
                        change: [+3.0902% +3.4459% +3.7728%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

multiply [c64; 4]       time:   [927.99 ps 928.74 ps 930.07 ps]
                        change: [+0.0371% +0.1101% +0.1930%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 18 outliers among 100 measurements (18.00%)
  10 (10.00%) high mild
  8 (8.00%) high severe

(Roughly 6 to 7 times faster than before.)

system · November 17, 2021, 6:46am

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.
