Hi. I'm hoping someone can help me understand whether I'm confused or leaving some performance on the table.

In my work, I deal a lot with Jones matrices (2x2, complex valued, either 32- or 64-bit precision). A natural thing to do in Rust is:

```
use std::ops::Mul;

use num_complex::Complex;
use num_traits::{Float, Num};

#[derive(Clone, Default, PartialEq)]
pub struct Jones<F: Float + Num>([Complex<F>; 4]);

impl<F: Float> Mul<&Jones<F>> for Jones<F> {
    type Output = Self;

    // 2x2 matrix product; the four elements are stored row by row.
    fn mul(self, rhs: &Self) -> Self {
        Self::from([
            self[0] * rhs[0] + self[1] * rhs[2],
            self[0] * rhs[1] + self[1] * rhs[3],
            self[2] * rhs[0] + self[3] * rhs[2],
            self[2] * rhs[1] + self[3] * rhs[3],
        ])
    }
}
// other methods elided (including the `From` and `Index` impls used above)
```

(I've opted not to use the `Copy` trait, as potentially copying 8 double-precision floats around "felt" like a lot. I've not done benchmarks on that.)
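For scale, here is a rough sketch of how many bytes a by-value copy would actually move. It avoids the `num_complex` dependency by using a hypothetical stand-in, `Cplx<F>`, which assumes the same `#[repr(C)]` two-field layout as `Complex<F>`; `JonesSketch` and the two helper functions are likewise names made up for this illustration.

```rust
use std::mem::size_of;

// Hypothetical stand-in for num_complex::Complex<F>: two values of type F,
// laid out as `re` followed by `im`.
#[repr(C)]
#[derive(Clone, Copy)]
#[allow(dead_code)]
struct Cplx<F> {
    re: F,
    im: F,
}

// Same shape as the Jones newtype: four complex values in a fixed array.
#[derive(Clone, Copy)]
#[allow(dead_code)]
struct JonesSketch<F>([Cplx<F>; 4]);

// Size in bytes of a single-precision matrix.
fn jones_size_f32() -> usize {
    size_of::<JonesSketch<f32>>()
}

// Size in bytes of a double-precision matrix.
fn jones_size_f64() -> usize {
    size_of::<JonesSketch<f64>>()
}

fn main() {
    println!("JonesSketch<f32>: {} bytes", jones_size_f32()); // 32 bytes
    println!("JonesSketch<f64>: {} bytes", jones_size_f64()); // 64 bytes
}
```

So the double-precision case is 64 bytes per matrix. Whether copying that is cheaper or more expensive than passing a reference is something only a benchmark would settle.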

I wanted to make sure that the multiplication of two Jones matrices is as good as it should be (i.e. the same as two arrays), but it appears that this is not the case:

```
use criterion::*;
use num_complex::Complex;

type c64 = Complex<f64>;

fn misc(c: &mut Criterion) {
    c.bench_function("multiply Jones<f64>", |b| {
        let i = c64::new(1.0, 2.0);
        let j1 = Jones::from([i, i + 1.0, i + 2.0, i + 3.0]);
        let j2 = Jones::from([i * 2.0, i * 3.0, i * 4.0, i * 5.0]);
        b.iter(|| {
            let _j3 = j1.clone() * &j2;
        })
    });

    c.bench_function("multiply [c64; 4]", |b| {
        let i = c64::new(1.0, 2.0);
        let j1 = [i, i + 1.0, i + 2.0, i + 3.0];
        let j2 = [i * 2.0, i * 3.0, i * 4.0, i * 5.0];
        let mul = |j1: [c64; 4], j2: &[c64; 4]| {
            black_box([
                j1[0] * j2[0] + j1[1] * j2[2],
                j1[0] * j2[1] + j1[1] * j2[3],
                j1[2] * j2[0] + j1[3] * j2[2],
                j1[2] * j2[1] + j1[3] * j2[3],
            ])
        };
        b.iter(|| {
            let _j3 = mul(j1.clone(), &j2);
        })
    });
}

criterion_group!(benches, misc);
criterion_main!(benches);
```

Output of `cargo bench`:

```
multiply Jones<f64>     time:   [6.6112 ns 6.6144 ns 6.6178 ns]
                        change: [-6.7829% -6.6348% -6.4965%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

multiply [c64; 4]       time:   [5.5844 ns 5.5885 ns 5.5925 ns]
                        change: [+1.4145% +1.4973% +1.5759%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```

(The variances look scary but they're reasonably consistent; the `Jones` newtype is consistently ~1 ns slower than the direct multiplication.)

I need to use criterion's `black_box` to prevent the array benchmark from appearing to have a runtime of 0 s, but I don't know if this is causing other problems.
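One variation I could try is to treat both versions identically, passing the inputs and the result of each multiplication through `black_box`, so that neither benchmark gets optimised differently from the other. Below is a minimal sketch of that idea, not a claim that it explains the 1 ns gap. It uses `std::hint::black_box` from the standard library rather than criterion's. `C64`, `JonesSketch`, `mul_array`, `mul_jones` and the two `bench_body_*` functions are hypothetical stand-ins so the snippet compiles without `num_complex` or criterion.

```rust
use std::hint::black_box;
use std::ops::{Add, Mul};

// Hypothetical stand-in for num_complex::Complex<f64>.
#[derive(Clone, Copy, Debug, PartialEq)]
struct C64 {
    re: f64,
    im: f64,
}

impl Add for C64 {
    type Output = C64;
    fn add(self, r: C64) -> C64 {
        C64 { re: self.re + r.re, im: self.im + r.im }
    }
}

impl Mul for C64 {
    type Output = C64;
    fn mul(self, r: C64) -> C64 {
        C64 {
            re: self.re * r.re - self.im * r.im,
            im: self.re * r.im + self.im * r.re,
        }
    }
}

// Hypothetical stand-in for the Jones newtype.
#[derive(Clone, Debug, PartialEq)]
struct JonesSketch([C64; 4]);

// 2x2 complex matrix product on plain arrays.
fn mul_array(a: &[C64; 4], b: &[C64; 4]) -> [C64; 4] {
    [
        a[0] * b[0] + a[1] * b[2],
        a[0] * b[1] + a[1] * b[3],
        a[2] * b[0] + a[3] * b[2],
        a[2] * b[1] + a[3] * b[3],
    ]
}

// The same product through the newtype.
fn mul_jones(a: &JonesSketch, b: &JonesSketch) -> JonesSketch {
    JonesSketch(mul_array(&a.0, &b.0))
}

// What the body of each `b.iter` closure might look like: inputs and
// output both pass through black_box, in the same way for both variants.
fn bench_body_array(a: &[C64; 4], b: &[C64; 4]) -> [C64; 4] {
    black_box(mul_array(black_box(a), black_box(b)))
}

fn bench_body_jones(a: &JonesSketch, b: &JonesSketch) -> JonesSketch {
    black_box(mul_jones(black_box(a), black_box(b)))
}

fn main() {
    let i = C64 { re: 1.0, im: 2.0 };
    let one = C64 { re: 1.0, im: 0.0 };
    let a = [i, i + one, i + one + one, i + one + one + one];
    let b = [i * i, i, one, i + one];
    println!("{:?}", bench_body_array(&a, &b));
    println!("{:?}", bench_body_jones(&JonesSketch(a), &JonesSketch(b)));
}
```

In the real benchmark, each `bench_body_*` call would sit inside its own `b.iter(|| ...)` closure. Note that this sketch takes both operands by reference, which differs from my `Mul` impl above (it consumes `self`), so the `clone()` cost would need to be handled the same way in both cases too.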

Is my benchmark flawed? Or is there some way to improve my newtype code? I'd like to improve this as much as I can -- I will be multiplying petabytes of these Jones matrices!