Trouble making this low level code as fast as this high level code (not a typo)


#1

Some time ago someone posted a couple of code snippets written in C++ that showed a surprising result: for the inner loop of a split radix fast fourier transform, using the standard library’s templated complex was actually twice as slow as just using a simple struct and inlining the math yourself. Here’s the gist for reference:

I was amenable to the idea that high level code might probably prevent the compiler from seeing certain optimizations, and I knew that if the problem was in LLVM, then Rust would suffer the same issue, so I set out to translate this code snippet in Rust and see if I could replicate the performance disparity. I modified the code so that it does most of the same operations as that inner loop but represents nothing useful - I call it a fake-fourier-transform, so don’t try to decipher it in a math sense. I preserved most of the names so that referencing the source material would be simpler. Also, to preserve the spirit of the original test, I put my complex number datatype and functions in a separate, external crate. The code is listed at the end of this post.

When I translated and benchmarked the code, they ran at mostly the same speed (so nothing like the amazing disparity documented in the gist) but surprisingly the generic code consistently outperforms the manually unrolled and specialized code! This is unintuitive. I even wrote the complex library as generic as possible, with calls to clone. Could someone explain to me what’s going on? Did I do something wrong?

Here is the benchmarking code:

#![feature(test)]
extern crate complex;
extern crate test;
use complex::Complex;

fn main() { }

// Testing a "fake fourier transform" using A) an external library with generic algorithms and
// B) a simple struct specialized to floats and unrolled, manual math.

mod tests {
    const N: usize = 50000;
    use super::*;
    use test::Bencher;
    
    #[bench]
    fn zero_cost(b: &mut Bencher) {
        use complex::Complex;

        let mut x0: [Complex<f32>; N] = [Complex::new(1.0, 1.0); N];
        let mut x1: [Complex<f32>; N] = [Complex::new(1.0, 1.0); N];
        let mut x2: [Complex<f32>; N] = [Complex::new(1.0, 1.0); N];
        let mut x3: [Complex<f32>; N] = [Complex::new(1.0, 1.0); N];

        let mut fft = || {

            unsafe { for i in 0..N {
                let uk = *x0.get_unchecked(i);
                let uk_n1 = *x1.get_unchecked(i);
                let w = *x1.get_unchecked(i);

                let zk = w * *x2.get_unchecked(i);
                let zpk = w.conjugate() * *x3.get_unchecked(i);

                let zsum = zk + zpk;
                let zdif = Complex::new(0.0, -1.0) * (zk - zpk);

                *x0.get_unchecked_mut(i) = uk + zsum;
                *x1.get_unchecked_mut(i) = uk_n1 - zsum;
                *x2.get_unchecked_mut(i) = uk - zsum;
                *x3.get_unchecked_mut(i) = uk_n1 - zdif;
            }}
        };

        b.iter(|| fft());
    }

    #[bench]
    fn raw_struct(b: &mut Bencher) {

        #[derive(Copy, Clone)]
        struct Complex {
            re: f32,
            im: f32,
        }

        let mut x0: [Complex; N] = [Complex { re: 1.0, im: 1.0 }; N];
        let mut x1: [Complex; N] = [Complex { re: 1.0, im: 1.0 }; N];
        let mut x2: [Complex; N] = [Complex { re: 1.0, im: 1.0 }; N];
        let mut x3: [Complex; N] = [Complex { re: 1.0, im: 1.0 }; N];

        let mut fft = || {

            unsafe { for i in 0..N {

                let (zsumr, zsumi, zdifr, zdifi);

                {
                    let w: Complex = *x1.get_unchecked(i);
                    let in2: Complex = *x2.get_unchecked(i);
                    let in3: Complex  = *x3.get_unchecked(i);

                    let zkr = w.re * in2.re - w.im * in2.im;
                    let zki = w.re * in2.im + w.im * in2.re;
                    let zpkr = w.re * in3.re + w.im * in3.im;
                    let zpki = w.re * in3.im - w.im * in3.re;

                    zsumr = zkr + zpkr;
                    zsumi = zki + zpki;
                    zdifr = zki - zpki;
                    zdifi = zpkr - zkr;
                }

                x2.get_unchecked_mut(i).re = x0.get_unchecked(i).re - zsumr;
                x2.get_unchecked_mut(i).im = x0.get_unchecked(i).im - zsumi;
                x0.get_unchecked_mut(i).re += zsumr;
                x0.get_unchecked_mut(i).im += zsumi;
                x3.get_unchecked_mut(i).re = x1.get_unchecked(i).re - zdifr;
                x3.get_unchecked_mut(i).im = x1.get_unchecked(i).im - zdifi;
                x1.get_unchecked_mut(i).re += zdifr;
                x1.get_unchecked_mut(i).im += zdifi;
            }}
        };
        b.iter(|| fft());
    }
}

And here is the complex math library.

use ::std::ops::{Add, AddAssign, Sub, SubAssign, Mul, MulAssign, Neg};

// Helper trait, for brevity
pub trait Num: Add<Output=Self> + AddAssign + Sub<Output=Self> + SubAssign + Mul<Output=Self> + MulAssign + Neg<Output=Self> + Clone { }
impl<T: Add<Output=T> + AddAssign + Sub<Output=T> + SubAssign + Mul<Output=T> + MulAssign + Neg<Output=T> + Clone> Num for T { }

#[derive(Clone)]
pub struct Complex<T: Clone> {
    real: T,
    imag: T,
}

impl<T: Copy> Copy for Complex<T> {
}

impl<T:  Num> Complex<T>
{
    pub fn new(r: T, i: T) -> Complex<T> {
        Complex { real: r, imag: i }
    }

    pub fn real(&self) -> &T {
        &self.real
    }

    pub fn real_mut(&mut self) -> &mut T {
        &mut self.real
    }

    pub fn imag(&self) -> &T {
        &self.imag
    }

    pub fn imag_mut(&mut self) -> &mut T {
        &mut self.imag
    }

    pub fn conjugate(&self) -> Complex<T> {
        Complex::new(self.real().clone(), -self.imag().clone())
    }
}


impl<T: Num> Sub for Complex<T> {
    type Output = Self;
    fn sub(mut self, rhs: Complex<T>) -> Complex<T> {
        self.real -= rhs.real().clone();
        self.imag -= rhs.imag().clone();
        self    }
}

impl<T: Num> Add for Complex<T> {
    type Output = Self;
    fn add(mut self, rhs: Complex<T>) -> Complex<T> {
        self.real += rhs.real().clone();
        self.imag += rhs.imag().clone();
        self
    }
}

impl<T: Num> Mul for Complex<T> {
    type Output = Self;
    fn mul(mut self, rhs: Complex<T>) -> Complex<T> {
        let r;
        let i;

        {
        let a = self.real();
        let b = self.imag();
        let u = rhs.real();
        let v = rhs.imag();

        let c = |a: &T| { a.clone() };

        r = c(a) * c(u) - c(b) * c(v);
        i = c(a) * c(v) + c(b) * c(u);
        }

        self.real = r;
        self.imag = i;
        self
    }
}

impl<T: Num> AddAssign for Complex<T> {
    fn add_assign(&mut self, rhs: Complex<T>) {
        *self.real_mut() += rhs.real().clone();
        *self.imag_mut() += rhs.imag().clone();
    }
}

impl<T: Num> SubAssign for Complex<T> {
    fn sub_assign(&mut self, rhs: Complex<T>) {
        *self.real_mut() -= rhs.real().clone();
        *self.imag_mut() -= rhs.imag().clone();
    }
}

#2

What times are you seeing? Here is what I see:

test tests::raw_struct ... bench:     204,425 ns/iter (+/- 1,707)
test tests::zero_cost  ... bench:     221,690 ns/iter (+/- 2,853)

The low-level one is 8% faster.

EDIT: I am on rustc 1.15.0-nightly (71c06a56a 2016-12-18) for x86_64-unknown-linux-gnu.


#3
test tests::raw_struct ... bench:     276,634 ns/iter (+/- 16,305)
test tests::zero_cost  ... bench:     259,878 ns/iter (+/- 28,516)

This is what I’m getting. It’s within the standard deviation but the raw_struct function being about 6% slower is consistent.

Here’s my compiler:
rustc 1.15.0-nightly (71c06a56a 2016-12-18)

Target is x86_64-pc-windows-gnu.


#4

rustc 1.15.0-nightly (71c06a56a 2016-12-18) for x86_64-unknown-linux-gnu:

test benches::raw_struct ... bench:     201,870 ns/iter (+/- 238)
test benches::zero_cost  ... bench:     180,755 ns/iter (+/- 234)

with -C target-cpu=native:

test benches::raw_struct ... bench:     203,137 ns/iter (+/- 162)
test benches::zero_cost  ... bench:     197,003 ns/iter (+/- 158)

#5

EDIT: Oops, got it! They’re running at the same speed now, thanks!


#6

So what was the issue? :slight_smile:


#7

The issue was that I wasn’t quite sure how to get the compiler to generate faster or equivalent instructions in the “raw_struct” case. I still don’t know in a portable way since cpu=native is not portable!