Slower when split in two files

Hello everyone,

Assume you have the following code:

struct StaticMatrix<'a>
{
    // An n x m matrix, stored column-major

    // Dimension
    n: usize,
    m: usize,

    // Values
    v: &'a mut [f64],
}

fn mat_mul(a: StaticMatrix, b: StaticMatrix, s: StaticMatrix)
{
    if a.m != b.n
    {
        panic!("Inconsistent matrix shape in multiplication");
    }

    let mut inter: f64;

    let m_b = b.m;
    let n_b = b.n;
    let n_a = a.n;
    let n_s = s.n;

    for j in 0..m_b
    {
        for i in 0..n_a
        {
            inter = 0.0;

            // k runs over the shared dimension (a.m == b.n, checked above)
            for k in 0..n_b
            {
                inter += a.v[i + n_a*k] * b.v[k + n_b*j];
            }

            s.v[i + j*n_s] = inter;
        }
    }
}

fn main() {
    const N: usize = 1000;

    //let mut values: [f64; N*N] = [0.0;N*N];

    let a = StaticMatrix {
        n: N,
        m: N,
        v: &mut [0.0;N*N]
    };

    let b = StaticMatrix {
        n: N,
        m: N,
        v: &mut [0.0;N*N]
    };

    let s = StaticMatrix {
        n: N,
        m: N,
        v: &mut [0.0;N*N]
    };

    mat_mul(a, b, s);
}

As you can see, it is a basic implementation of matrix multiplication. When I compile this with the --release flag, it runs in 0m0,043s. However, if I split this code into two files like this:

`static_matrix.rs`:
pub struct StaticMatrix<'a>
{
    // An n x m matrix, stored column-major

    // Dimension
    pub n: usize,
    pub m: usize,

    // Values
    pub v: &'a mut [f64],
}

pub fn mat_mul(a: StaticMatrix, b: StaticMatrix, s: StaticMatrix)
{
    if a.m != b.n
    {
        panic!("Inconsistent matrix shape in multiplication");
    }

    let mut inter: f64;

    let m_b = b.m;
    let n_b = b.n;
    let n_a = a.n;
    let n_s = s.n;

    for j in 0..m_b
    {
        for i in 0..n_a
        {
            inter = 0.0;

            // k runs over the shared dimension (a.m == b.n, checked above)
            for k in 0..n_b
            {
                inter += a.v[i + n_a*k] * b.v[k + n_b*j];
            }

            s.v[i + j*n_s] = inter;
        }
    }
}

`main.rs`:

mod static_matrix;

use crate::static_matrix::StaticMatrix;
use crate::static_matrix::mat_mul;
fn main() {
    const N: usize = 1000;

    //let mut values: [f64; N*N] = [0.0;N*N];

    let a = StaticMatrix {
        n: N,
        m: N,
        v: &mut [0.0;N*N]
    };

    let b = StaticMatrix {
        n: N,
        m: N,
        v: &mut [0.0;N*N]
    };

    let s = StaticMatrix {
        n: N,
        m: N,
        v: &mut [0.0;N*N]
    };

    mat_mul(a, b, s);
}

First my stack overflows (I worked around that with the Linux `ulimit` command), but on top of that the execution time becomes 0m9,700s, about 200 times slower (still with the --release flag)... Why do I observe this, and how can I split this code without losing performance?

Thank you !

The stack is overflowing because you're creating an array that's too large to fit on the stack - a million f64s is 8 MB. I think the default stack size is usually 2 MB.

However, in the first case, the compiler is optimizing away the entire program, so those arrays never get created in the first place. Since the program has no observable side effects (nothing gets printed or written to a file, for example), the compiler is able to determine that simply skipping the program would essentially be the same as running it. This is why the first version is so fast: it's not doing anything at all.

I'm not sure why this optimization didn't get applied when you split the code into two files. But you can also prevent the optimization by printing a value from one of the arrays - that should force the compiler to actually generate code, and therefore overflow the stack.


I'm not certain, but with a private

fn mat_mul(...) 

the compiler knows for sure that it is only called in one place and has no observable output, and hence can delete it all.

But

pub fn mat_mul(...) 

is not so easy. When compiling that file the compiler has no idea who calls it or where the results go, so it has to generate code.
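If the goal is to split the code across files without losing optimization, one thing worth experimenting with (a sketch, not a guaranteed fix) is giving the optimizer more room in `Cargo.toml`. Fewer codegen units mean more cross-module inlining opportunities within a crate, and LTO extends that across crates; you can also try marking the hot function `#[inline]`.

```toml
# Cargo.toml -- release-profile knobs that can help cross-module inlining
[profile.release]
lto = true            # link-time optimization across codegen units/crates
codegen-units = 1     # a single unit gives the optimizer the whole crate
```

Whether this recovers the single-file behaviour here is an open question, but it is the standard first thing to try.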

But now I'm wondering why it is so slow in the single file case, when it does nothing!

Thank you for your answers !

I have the same time in the first case when I print values.

That's very interesting. However, if the second version actually performs the computation, why does it need 9 s? That seems slow: with Octave or Julia it takes less than 1 s (using the `*` operator on matrices), and in my mind Rust should be very fast...

Thank you

Well, it was not literally doing nothing - it seems that Rust compiled it into some kind of busy loop, i.e. every iteration reduced to nothing, but the iterations themselves are preserved.

It is slow because you are doing a huge matrix multiplication with the naive algorithm we all learn in school. There are other algorithms for matrix multiplication that are optimized for large matrices and much faster, for example the Strassen algorithm: https://en.wikipedia.org/wiki/Strassen_algorithm

I'm sure if you implement an optimized algorithm Rust will prove to be plenty fast enough.


Oh! I am in the fifth year of a master's degree in mathematics and I had never heard about optimizing this operation! I will study that!

Thank you very much !

As Wikipedia points out, the schoolbook matrix multiplication takes on the order of N^3 arithmetic operations for N x N matrices. So it gets very slow pretty quickly as your matrices grow. For the Strassen algorithm that N^3 becomes N^log2(7), about N^2.8. Much better.

You might find it instructive to run your program for different matrix sizes, from very small to very large, and plot the resulting run times against size. You will find you need a log/log plot to get a nice straight line, whose slope should agree with the exponents above.


Ha!

Well, I guess mathematics is a huge field and not many mathematicians bother themselves with the mundane details of doing arithmetic.

I'm no mathematician but I find these things fascinating. In as much as I can understand them. Some of these algorithms are simply amazing.

It took me about twenty years from seeing my first Fast Fourier Transform program (in BASIC) to actually understanding it well enough that I could implement an FFT from scratch in C. Try as I might I could not understand any of the descriptions of the workings of the FFT, despite doing years of maths and covering Fourier analysis in uni.

Recently I posed a little coding challenge: calculate the first number in the Fibonacci sequence that contains one million decimal digits. That led to finding and implementing the Karatsuba algorithm for large integer multiplication, plus optimized ways to calculate big Fibonacci numbers. That brought the execution time down from minutes to a couple of hundred milliseconds! You can see it run as WASM in a web page in your browser here: https://otaniemi.conveqs.fi/public/fibo.html

Sounds like you might be just the person to implement the Strassen algorithm in Rust. Hint, hint... :slight_smile:

Also, when comparing timings avoid debug compiles, as those are always much slower. Search this forum for hints on how to further optimize your code if that's important.

Yes, algorithms useful for applied maths (and other applied domains) are often very interesting!

I have the impression that there are not many numeric libraries in Rust (it's just my impression, I'm not sure), and when I see the power of this language I think that a library equivalent to LAPACK (Fortran) written in Rust would be very powerful. I would be honored to build this algorithm in Rust (does it not exist yet?) and, why not, the beginning of such a library :slight_smile:


I don't understand: what do you mean by "debug compiles"?

Your feeling is right, Rust is not at all strong in the field of numerics yet, although in theory it certainly is a great candidate. The memory model of the safe subset is especially appealing; I've read someplace that some Fortran implementations of numeric algorithms were faster than their C counterparts because Fortran functions' array-valued parameters admit non-aliasing rules that C can't prove, hence Fortran compilers don't need to be nearly as conservative when performing certain optimizations. Rust could certainly replicate most of these advantages given its borrowing + exclusivity model.


That's referring to what you're already doing: if you need the speed, build only with --release.


Be gentle,

We are facing the fact that during a five year masters degree study of mathematics a student may know nothing about the practicalities of computing. For example:

  1. Do not create megabytes of variables on the stack. It will only end in tears.

  2. Compilers can do amazing feats of optimization in exchange for longer compile times. "cargo run --release" can produce a program an order of magnitude faster than a regular "cargo run".

  3. People have been thinking about optimizing all kind of algorithms so as to get them done faster on computers for decades, from sorting, to multiplication, to Fourier transforms, to matrix multiplication etc, etc,

I get the feeling that in general mathematicians do not think about time. If an equation, algorithm, or proof is true then it is true, somehow instantaneously over all time, never mind how long it may take to calculate the thing in practice.

That task comes down to the branch of mathematics called "computer science".

I agree. In fact I have seen experienced mathematicians create difficult algorithms to solve difficult problems while knowing nothing about code (unable to program in Python or MATLAB...). I think that is harmful, because you can't build a good algorithm if you know nothing about the machine.

For my part I try to understand sophisticated languages (like Rust) because I think it's useful for being a good "modern" mathematician!

I don't know.

It seems to me that we have had the idea of "algorithm" ever since Euclid and his greatest common divisor idea. No matter if it is fast to calculate or not. The goal is to prove that it is correct.

The whole issue of dealing with actual big numbers and data sets for practical purposes was not on the mathematicians agenda until recently. Think of all those human "computers" calculating tables of artillery trajectories a hundred years ago. Or Richard Feynman and co. on the Los Alamos project.

Back in 1980-something a maths postgrad friend of mine always said how he hated computer use in mathematics. After all, if you need the computer to understand the problem, that means that you do not. Where is the fun in that?

I would not blame any mathematician for pursuing the purity of the proof, the exploration of the model, no matter how impractical it may be. So many times these crazy ideas have become useful many years, generations, later.

Sorry, I kind of missed that statement.

In the abstract, an algorithm has nothing to do with any machine or any practical incarnation. A mathematician should not be limited by such mundane thoughts.