Executable size and performance vs. C?


#21

I tried Steve’s version and that runs in about .23 seconds on my machine, vs. about .15 seconds for C.


#22

I would also write it this way, though with more newlines :slightly_smiling:


#23

I get a much higher variance with the C version:

steve@warmachine:~/tmp$ time ./hello 
31626

real	0m0.124s
user	0m0.124s
sys	0m0.000s
steve@warmachine:~/tmp$ time ./hello 
31626

real	0m0.153s
user	0m0.148s
sys	0m0.004s
steve@warmachine:~/tmp$ time ./hello 
31626

real	0m0.124s
user	0m0.120s
sys	0m0.000s
steve@warmachine:~/tmp$ time ./hello 
31626

real	0m0.195s
user	0m0.192s
sys	0m0.000s
steve@warmachine:~/tmp$ time ./hello 
31626

real	0m0.181s
user	0m0.176s
sys	0m0.000s
steve@warmachine:~/tmp$ time ./hello 
31626

real	0m0.169s
user	0m0.168s
sys	0m0.000s

Same compiler flags. So, sometimes faster, sometimes slower, whereas the Rust is very consistent. I have no idea why this is.

Actually, sorry, that was gcc. With clang -O3:

steve@warmachine:~/tmp$ time ./a.out 
31626

real	0m0.115s
user	0m0.112s
sys	0m0.000s

steve@warmachine:~/tmp$ time ./a.out 
31626

real	0m0.119s
user	0m0.116s
sys	0m0.000s
steve@warmachine:~/tmp$ time ./a.out 
31626

real	0m0.154s
user	0m0.152s
sys	0m0.000s
steve@warmachine:~/tmp$ time ./a.out 
31626

real	0m0.168s
user	0m0.164s
sys	0m0.000s

Same deal.

By the way, I also edited the above message to use proper formatting. It’s markdown: use either four spaces, or triple backticks to show code.


#24

Strange. I am not seeing that variance:

dca@igor:~/Euler/rust$ cd ../c
dca@igor:~/Euler/c$ ls
problem_20 problem_20.c problem_20_in_c
dca@igor:~/Euler/c$ time ./problem_20
31626

real 0m0.160s
user 0m0.157s
sys 0m0.000s
dca@igor:~/Euler/c$ time ./problem_20
31626

real 0m0.165s
user 0m0.163s
sys 0m0.000s
dca@igor:~/Euler/c$ time ./problem_20
31626

real 0m0.155s
user 0m0.153s
sys 0m0.000s
dca@igor:~/Euler/c$ time ./problem_20
31626

real 0m0.155s
user 0m0.157s
sys 0m0.000s
dca@igor:~/Euler/c$ time ./problem_20
31626

real 0m0.166s
user 0m0.163s
sys 0m0.000s
dca@igor:~/Euler/c$ time ./problem_20
31626

real 0m0.156s
user 0m0.153s
sys 0m0.000s
dca@igor:~/Euler/c$


#25

Please format your code and output; it’s a lot easier for us to read.
Enclose your code in triple backticks like this:

```
// Code goes here
```

For me there is almost no difference between C and Rust (with iterators) execution time:

```
C:    0.21s user 0.00s system 99% cpu 0.209 total
Rust: 0.20s user 0.01s system 98% cpu 0.212 total
```

C is a little faster (between 0.205 and 0.209) than Rust (between 0.210 and 0.216), which doesn’t seem really significant to me.

Ah yeah, I like to write if/else expressions on one line when they fit and it’s for the purpose of returning a value depending on a condition. It’s like the ternary operator but more explicit :slightly_smiling:
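For anyone coming from C, a minimal sketch of that point: `if`/`else` in Rust is itself an expression, so it can stand in for the ternary operator directly.

```rust
// `if`/`else` evaluates to a value, so it can initialize a binding
// the way C's `cond ? a : b` would, just with ordinary syntax.
fn sign(x: i32) -> i32 {
    if x >= 0 { 1 } else { -1 }
}

fn main() {
    let parity = if 7 % 2 == 0 { "even" } else { "odd" };
    println!("sign(-2) = {}, 7 is {}", sign(-2), parity);
}
```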


#26

Thanks for the formatting tip; I hadn’t taken the time to figure out how to use this site and saw that the code was pretty ugly…

Even though the code I submitted earlier gets the right answer in all versions, it contains some cruft: vestiges of changes in how I wrote it as I was solving the problem. For example, the test for < 10000 in the amicable function isn’t necessary; it’s never going to see an n >= 10000 because of how the main function is written.

Here are improved versions.

C:

```
#include <stdio.h>
#include <stdlib.h>

int d(int n) {
    int n_half=n/2, m, result=1;
    for (m=2; m<=n_half; m++)
        if ((n % m) == 0)
            result += m;
    return result;
}

int amicable(int n) {
    int dn=d(n), ddn=d(dn);
    if ((ddn == n) && (n != dn))
        return n;
    else
        return 0;
}

int main () {
    int i, result=0;
    for (i=1; i<10000; i++)
        result += amicable(i);
    printf("%d\n", result);
    exit(0);
}
```

Rust:

```
fn amicable(n: i32) -> i32 {
    let dn = d(n);
    let ddn = d(dn);
    if (ddn == n) && (n != dn) {
        n
    } else {
        0
    }
}

fn d(n: i32) -> i32 {
    (2..(n / 2 + 1)).
        filter(|m| n % m == 0).
        fold(0, |result, i| result + i)
}

fn main() {
    let mut result: i32 = 0;
    for i in 1..10000 {
        result = result + amicable(i);
    }
    println!("{}", result);
}
```

And since a couple of you have gotten Haskell-ey, here’s a Haskell version:

```
d :: Int -> Int
d n = foldl (\acc m -> if mod n m == 0 then acc + m else acc) 1 [2..(div n 2)]

amicable :: Int -> Int
amicable n = if (d (d n)) == n && n /= d n then n else 0

main = do print (foldl (\acc n -> acc + amicable n) 0 [1 .. 9999])
```

On my machine, the C version runs in about .16 seconds. The Rust version runs in about .23 seconds, or a little under 44% slower than the C version. The Haskell version runs in about .84 seconds, or 425% slower than the C version.

As I said before, the machine is a Thinkpad X-250, 4 GB memory, 7200 rpm disk, Arch Linux, up-to-date.


#27

In your C code you’re using ints, while in the Rust version you’re using i32, that is a 32 bit signed integer. Are those two types equal on your system? Above I’ve used int32_t in the C code to be sure it’s a more correct comparison.
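One quick way to check this from the Rust side (a small sketch; `std::os::raw::c_int` mirrors the platform C ABI’s `int`, which is 4 bytes on common 64-bit desktop targets, though the C standard only guarantees at least 16 bits):

```rust
use std::mem::size_of;
use std::os::raw::c_int;

fn main() {
    // On x86_64 Linux both print 4; on unusual ABIs c_int could differ.
    println!("c_int: {} bytes, i32: {} bytes",
             size_of::<c_int>(), size_of::<i32>());
}
```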


#28

Yes, they’re the same.


#29

(A tip I didn’t know until fairly recently is `perf stat -r 10 ./program`, which will run a program repeatedly and summarize the time it takes.)


#30

Nice! Okay, so here’s a full run from me:

steve@warmachine:~/tmp$ cat hello.rs 
fn d(n: i32) -> i32 {
    (2..(n / 2 + 1)).
        filter(|m| n % m == 0).
        fold(0, |result, i| result + i)
}

fn amicable(n: i32) -> i32 {
    let dn = d(n);
    let ddn = d(dn);

    if ddn == n && n != dn && n < 100_000 {
        n
    } else {
        0
    }
}

fn main() {
    let mut result = 0;

    for i in 1 .. 10_000 {
        result += amicable(i);
    }

    println!("{}", result);
}
steve@warmachine:~/tmp$ rustc -O hello.rs 
steve@warmachine:~/tmp$ perf stat -r 10 ./hello
41274
41274
41274
41274
41274
41274
41274
41274
41274
41274

 Performance counter stats for './hello' (10 runs):

        134.867537      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.37% )
                 2      context-switches          #    0.016 K/sec                    ( +-  6.06% )
                 1      cpu-migrations            #    0.004 K/sec                    ( +- 53.75% )
               106      page-faults               #    0.786 K/sec                    ( +-  0.24% )
       435,105,586      cycles                    #    3.226 GHz                      ( +-  0.03% )
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
       536,636,455      instructions              #    1.23  insns per cycle          ( +-  0.00% )
       164,832,900      branches                  # 1222.184 M/sec                    ( +-  0.00% )
           154,391      branch-misses             #    0.09% of all branches          ( +-  0.14% )

       0.135123580 seconds time elapsed                                          ( +-  0.38% )

steve@warmachine:~/tmp$ cat hello.c 
#include <stdio.h>
#include <stdlib.h>

int d(int n) {
    int n_half=n/2, m, result=1;
    for (m=2; m<=n_half; m++)
        if ((n % m) == 0)
            result += m;
    return result;
}

int amicable(int n) {
    int dn=d(n), ddn=d(dn), result=0;
    if ((ddn == n) && (n != dn))
        if (n < 10000)
            result += n;
    return result;
}

int main () {
    int i, result=0;
    for (i=1; i<10000; i++) 
        result += amicable(i);
    printf("%d\n", result);
    exit(0);
}
steve@warmachine:~/tmp$ clang -O3 hello.c
steve@warmachine:~/tmp$ perf stat -r 10 ./a.out
31626
31626
31626
31626
31626
31626
31626
31626
31626
31626

 Performance counter stats for './a.out' (10 runs):

        115.504819      task-clock (msec)         #    0.998 CPUs utilized            ( +-  3.26% )
                 2      context-switches          #    0.021 K/sec                    ( +- 16.67% )
                 1      cpu-migrations            #    0.004 K/sec                    ( +- 53.75% )
                46      page-faults               #    0.398 K/sec                    ( +-  1.30% )
       361,705,040      cycles                    #    3.132 GHz                      ( +-  0.02% )
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
       340,614,308      instructions              #    0.94  insns per cycle          ( +-  0.00% )
        10,571,430      branches                  #   91.524 M/sec                    ( +-  0.01% )
            44,740      branch-misses             #    0.42% of all branches          ( +-  0.48% )

       0.115752287 seconds time elapsed                                          ( +-  3.27% )

steve@warmachine:~/tmp$ uname -a
Linux warmachine 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) x86_64 GNU/Linux

Neat!


#31

@steveklabnik and @donallen, just for the sake of correctness: in your last Rust implementations you get the wrong final answer printed because you start the fold accumulator at zero instead of one.

`fold(0, |result, i| result + i)`

gives 41274 as final result

`fold(1, |result, i| result + i)`

gives the original, I assume correct, value of 31626 :slightly_smiling:
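The off-by-one is easy to see on a known input. A quick sketch using the `d` from the earlier posts (220 and 284 are the classic amicable pair, and 1 must be counted as a proper divisor):

```rust
fn d_fold0(n: i32) -> i32 {
    // Accumulator starts at 0: the divisor 1 is never counted,
    // because the range only visits 2..=n/2.
    (2..(n / 2 + 1)).filter(|m| n % m == 0).fold(0, |acc, m| acc + m)
}

fn d_fold1(n: i32) -> i32 {
    // Accumulator starts at 1, standing in for the divisor 1.
    (2..(n / 2 + 1)).filter(|m| n % m == 0).fold(1, |acc, m| acc + m)
}

fn main() {
    // d(220) should be 284; the fold(0, ...) version is one short.
    println!("fold(0): {}", d_fold0(220)); // 283
    println!("fold(1): {}", d_fold1(220)); // 284
}
```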


#32

Whoops! At least that shouldn’t affect the speed :sweat_smile:


#33

31626 is correct, confirmed by my submission to the Euler Project website last night.


#34

Rust 1.8 nightly, Clang 3.7.1, Intel core i5:

C:

 Performance counter stats for './c' (10 runs):

        157.773517      task-clock (msec)         #    0.983 CPUs utilized          
                18      context-switches          #    0.113 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                44      page-faults               #    0.275 K/sec                  
       460,557,670      cycles                    #    2.879 GHz                      (83.14%)
       299,530,716      stalled-cycles-frontend   #   64.80% frontend cycles idle     (83.14%)
        68,134,185      stalled-cycles-backend    #   14.74% backend  cycles idle     (66.32%)
       348,827,544      instructions              #    0.75  insns per cycle        
                                                  #    0.86  stalled cycles per insn  (83.45%)
         3,424,487      branches                  #   21.404 M/sec                    (83.79%)
            53,505      branch-misses             #    1.61% of all branches          (83.66%)

       0.160472723 seconds time elapsed                                          ( +-  0.40% )

Rust:

 Performance counter stats for 'target/release/rust' (10 runs):

        161.201039      task-clock (msec)         #    0.975 CPUs utilized          
                18      context-switches          #    0.109 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                81      page-faults               #    0.492 K/sec                  
       469,508,148      cycles                    #    2.850 GHz                      (82.89%)
       254,773,475      stalled-cycles-frontend   #   53.88% frontend cycles idle     (83.28%)
        78,178,590      stalled-cycles-backend    #   16.53% backend  cycles idle     (67.17%)
       455,225,406      instructions              #    0.96  insns per cycle        
                                                  #    0.57  stalled cycles per insn  (83.63%)
       123,972,610      branches                  #  752.610 M/sec                    (83.63%)
           245,925      branch-misses             #    0.20% of all branches          (83.12%)

       0.165250902 seconds time elapsed                                          ( +-  0.62% )

The runtime is within 5 milliseconds. What I find interesting is the number of branches in C vs. Rust…


#35

Hm, this is a pretty long thread about performance to have not spoken a word about cpu scaling or cpu affinity. I guess it’s short running enough that affinity doesn’t matter so much (as evidenced by the 0 cpu-migrations), but please check that /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor is set to performance (on Linux) if you want more consistent results. If you want cpu affinity, take a look at likwid.
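For reference, a quick way to inspect the governor on Linux (the paths assume the standard cpufreq sysfs interface, which may be absent in VMs and containers):

```shell
# Print each CPU's current scaling governor; stays quiet if the
# cpufreq interface is not exposed on this machine.
for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    [ -r "$f" ] && printf '%s: %s\n' "$f" "$(cat "$f")"
done
echo "check complete"
```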


#36

I’ve just discovered how immature LLVM/Clang was on ARM.

In the benchmark results posted above there doesn’t seem to be much difference between C and Rust on x86_64. On my ARM system, however, it’s gcc’s 1.5s against 3.4s for Rust (and for clang 3.6), meaning the newer LLVM 3.7 bundled with Rust makes no difference.


#37

A source of branches is the checking Rust does for the % operator (just like for division). It often optimizes out, but here it looks like it doesn’t, yet the impact of those branches is pretty minimal. I’d switch to unsigned integers to remove some of the panic cases that are checked for in the code.
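A sketch of that suggestion, reusing the `d` from the earlier posts: on `u32`, `n % m` can no longer hit the `i32::MIN % -1` overflow case, so that panic branch disappears, leaving only the divide-by-zero check (which the optimizer can discharge here since `m` starts at 2).

```rust
// Same divisor sum as before, but on unsigned integers: one of the
// two panic paths behind `%` simply cannot occur for u32.
fn d(n: u32) -> u32 {
    (2..=n / 2).filter(|m| n % m == 0).fold(1, |acc, m| acc + m)
}

fn main() {
    // 220 and 284 form the classic amicable pair.
    println!("d(220) = {}, d(284) = {}", d(220), d(284));
}
```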


#38

Thanks, that’s definitely true. Rewriting d to use a for loop reduces the number of branches from 125M to 80M, so there is still some optimization potential in the iterator-based version. However, the branches are also well predicted, so I see only a 1–2 ms (<1%) difference.


#39

An update on the ARM issue - it seems it was just a little regression in LLVM (quite mature otherwise) leading to the much slower `modsi3` being chosen over `divmod` on processors without hardware division support.

Soon to be fixed in 3.8. (thanks ARM guys)

EDIT:
The patch is here which brings performance in line with gcc’s (1.5s vs 1.6s).
A Happy End!


#40

I was curious about the difference cited with switching to iterator adapters, so I looked at the LLVM IR for these versions of d:
A (fast):

```
fn d(n: i32) -> i32 {
    let mut result: i32 = 1;
    for m in (2..(n / 2 + 1)).filter(|m| n % m == 0) {
        if n % m == 0 {
            result = result + m;
        }
    }
    result
}
```

B (slow):

```
fn d(n: i32) -> i32 {
    let mut result: i32 = 1;
    for m in (2..(n / 2 + 1)) {
        if n % m == 0 {
            result = result + m;
        }
    }
    result
}
```

The first one has the odd duplication in an attempt to isolate the difference, as on my machine it had the same performance as the version with fold.

On my machine, A runs around 0.54s while B runs around 0.73s.

Here are graphs of the optimized IR:

http://dl.dropbox.com/u/1237941/esap-a.pdf

http://dl.dropbox.com/u/1237941/esap-b.pdf

As you can see, the slower one (B) is much simpler, while the faster one (A) has an additional panic check (in A, `panic_loc3976` is “attempted remainder with a divisor of zero”, and `panic_loc3979` is “attempted remainder with overflow”; B has somehow gotten rid of the overflow one). The faster one also lacks an `nsw` (no signed wrap) flag on the add of 1, which is supposed to allow optimization, while the slower one keeps it. Makes total sense, right?

Assembly:

http://dl.dropbox.com/u/1237941/esap-a.s

http://dl.dropbox.com/u/1237941/esap-b.s

Edit: One odd thing in the assembly is some redundant code in the slow version:

```
+0x42	    cmpl                %r10d, %edi
+0x45	    setl                %al
+0x48	    movzbl              %al, %eax
+0x4b	    addl                %edi, %eax
+0x4d	    cmpl                %r10d, %edi
+0x50	    movl                %edi, %esi
+0x52	    movl                %eax, %edi
+0x54	    jl                  +0x30
```