How to explain slowdown with Sieve of Eratosthenes versus Nim and C?


#1

I was reading What is special about Nim, and they have an example to show how Nim is as fast as C by implementing the sieve of Eratosthenes in both languages and comparing the runtimes.

I thought it would be fun to compare Rust, so I tried a naive implementation and couldn’t match the performance. Here are the timings that I got:

Timings (best of 3):

> rustc --version
rustc 1.8.0-nightly (18b851bc5 2016-01-22)
> rustc -O test-rs.rs -o test-rs
> time ./test-rs
0

real	0m0.714s
user	0m0.679s
sys	0m0.031s

> nim --version
Nim Compiler Version 0.13.0 (2016-01-22) [MacOSX: amd64]
Copyright (c) 2006-2015 by Andreas Rumpf

git hash: a121c3f9eb2a348b9d6ae03ffd01aab26a238c30
active boot switches: -d:release

> nim c -d:release test.nim
> mv test test-nim
> time ./test-nim
0

real	0m0.554s
user	0m0.519s
sys	0m0.027s

Rust source:

#![feature(step_by)]

fn eratosthenes(n: i32) -> Vec<i8> {
    let mut result: Vec<i8> = vec![0; (n + 1) as usize];
    result[0] = 1;
    result[1] = 1;
    let slimit = (n as f64).sqrt() as usize;

    for i in 2..slimit {
        if result[i] == 0 {
            for j in ((i * i)..n as usize).step_by(i) {
                result[j] = 1;
            }
        }
    }

    println!("{:?}", result[(n as f64).sqrt() as usize]);
    return result;
}

fn main() {
    eratosthenes(100_000_000);
}

Nim source:

import math

proc eratosthenes(n): auto =
  result = newSeq[int8](n+1)
  result[0] = 1; result[1] = 1

  for i in 0 .. int sqrt(float n):
    if result[i] == 0:
      for j in countup(i*i, n, i):
        result[j] = 1
  
  echo(result[int sqrt(float n)])

discard eratosthenes(100_000_000)

C source:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

char* eratosthenes(int n)
{
  char* sieve = calloc(n+1,sizeof(char));
  sieve[0] = 1; sieve[1] = 1;
  int m = (int) sqrt((double) n);

  for(int i = 0; i <= m; i++) {
    if(!sieve[i]) {
      for (int j = i*i; j <= n; j += i)
        sieve[j] = 1;
    }
  }
  printf("%d", sieve[(int) sqrt((double) n)]);
  return sieve;
}

int main() {
  eratosthenes(100000000);
}

Even though this is a microbenchmark, I want to use this as an opportunity to learn more about Rust by learning why it’s not as fast as Nim and C in this scenario, and hopefully to avoid that when I write production code. :slight_smile: What am I missing?


#2

Without profiling, my first guess is that it’s all the []s; you’re getting bounds checks every single time in Rust.

Re-writing this with iterators or skipping the checks through get_unchecked should speed it up.
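
As a sketch of the iterator approach (illustrative, not tuned): iterate over the slice directly instead of indexing into it, so there is no index and hence no bounds check, and no unsafe either.

```rust
// Safe, index-free variant of the sieve's inner loop: `iter_mut()` walks the
// slice itself, `skip` jumps to i*i, and `step_by` touches every i-th slot.
// The iterator stops at the slice boundary on its own.
fn eratosthenes(n: usize) -> Vec<u8> {
    let mut result = vec![0u8; n + 1];
    result[0] = 1;
    result[1] = 1;
    let slimit = (n as f64).sqrt() as usize;

    for i in 2..=slimit {
        if result[i] == 0 {
            for slot in result.iter_mut().skip(i * i).step_by(i) {
                *slot = 1;
            }
        }
    }
    result
}
```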


#3

I see that you have optimizations on, so the first mandatory question is already answered. On to the second one: are you sure that you aren’t measuring IO speed? Console printing can be surprisingly slow, and the formatting machinery in Rust is different from how C does it, so I would recommend that you don’t include it in your measurements. Just to be sure.
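
For instance, a harness along these lines (a sketch, not from the benchmark; the sieve body here is just a stand-in) measures only the sieve and does the printing after the clock stops:

```rust
use std::time::Instant;

// Stand-in sieve so the snippet compiles on its own; substitute the version
// under test when actually benchmarking.
fn eratosthenes(n: usize) -> Vec<u8> {
    let mut result = vec![0u8; n + 1];
    result[0] = 1;
    result[1] = 1;
    let slimit = (n as f64).sqrt() as usize;
    for i in 2..=slimit {
        if result[i] == 0 {
            let mut j = i * i;
            while j <= n {
                result[j] = 1;
                j += i;
            }
        }
    }
    result
}

fn main() {
    let start = Instant::now(); // start the clock
    let sieve = eratosthenes(100_000_000);
    let elapsed = start.elapsed(); // stop it before any IO
    // Formatting and console output happen outside the measured region.
    println!("sieve took {:?}; sieve[10000] = {}", elapsed, sieve[10_000]);
}
```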


#4

I’m not sure, but Nim should also do bounds checking, so this is not entirely fair.

I wonder if there is some iterator incantation that could help avoid the checks here without unsafe?

EDIT: no, bounds checks in Nim are off with -d:release.


#5

Your C code doesn’t do the same things as the Rust code, there are some differences.

I’ve used this modified C code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

unsigned char *eratosthenes(const size_t n) {
    unsigned char *result = calloc(n + 1, sizeof(unsigned char));
    result[0] = 1;
    result[1] = 1;
    const size_t s_limit = (size_t)sqrt((double)n);

    for (size_t i = 2; i <= s_limit; i++) {
        if (result[i] == 0) {
            for (size_t j = i * i; j <= n; j += i) {
                result[j] = 1;
            }
        }
    }
    
    printf("%d", result[s_limit]);
    return result;
}

int main() {
    eratosthenes(100000000);
}

Then this Rust code is about as fast (about 5% slower on my system; the back-ends differ, since Rust uses LLVM, so such a difference is not significant):

#![feature(step_by)]

fn eratosthenes(n: usize) -> Vec<u8> {
    let mut result = vec![0u8; n + 1];
    result[0] = 1;
    result[1] = 1;
    let s_limit = (n as f64).sqrt() as usize;

    for i in 2 .. s_limit + 1 {
        if result[i] == 0 {
            for j in ((i * i) .. n + 1).step_by(i) {
                //result[j] = 1;
                unsafe { *result.get_unchecked_mut(j) = 1; }
            }
        }
    }

    println!("{}", result[s_limit]);
    return result;
}

fn main() {
    eratosthenes(100_000_000);
}

Rust code compiled with:
-C opt-level=3 -C target-cpu=native -C lto

C code compiled with gcc 5.3.0:
-Wall -Wextra -Ofast -flto

The Rust compiler isn’t able to remove the bounds check on that assignment on its own, hence the get_unchecked_mut.


#6

I believe -d:release turns bounds checks off.


#7

Yes, Nim seems to turn off bounds checking in release mode.

In the Rust version, I replaced the indexing with get_unchecked and get_unchecked_mut, and that improves it to around 0.600s which is better but still behind Nim and C.

@leonardo I tried your code and the Rust code performs the same as the one I posted with the above mods, and the C code is 20% faster using clang -Ofast.

What would be the idiomatic way to match the performance? I believe it should be possible; I looked at the iterators, but I had a hard time figuring out how to wrap this up with them. I’m looking at Rust for writing this sort of code so that I can get performance without the footguns of C, tying it together with a higher-level language where performance doesn’t matter, and it would be unfortunate if I were leaving performance on the table. :frowning:


#8

You worry too much about minor percentages. 20% difference in a microbenchmark is not a good data point to look for “idiomatic ways to match performance”.

Also, this becomes highly machine and compiler specific. For example, on my system the C and Rust (with bounds checking) programs both take about 2.1 seconds, compiled with GCC 5.3 and Rust 1.8, respectively.


#9

Random observation: replacing the inner for-loop with an equivalent while-loop, without any iterator fanciness, improves the performance significantly compared to the original program. I would have expected rustc/LLVM to produce identical code for both versions.

fn eratosthenes(n: i32) -> Vec<i8> {
    let mut result: Vec<i8> = vec![0; (n + 1) as usize];
    result[0] = 1;
    result[1] = 1;
    let slimit = (n as f64).sqrt() as usize;

    for i in 2..slimit {
        if result[i] == 0 {
            let mut j = i * i;
            while j < n as usize {
                result[j] = 1;
                j += i;
            }
        }
    }

    println!("{:?}", result[(n as f64).sqrt() as usize]);
    return result;
}

fn main() {
    eratosthenes(100_000_000);
}

#10

There’s a few places where LLVM isn’t a perfect fit for Rust’s typical code patterns. For instance, LLVM doesn’t understand enums in a great way, which most often rears its head with iterators: it seems to sometimes perform local optimisations inside Iterator::next (which returns Option) which make it harder for callers of it to understand what is happening when next is inlined. E.g. #24660 is an old example with .. (without step_by) that is now fixed. Obviously this is very unfortunate, but it’s not a fundamental language issue.

I’m not sure this case is caused directly by enums, although they may well be involved. I filed #31155.


#11

This version of a stepped iterator improves it, too. Just like the while loop it does not check overflow (leaving that to debug assertions).

use std::ops::Range;

pub struct Step<T> {
    r: Range<T>,
    step: T,
}

impl<T> Step<T> {
    #[inline]
    pub fn new(start: T, end: T, step: T) -> Self {
        Step { r: start..end, step: step }
    }
}

impl Iterator for Step<usize> {
    type Item = usize;
    #[inline]
    fn next(&mut self) -> Option<usize> {
        if self.r.start < self.r.end {
            let elt = self.r.start;
            //self.r.start = self.r.start.saturating_add(self.step);  // does not optimize well
            self.r.start += self.step;
            Some(elt)
        } else {
            None
        }
    }
}
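
A usage sketch (added for illustration; the Step definition is repeated so the snippet stands alone) of how it would slot into the sieve:

```rust
use std::ops::Range;

// Minimal stepped-range iterator: yields start, start+step, ... below end.
pub struct Step<T> {
    r: Range<T>,
    step: T,
}

impl<T> Step<T> {
    #[inline]
    pub fn new(start: T, end: T, step: T) -> Self {
        Step { r: start..end, step: step }
    }
}

impl Iterator for Step<usize> {
    type Item = usize;
    #[inline]
    fn next(&mut self) -> Option<usize> {
        if self.r.start < self.r.end {
            let elt = self.r.start;
            self.r.start += self.step; // unchecked add; debug builds catch overflow
            Some(elt)
        } else {
            None
        }
    }
}

fn eratosthenes(n: usize) -> Vec<u8> {
    let mut result = vec![0u8; n + 1];
    result[0] = 1;
    result[1] = 1;
    let slimit = (n as f64).sqrt() as usize;
    for i in 2..slimit + 1 {
        if result[i] == 0 {
            // Replaces the unstable `((i * i)..n + 1).step_by(i)` inner loop.
            for j in Step::new(i * i, n + 1, i) {
                result[j] = 1;
            }
        }
    }
    result
}
```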

#12

It’s also much faster for me with the while loop! I compared the emitted ASM, and there doesn’t seem to be much difference whether or not I use the unsafe get methods, so maybe the bounds checks are getting eliminated anyway (I’m not an expert at reading ASM, however! :slight_smile: ) It seems Rust was even able to constant-fold the square root, while the C code still calls sqrtsd.

Things look slightly different if we pass in the argument at runtime instead of as a compile-time constant. In this case, there seems to be bounds checking without the unsafe code, but the actual runtime still seems to be just as good as the C version.
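
A sketch of that runtime-argument variant (illustrative; the argument handling is not from the thread), so the optimizer cannot specialize on a compile-time constant:

```rust
use std::env;

fn eratosthenes(n: usize) -> Vec<u8> {
    let mut result = vec![0u8; n + 1];
    result[0] = 1;
    result[1] = 1;
    let slimit = (n as f64).sqrt() as usize;
    for i in 2..=slimit {
        if result[i] == 0 {
            let mut j = i * i;
            while j <= n {
                result[j] = 1;
                j += i;
            }
        }
    }
    result
}

fn main() {
    // Read n from the command line so it is unknown at compile time;
    // fall back to the thread's 100_000_000 if no argument is given.
    let n: usize = env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(100_000_000);
    let sieve = eratosthenes(n);
    println!("{}", sieve[(n as f64).sqrt() as usize]);
}
```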

Thanks for the feedback and investigation, this was very interesting for me! So it looks mainly like an optimization issue with the iterator, which we can hope will be fixed in the future?