Is Nim faster than Rust?

I thought I would compare these two languages on twin primes, since I don't trust online benchmarks, so I wrote a test that finds 500k twin primes with a segmented sieve:

The Rust code:

%%bash
cat << 'EOF' > twin_primes_segmented.rs
// twin_primes_segmented.rs
use std::time::Instant;

fn simple_sieve(limit: usize) -> Vec<bool> {
    let mut is_prime = vec![true; limit + 1];
    is_prime[0] = false;
    if limit >= 1 {
        is_prime[1] = false;
    }

    for i in 2..=((limit as f64).sqrt() as usize) {
        if is_prime[i] {
            for j in ((i * i)..=limit).step_by(i) {
                is_prime[j] = false;
            }
        }
    }
    is_prime
}

fn segmented_sieve(start: usize, end: usize, base_primes: &[usize]) -> Vec<bool> {
    let size = end - start + 1;
    let mut is_prime = vec![true; size];

    for &prime in base_primes {
        let start_idx = std::cmp::max(prime * prime, ((start + prime - 1) / prime) * prime);

        for j in (start_idx..=end).step_by(prime) {
            if j >= start {
                is_prime[j - start] = false;
            }
        }
    }

    is_prime
}

fn find_twin_primes(target_count: usize) -> f64 {
    let start_time = Instant::now();

    let sqrt_limit = 1_000_000; // Adjust as needed
    let base_primes_sieve = simple_sieve(sqrt_limit);
    let base_primes: Vec<usize> = (2..=sqrt_limit).filter(|&i| base_primes_sieve[i]).collect();

    let mut count = 0;
    let mut prev_prime = 3;
    let mut current_segment_start = 3;
    let segment_size = 1_000_000;

    loop {
        let segment_end = std::cmp::min(current_segment_start + segment_size - 1, 50_000_000); // Upper bound

        let segment_primes = segmented_sieve(current_segment_start, segment_end, &base_primes);

        for i in 0..segment_primes.len() {
            if segment_primes[i] && current_segment_start + i >= 3 {
                let candidate = current_segment_start + i;
                if candidate - prev_prime == 2 {
                    count += 1;
                    if count == target_count {
                        return start_time.elapsed().as_secs_f64();
                    }
                }
                prev_prime = candidate;
            }
        }

        current_segment_start = segment_end + 1;
        if current_segment_start > 50_000_000 {
            break;
        }
    }

    start_time.elapsed().as_secs_f64()
}

fn main() {
    let elapsed = find_twin_primes(500000);
    println!("{}", elapsed);
}
EOF

And the compilation:
%%bash
source $HOME/.cargo/env

rustc -O twin_primes_segmented.rs
./twin_primes_segmented

The result for Rust:
0.827 seconds
Subsequent runs:
0.57, then saturated at 0.54 seconds


Then I tried Nim as well:

%%bash
cat << 'EOF' > twin_primes_segmented.nim
# twin_primes_segmented.nim
import times, math

const
  SegmentSize = 1_000_000
  MaxN = 50_000_000

# Simple sieve (correct)
proc simpleSieve(limit: int): seq[int] =
  var isPrime = newSeq[bool](limit + 1)
  for i in 2..limit:
    isPrime[i] = true

  for i in 2..int(sqrt(float(limit))):
    if isPrime[i]:
      for j in countup(i * i, limit, i):
        isPrime[j] = false

  for i in 2..limit:
    if isPrime[i]:
      result.add(i)

# Segmented sieve
proc segmentedSieve(start, endN: int, basePrimes: seq[int]): seq[uint8] =
  let size = endN - start + 1
  result = newSeq[uint8](size)
  for i in 0..<size:
    result[i] = 1

  for p in basePrimes:
    let p2 = p * p
    if p2 > endN:
      break

    var j = max(p2, ((start + p - 1) div p) * p)
    while j <= endN:
      result[j - start] = 0
      j += p

# Twin primes
proc findTwinPrimes(targetCount: int): float =
  let startTime = epochTime()

  let baseLimit = int(sqrt(float(MaxN)))
  let basePrimes = simpleSieve(baseLimit)

  var count = 0
  var prevPrime = 3
  var segmentStart = 3

  while segmentStart <= MaxN:
    let segmentEnd = min(segmentStart + SegmentSize - 1, MaxN)
    let segment = segmentedSieve(segmentStart, segmentEnd, basePrimes)

    for i in 0..<segment.len:
      if segment[i] == 1:
        let p = segmentStart + i
        if p >= 3:
          if p - prevPrime == 2:
            inc count
            if count == targetCount:
              return epochTime() - startTime
          prevPrime = p

    segmentStart = segmentEnd + 1

  epochTime() - startTime

let elapsed = findTwinPrimes(500_000)
echo elapsed
EOF

Compilation :
%%bash
export PATH=$HOME/.nimble/bin:$PATH

nim c -d:release --opt:speed --checks:off --assertions:off --overflowChecks:off twin_primes_segmented.nim

./twin_primes_segmented

The results:
0.52 seconds on the first run
0.25 seconds on subsequent runs, then saturated at 0.23 seconds

Based on speed, Nim appears to be the clear winner in this test.
Nim also produced a very small binary.
Nim seems to have a simple, small compiler, yet it won the race. I was amazed by Nim's performance, and the code is so simple, easy, and expressive.

Or is there a problem in my code?

1 Like

Well, I don't know about the code (I don't speak Nim), but there's definitely an issue with the conclusion:

Even if the Nim code executes faster in this example, that doesn't generalize to "Nim is faster than Rust" (nor vice versa), because of every other kind of operation that could be involved, e.g. I/O, async, allocation costs, code optimization in complex scenarios, etc.

2 Likes

I didn't conclude anything yet, nor will I; I am just asking.
But whatever the case, Nim also seems to be great, especially the lightweight nature of its compiler.

1 Like

I don't speak Nim either, but one issue I see in the Rust code is your use of inclusive ranges. Those optimise poorly in LLVM due to having to properly handle the case where your endpoint is the maximum possible integer.

Try using .. instead of ..=.

3 Likes

I read that Cargo's --release adds some extra compiler options, so you might want to try that?

Also consider printing the answers in case of mistakes.

Just a question, is Rust's usize equivalent to Nim's int?

As for my guess for the performance difference, one of the reasons might be that Rust compiles lots of library code into the binary (I remember my hello world is 101KB) and takes time to load when running the binary.

2 Likes

Nim's int type is 64 bits on a 64-bit OS, so it should be identical.

I just tested the ..= range, and while there was no compiler or clippy warning, using .. and adding 1 to the upper limit seems to give a 50% speed gain.

Generally, using --checks:off in Nim should turn off all array index checks and all overflow checks, while Rust enables array index checks by default even in release builds.

But well, Nim was fine, I used it for nine years. But there might be reasons why I switched to Rust two years ago, and never touched Nim again in the last three years :slight_smile:

I am too lazy to compare the Rust and Nim code in detail, but Claude has some doubts:

CAUTION, AI CONTENT:

❯ Fine. The forum post says that the Nim version is faster. I can not test it, as I have currently no Nim installation available. But let us compare the code, if it really does the same. I copied it to src/sieve.nim.

● Read(src/sieve.nim)
⎿ Read 72 lines

● The two versions are not doing the same work. There are two significant differences that favor the Nim version:

  1. Base prime limit — the big one
  • Nim computes base primes up to sqrt(MaxN) = sqrt(50_000_000) ≈ 7,071 → roughly 900 primes
  • Rust computes base primes up to 1,000,000 → 78,498 primes

You only need primes up to sqrt(N) to sieve up to N. The Rust code generates and then iterates over ~87x more base primes than necessary.

  2. Early break in segmented sieve
  • Nim (src/sieve.nim:32-33) breaks out of the base primes loop when p * p > endN, since primes are sorted and no further primes can have multiples in the segment.
  • Rust has no such early exit — it iterates through all 78,498 base primes for every segment, even though the vast majority have no effect.

These compound: Rust iterates over ~78,498 primes per segment with no early exit, while Nim iterates over ~900 and stops early. For the 50 segments processed, that's a lot of wasted loop iterations in the Rust version.

Minor difference:

  • Nim uses uint8 for the segment sieve, Rust uses bool. In practice these are the same size (1 byte), so no real impact.

The algorithm is otherwise identical — same segment size, same twin prime detection logic, same upper bound. The performance gap is almost certainly explained by the base prime limit difference rather than a language speed
difference.

o.O? Will test it out when I have time

1 Like

Confirmed an early break in Nim segmentedSieve:

let p2 = p * p
if p2 > endN:
  break

var j = max(p2, ((start + p - 1) div p) * p)
while j <= endN:
  # ...
  j += p

which is not present in Rust segmented_sieve:

for &prime in base_primes {
    let start_idx = std::cmp::max(prime * prime, ((start + prime - 1) / prime) * prime);

    for j in (start_idx..=end).step_by(prime) {

(Assisted by AI.)

After doing a quick test in the Compiler Explorer to compare .. and ..=, I didn't see anything related to usize::MAX, but I saw that the latter didn't generate a SIMD loop alternative, which LLVM usually does. Instead, it generated only the regular loop, processing items one at a time (even if I try with u32). If the same happens in the code above, it could explain the big impact.

There must be a reason for it, but it's quite the eye-opener.

The issue with ..= has been discussed a few times in this forum in the last two years -- I have remembered to avoid ..=, but not the details. I think it was about a comparison with the upper bound, e.g. when the upper bound is max(i32), so that the compiler cannot replace the check with a check against (top + 1). I wonder if a clippy warning would be possible and useful?

One of the discussions I missed:

Reddit - The heart of the internet

2 Likes

Here's a simple inclusive sum function as compiled by godbolt, annotations mine:

; fn inclusive_sum(a: &[f32], n: usize) -> f32
; rsi = a.len()
; rdi = a.as_ptr()
; rdx = n
example::inclusive_sum::hcf22d73f34342c03:
        xorps   xmm0, xmm0 ; let mut res = 0;
        xor     eax, eax   ; let mut i = 0;
.LBB1_1:
        cmp     rax, rsi   ; let bound_check = i.cmp(a.len())
        jae     .LBB1_5    ; if bound_check.is_ge() { buffer overflow, panic }
        cmp     rax, rdx   ; let comp = i.cmp(n)
        mov     rcx, rax   ; let mut tmp = i
        adc     rcx, 0     ; if comp.is_lt() { tmp += 1 }
        addss   xmm0, dword ptr [rdi + 4*rax] ; res += a[i]
        cmp     rax, rdx   ; let comp = i.cmp(n)
        jae     .LBB1_4    ; if comp.is_ge() { return res } 
        mov     rax, rcx   ; i = tmp
        cmp     rcx, rdx   ; let comp = tmp.cmp(n)
        jbe     .LBB1_1    ; if comp.is_le() { loop }
.LBB1_4:
        ret                ; return res
.LBB1_5:
        ; panic on buffer overflow
        ; 8< snip 8<

The interesting part that takes care of the n == usize::MAX case is this:

        cmp     rax, rdx   ; let comp = i.cmp(n)
        mov     rcx, rax   ; let mut tmp = i
        adc     rcx, 0     ; if comp.is_lt() { tmp += 1 }
        ; 8< snip 8<
        cmp     rax, rdx   ; let comp = i.cmp(n)
        jae     .LBB1_4    ; if comp.is_ge() { return ret }
        mov     rax, rcx   ; i = tmp

The temporary variable is needed because i cannot be unconditionally incremented without risking overflow. So a temp variable is instead incremented if it's safe to do so (ie. if i < n)[1], then i is again compared against n[2] to check if it's time to return, and if not (ie. if i < n), i is then assigned the incremented value. This extra logic means the optimizer isn't able to vectorize or unroll the loop.

In this case i can never reach usize::MAX anyway, because the maximum length of a (edit: non-ZST) slice is in fact isize::MAX as usize, but the optimizer isn't smart enough to exploit that fact. Sometimes it's possible to convince LLVM to generate better code by manually asserting bounds like that, but in this case it doesn't seem to help :frowning: But you can at least get rid of the suboptimal, redundant slice bounds checking by manually testing it once before the loop.


  1. cmp x, y sets the carry flag iff x < y (technically, if sub x, y has unsigned underflow); adc z, 0 adds the carry flag to z, thus incrementing z iff x < y. ↩︎

  2. the same comparison has to be done twice because the flags may have been clobbered :frowning: ↩︎

7 Likes

That's not technically true. The maximum memory allocation is isize::MAX: assert_eq!(usize::MAX, [(); usize::MAX].as_slice().len());.

Yes, I was focused on the non-ZST [f32] in the example :grin:

1 Like

I assumed you were aware of that, but I was replying just in case others weren't[1][2].


  1. I could be even more obnoxious and say that's still not correct since you said "maximum" and not "upper bound" (i.e., the max length of [f32] is isize::MAX / 4) ;). ↩︎

  2. While probably pedantic, I actually think it's quite easy to write buggy code where one assumes the length of a slice is bound by isize::MAX seeing how relatively common generic code is in Rust. ↩︎

Thanks for the detailed explanation! I see the problem, now.

Hi, sorry, that is why I asked whether my code is correct.
I corrected the bug and even removed ..=.

The results for Rust:
0.42 sec, then saturated at 0.40

Nim:
0.40 sec, then saturated at 0.27

I corrected an error in the Nim code as well, which was making it slow.
It seems like this is a real battle; there's not much difference between Rust and Nim. Still, the small Nim compiler is great and powerful, not to be underestimated.

2 Likes

I tried to create a summary in my own words :slight_smile:

Loops with inclusive ranges

Sometimes it can be convenient to use loops with ranges or explicit conditions where
the terminating condition is inclusive as in the following quite common example:

#[inline(never)]
pub fn inclusive_sum(a: &[f32], n: usize) -> f32 {
    let mut res = 0.0;
    for i in 0..=n {
        res += a[i];
    }
    res
}

Or we might use explicit loop conditions like while i <= max or while i >= min.
This use of inclusive ranges seems to be fine, as CPUs typically support inclusive range tests directly.
However, a performance problem might occur when the loop terminating condition uses a value which
is unknown and unbounded at compile time, like n in the example code above. That value might actually have the
largest or smallest possible value of the underlying data type.

Without an additional overflow check, explicit and implicit tests for the conditions <= or >= would never fail!
While the Rust compiler typically disables overflow checks in release builds, it has to keep them for security reasons
for such inclusive loops with arbitrary limits, and this can have a noticeable performance impact.
So for performance-critical code, it might be a good idea to avoid inclusive conditions and ranges, at least for
cases where the loop boundary is uncertain, e.g. an arbitrary function parameter. For cases where the limit is a constant expression or
a value whose possible range the compiler knows from the context, this is typically not an issue. This typically includes inlined functions as well.

This is not an accurate description; rather, the impl<A> Iterator for RangeInclusive<A> has explicitly written logic to iterate fully without performing any potentially overflowing arithmetic. That code is not an overflow check, let alone a special-cased one. It's just a correct implementation of doing what it says it does.

6 Likes

Important to note: when the same Nim code was compiled with the nim cpp command (on my friend's advice), it was much faster. It took just 0.22 seconds on the first run and 0.20 seconds on subsequent runs; I don't know why.

I want to learn similar optimizations in Rust as well.

I could disregard the obvious impracticality of having usize::MAX items in an array, which would exhaust the virtual address range when the iterator is used directly as an index without transformation (plus the limit you mentioned). Maybe it’s considered too specific, though it would significantly pay off. Or maybe it’s risky if some trick is used to address only a part of the range near its end, though I doubt it’s common practice (such tricks should simply require to disable the optimization).

But I wonder about the SIMD part. The back-end optimizer splits a loop into two parts: the SIMD loop, which processes n / 2^B items, and the remainder loop, which processes n % 2^B items. Since the maximum iterator value won't be a multiple of 2^B (2, 4, 8, ... values), I think the SIMD part could be used as normal and the double-register pattern could be used for the remainder.

As for unrolling, it seems unable to do that even when the range is fixed at compilation time. Yet it seems trivial to transform 0..=15 into 0..16, or even to test the upper bound before deciding which optimization is acceptable.

It’s surprising those cases aren’t handled, when I see the optimizations it’s capable of pulling off in general. I assume the SIMD optimization is simply not as advanced as the rest.

1 Like