Is Nim faster than Rust?

This is one of those things that we might be able to improve with the new range type (coming soon).

The current ops::RangeInclusive is one type for both the range and the iterator, which makes it hard to have a place to put that kind of thing while still being worth it overall.

The new `RangeInclusive` in `std::range`, however, is not an iterator, so the `into_iter()` call could do some fixups.

...err, while saying that I realized that, as defined on nightly, the signature means it actually can't, really. I wrote a comment on the tracking issue suggesting that we quickly change that. Oh, it is bounded; rustdoc just doesn't put it in the `<>`, which apparently was the only part my brain looked at :person_facepalming:

Wouldn’t dealing with multiple backends like LLVM, GCC, and Cranelift make that sort of thing quite complicated?

No, because it would just be in Rust code in the standard library.

For example, change this from being a pass-through to something that looks at the bounds:
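The linked snippet isn't reproduced here, but to illustrate the idea, here is a hypothetical sketch (not the actual std code) of an inclusive-range iterator whose construction checks the bounds once, up front, so the loop body never has to handle the empty or overflow cases:

```rust
// Hypothetical sketch: an `into_iter`-style constructor that looks at the
// bounds once, before the loop, instead of passing them through unchanged.
struct InclusiveIter {
    next: u32,
    end: u32,
    done: bool,
}

fn iter_inclusive(start: u32, end: u32) -> InclusiveIter {
    // The emptiness check happens here, once; the iterator itself only
    // needs the `done` flag afterwards.
    InclusiveIter { next: start, end, done: start > end }
}

impl Iterator for InclusiveIter {
    type Item = u32;
    fn next(&mut self) -> Option<u32> {
        if self.done {
            return None;
        }
        let v = self.next;
        if v == self.end {
            self.done = true; // stop without incrementing past `end`
        } else {
            self.next += 1;
        }
        Some(v)
    }
}

fn main() {
    assert_eq!(iter_inclusive(1, 3).collect::<Vec<_>>(), vec![1, 2, 3]);
    assert_eq!(iter_inclusive(3, 1).count(), 0);
    // Works at the type's maximum without overflowing the counter:
    assert_eq!(iter_inclusive(u32::MAX - 1, u32::MAX).count(), 2);
    println!("ok");
}
```

The names `InclusiveIter` and `iter_inclusive` are made up for the sketch; the point is only where the bounds check lives.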

What I meant was, if each backend had its own way of dealing with features like inclusive range, and possibly a different optimization workaround than the other backends, there would be no way to improve that in the Rust code unless you specify which one is used.

But I don’t know if they all have their own quirks, either for inclusive ranges or for other potential under-optimizations. It’s likely the workaround for one compiler isn’t a problem for the others.

Anyway, it was just out of curiosity, nothing more. :smiley:

It's largely that iterating `MIN..=MAX` is just fundamentally harder because of the state space, not because of some specific property of LLVM. Even `while i <= n { i += 1; }` has the problem that `n` might be `MAX`, and that's the core of why it's more annoying to optimize: either it needs to accept that it might be an infinite loop, and thus pessimize things because of that, or handle the special overflow case, and thus pessimize things because of that.
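A small illustration of the state-space point: over `u8`, the inclusive range visits 256 values, one more than the counter type itself can represent, so the iterator needs extra state beyond the start and end bounds.

```rust
fn main() {
    // 0..=255 over u8 visits 256 values -- one more than a u8 counter can
    // hold -- so RangeInclusive carries an "exhausted" flag in addition to
    // its start and end.
    assert_eq!((0u8..=255).count(), 256);

    // The half-open 0..255 visits only 255 values; comparing against the
    // end bound alone is enough to terminate.
    assert_eq!((0u8..255).count(), 255);

    println!("ok");
}
```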

This should be roughly equally fast:

    if MIN <= MAX {
        let end = MAX.wrapping_add(1);
        let mut i = MIN;
        loop {
            body(i);
            i = i.wrapping_add(1);
            if i == end {
                break;
            }
        }
    }
Well yes, because you added a separate check outside the loop. That's basically what I'm suggesting -- the into_iter call is before the loop.

A regular while loop (or for i in a..b) typically compiles to the same pattern. You need a conditional jump before the first iteration in case the loop is empty, and also another conditional jump at the end of the loop to see whether to go back for another iteration. In both cases the nonempty check can sometimes be optimized out.

I thought we discussed that in my earlier post, but maybe I wasn’t clear. For now, LLVM only generates the alternative loop with two loop registers shown a few posts above, but it could split it into a first SIMD loop for the safe counter values and a second loop, using that alternative two-register form, for the remaining “risky” values, which potentially include MAX. That’d already be a significant improvement without tackling a possible wraparound. The remainder would take more cycles than the optimal case (the code used for the exclusive range), but not dramatically so.

EDIT: I mean a split like this, which LLVM does naturally for exclusive ranges. I removed the index check for clarity (some initialization and last <= 7 exception code removed, too):

; pub fn sum_u32_opt(last: usize, v: &[u32]) -> u32
sum_u32_opt:
        mov     rcx, rdi
        and     rcx, -8    ; let n1 = last & !7
        je      .LBB2_1
        cmp     rdi, 7
        ja      .LBB2_5
>------------snip------------<
        pxor    xmm0, xmm0 ; let mut sum = 0;
        xor     eax, eax   
        pxor    xmm1, xmm1        
.LBB2_6:                   
        ; for i in 0..n1 { sum += v[i]; }  
        movdqu  xmm2, xmmword ptr [rsi + 4*rax]
        paddd   xmm0, xmm2
        movdqu  xmm2, xmmword ptr [rsi + 4*rax + 16]
        paddd   xmm1, xmm2
        add     rax, 8
        cmp     rcx, rax
        jne     .LBB2_6
        paddd   xmm1, xmm0
        pshufd  xmm0, xmm1, 238
        paddd   xmm0, xmm1
        pshufd  xmm1, xmm0, 85
        paddd   xmm1, xmm0
        movd    eax, xmm1
.LBB2_8:
        ; for i in n1..=last { sum += v[i]; }  
        mov     rdx, rcx
        cmp     rcx, rdi
        adc     rcx, 0
        add     eax, dword ptr [rsi + 4*rdx]
        cmp     rdx, rdi
        jae     .LBB2_10
        cmp     rcx, rdi
        jbe     .LBB2_8
.LBB2_10:
        ret

Or it could use the preliminary test you’re mentioning instead.

So for me, it’s simply a case that LLVM doesn’t optimize yet, maybe because the safe, 2-register loop is generated in an earlier optimization stage or even before, which prevents the SIMD parallelization if it’s done in a later stage. I haven’t tried on GCC to see if it had the same limitation, but I suppose that each backend has a set of situations where it needs a little guidance, hence my original question.
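For illustration, the split described above can also be written by hand in safe Rust (a sketch, not the benchmark code): an exclusive bulk range, which LLVM vectorizes readily, plus a short inclusive tail that handles the last few indices. It assumes `last < usize::MAX` and `last < v.len()`.

```rust
// Hand-written version of the bulk/tail split: the bulk loop uses an
// exclusive range (SIMD-friendly), the tail uses the inclusive range.
fn sum_split(last: usize, v: &[u32]) -> u32 {
    // Largest multiple of 8 not exceeding last + 1 (assumes last < usize::MAX).
    let n1 = (last + 1) & !7;

    // Bulk: exclusive range over a length that is a multiple of 8.
    let mut sum = v[..n1].iter().fold(0u32, |a, &x| a.wrapping_add(x));

    // Tail: at most 7 "risky" iterations, including index == last.
    for i in n1..=last {
        sum = sum.wrapping_add(v[i]);
    }
    sum
}

fn main() {
    let v: Vec<u32> = (0..20).collect();
    assert_eq!(sum_split(9, &v), 45);   // 0 + 1 + ... + 9
    assert_eq!(sum_split(19, &v), 190); // 0 + 1 + ... + 19
    println!("ok");
}
```

Whether LLVM actually emits SIMD for the bulk loop depends on the target and optimization level, but the shape matches the two-loop structure in the assembly above.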

Anyway, it wasn’t really important.

There are differences in the code:

  • The Nim baseline limit is `let baseLimit = int(sqrt(float(MaxN)))`, and it uses an early return if `p * p > endN`. It uses a dynamic array of u8.
  • The Rust baseline limit is 1_000_000, with no early return. There is an additional branch check, `if j >= start`, that exists in Rust but not in Nim. It uses a `Vec<bool>`.

That does not affect the count result, but it does affect the operations needed to get it: Rust does more loops and branching, and `Vec<bool>` may have lower performance than `Vec<u8>`.

I tried to make the code apples to apples:

use std::time::Instant;

const SEGMENT_SIZE: usize = 1_000_000;
const MAX_N: usize = 50_000_000;

fn simple_sieve(limit: usize) -> Vec<usize> {
    let mut is_prime = vec![1u8; limit + 1];
    if limit >= 1 { is_prime[1] = 0; }

    let sqrt = (limit as f64).sqrt() as usize;
    for i in 2..=sqrt {
        if is_prime[i] == 1 {
            let mut j = i * i;
            while j <= limit {
                is_prime[j] = 0;
                j += i;
            }
        }
    }

    let mut primes = Vec::new();
    for i in 2..=limit {
        if is_prime[i] == 1 {
            primes.push(i);
        }
    }
    primes
}

fn segmented_sieve(start: usize, end: usize, base_primes: &[usize]) -> Vec<u8> {
    let size = end - start + 1;
    let mut segment = vec![1u8; size];

    for &p in base_primes {
        let p2 = p * p;
        if p2 > end {
            break;
        }

        let mut j = if p2 >= start {
            p2
        } else {
            ((start + p - 1) / p) * p
        };

        while j <= end {
            segment[j - start] = 0;
            j += p;
        }
    }

    segment
}

fn find_twin_primes(target: usize) -> f64 {
    let start_time = Instant::now();

    let base_limit = (MAX_N as f64).sqrt() as usize;
    let base_primes = simple_sieve(base_limit);

    let mut count = 0usize;
    let mut prev_prime = 3usize;
    let mut segment_start = 3usize;

    while segment_start <= MAX_N {
        let segment_end = (segment_start + SEGMENT_SIZE - 1).min(MAX_N);
        let segment = segmented_sieve(segment_start, segment_end, &base_primes);

        for i in 0..segment.len() {
            if segment[i] == 1 {
                let p = segment_start + i;
                if p - prev_prime == 2 {
                    count += 1;
                    if count == target {
                        return start_time.elapsed().as_secs_f64();
                    }
                }
                prev_prime = p;
            }
        }

        segment_start = segment_end + 1;
    }

    let time = start_time.elapsed().as_secs_f64();
    // println!("count : {}", count); just to inspect the count result
    time
}

fn main() {
    let elapsed = find_twin_primes(500_000);
    println!("{}", elapsed);
}

And I added this to the release profile to disable runtime checks, like the Nim command:

[profile.release]
opt-level = 3
codegen-units = 1
panic = 'abort'
debug = false
overflow-checks = false

I'm not using `target-cpu=native` because Nim also does not use it.

`cargo build --release`. Now the Rust speed is around 0.2, 0.19, and 0.18 seconds. The fastest is 0.18, while the more consistent result is around 0.2.

Nim is fast because it is a C transpiler: Nim is compiled to C, which is then compiled to machine code.

That’s interesting, but wouldn’t we need a proper frame of reference by comparing the Nim execution on the same system? Two absolute values on two systems that are likely different are hard to compare.

Note also the difference mentioned by others earlier about inclusive ranges, which generated all that discussion. This could speed things up, too.

I’m not sure about Nim being faster because it’s transpiled to C. The way I see it, it may be less performant than a proper compiler that can use the full semantic context when generating the intermediate representation for the backend. But maybe that doesn’t make a significant difference for Nim; I don’t know the language well enough.

I forgot to include Nim's speed on my machine; Nim is consistently 0.2 as well.

Machine: VPS, Ubuntu 24 LTS, x86-64, 1 GB RAM, 1-core CPU

@Redglyph Looking at the assembly on the Rust Playground, the Rust code I edited above still has many bounds checks and no auto-vectorization (sometimes, even without `target-cpu=native`, the compiler can auto-vectorize if the access patterns are SIMD-friendly).

They don't really like the term "transpiled", and they actually have good reasons: transpiled indicates a translation from one language to another at the same abstraction level, but Nim is a much higher-level language than the quite simple C. So Nim is compiled to C -- C is just a chosen backend, which has the advantage that it is very portable, so it can be used where LLVM is not available, and the intermediate code can be easily inspected. But Nim has had an unofficial LLVM backend as well.

But you are correct: just because a language is converted to C, it will not automatically be fast. But for Nim, the C backend matches the language well, so Nim is often as fast as the same code written in C.

Note, they are working on a new compiler generation, called Nimoy or Nimony, which might become Nim v3. There are some interesting developments going on, with a new intermediate code format, incremental compilation, and I think they can now even use CPS (continuation-passing style) to support async, or as an alternative to it.

But I left Nim three years ago, so my knowledge might be outdated.

That’s why I mentioned the inclusive range and the previous discussion. :slight_smile:

I know there are different opinions about the term. Personally, I don’t see any harm in using a more precise term, but a transpiler is just a subset of compiler, so it’s not incompatible, and it doesn’t have to mean the tool doesn’t do much to translate the code.

C is still a 3rd generation programming language, like Nim. For me, a compiler produces an output of the lowest level, like assembly (2nd gen) or machine code (1st gen), which doesn’t require another compiler. It takes a considerable amount of extra effort for the backend to transform and optimize the output, so the distinction is meaningful.

I wouldn’t call C a backend language either. C requires a fair amount of frontend work to parse it, so we still need a full compiler to produce the final output.

But it must already require a lot of work to produce an IR and interface with backends like LLVM, so I understand why they did it. As you said, it’s also easier to inspect the output.

That’s interesting to know, thanks for the information. I suppose it makes sense, at this point. I looked into the language a while ago (just a little), but, like you, I left that a few years ago. It has interesting features, and I like the resemblance to Python, but I find it a little messy in some other ways.

Updated interpretation of the assembly: there is no bounds check, but there is a check to determine the end of the loop, which is logically impossible to remove because it tells the CPU when to stop; using an iterator would still require this check. There is a capacity check in `Vec::push` (I do not know whether it can be removed using `Vec::with_capacity`, but that would make the code comparison no longer apples to apples), and a division-by-zero check.
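For reference, a sketch of what the `Vec::with_capacity` variant could look like (a hypothetical helper, not the benchmark code): reserving an upper bound up front means `push` never needs to reallocate mid-loop, which is where the `grow_one` call in the assembly comes from. Whether the remaining capacity branch is optimized out is up to the compiler.

```rust
// Sketch: pre-size the vector so `push` never has to grow inside the loop.
// `limit + 1` is a trivially safe upper bound on the number of primes.
fn collect_primes(is_prime: &[u8], limit: usize) -> Vec<usize> {
    let mut primes = Vec::with_capacity(limit + 1);
    for i in 2..=limit {
        if is_prime[i] == 1 {
            primes.push(i); // capacity is already reserved; no reallocation
        }
    }
    primes
}

fn main() {
    // Sieve table for 0..=10: indices 2, 3, 5, 7 are marked prime.
    let is_prime = [0u8, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0];
    assert_eq!(collect_primes(&is_prime, 10), vec![2, 3, 5, 7]);
    println!("ok");
}
```

As noted above, doing this would change the comparison, since the Nim version presumably doesn't pre-reserve either.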

CMIIW :>

The code

use std::time::Instant;

const SEGMENT_SIZE: usize = 1_000_000;
const MAX_N: usize = 50_000_000;

#[inline(never)]
fn simple_sieve(limit: usize) -> Vec<usize> {             
    let mut is_prime = vec![1u8; limit + 1];
    //if limit >= 0 { is_prime[0] = 0; }
    if limit >= 1 { is_prime[1] = 0; }
    
    let sqrt = (limit as f64).sqrt() as usize;
    
    for i in 2..=sqrt {
        if is_prime[i] == 1 {
            let mut j = i * i;          
            
            while j <= limit {
                is_prime[j] = 0;
                j += i;
            }                                                 
            
        }
    }

    let mut primes = Vec::new();                          
    for i in 2..=limit {
        if is_prime[i] == 1 {
            primes.push(i);
        }
    }
    primes
}

#[inline(never)]
fn segmented_sieve(start: usize, end: usize, base_primes: &[usize]) -> Vec<u8> {
    let size = end - start + 1;
    let mut segment = vec![1u8; size];

    for &p in base_primes {
        let p2 = p * p;
        if p2 > end {
            break;
        }

        let first_multiple = if p2 >= start {
            p2
        } else {
            let rem = start % p;
            if rem == 0 { start } else { start + (p - rem) }
        };

        let mut j = first_multiple;
        let offset = start;

        let end_limit = end;

        while j <= end_limit {
            unsafe {
                *segment.get_unchecked_mut(j - offset) = 0;
            }
            j += p;
        }
    }

    segment
}

#[inline(never)]
fn find_twin_primes(target: usize) -> f64 {
    let start_time = Instant::now();

    let base_limit = (MAX_N as f64).sqrt() as usize;
    let base_primes = simple_sieve(base_limit);

    let mut count = 0usize;
    let mut prev_prime = 3usize;
    let mut segment_start = 3usize;

    while segment_start <= MAX_N {
        let segment_end = (segment_start + SEGMENT_SIZE - 1).min(MAX_N);
        let segment = segmented_sieve(segment_start, segment_end, &base_primes);

        for i in 0..segment.len() {
            if segment[i] == 1 {
                let p = segment_start + i;
                if p - prev_prime == 2 {
                    count += 1;
                    if count == target {
                        return start_time.elapsed().as_secs_f64();
                    }
                }
                prev_prime = p;
            }
        }

        segment_start = segment_end + 1;
    }

    let time = start_time.elapsed().as_secs_f64();
    // println!("count : {}", count);
    time
}

fn main() {
    let elapsed = find_twin_primes(500_000);
    //println!("{}", elapsed);
    std::hint::black_box(elapsed);
}
**playground::simple_sieve:**
	push	rbp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	sub	rsp, 24
	mov	r14, rdi
	call	qword ptr [rip + __rustc::__rust_no_alloc_shim_is_unstable_v2@GOTPCREL]
	mov	edi, 7072
	mov	esi, 1
	call	qword ptr [rip + __rustc::__rust_alloc@GOTPCREL]
	test	rax, rax
	je	.LBB0_19
	mov	rbx, rax
	mov	edx, 7072
	mov	rdi, rax
	mov	esi, 1
	call	qword ptr [rip + memset@GOTPCREL]
	mov	byte ptr [rbx + 1], 0
	mov	eax, 2

.LBB0_2:
	lea	rcx, [rax + 1]
	cmp	rax, 84
	cmove	rcx, rax
	cmp	byte ptr [rbx + rax], 1
	jne	.LBB0_5
	mov	rdx, rax
	imul	rdx, rax

.LBB0_4:
	mov	byte ptr [rbx + rdx], 0
	add	rdx, rax
	cmp	rdx, 7072
	jb	.LBB0_4

.LBB0_5:
	cmp	rax, 84
	je	.LBB0_6
	mov	rax, rcx
	cmp	rcx, 84
	jbe	.LBB0_2

.LBB0_6:
	mov	qword ptr [rsp], 0
	mov	qword ptr [rsp + 8], 8
	mov	qword ptr [rsp + 16], 0
	mov	r13d, 2
	mov	eax, 8
	xor	r12d, r12d
	mov	r15, rsp

.LBB0_7:
	lea	rbp, [r13 + 1]
	cmp	r13, 7071
	cmove	rbp, r13
	cmp	byte ptr [rbx + r13], 1
	jne	.LBB0_12
	cmp	r12, qword ptr [rsp]
	jne	.LBB0_11
	mov	rdi, r15
	call	alloc::raw_vec::RawVec<T,A>::grow_one
	mov	rax, qword ptr [rsp + 8]

.LBB0_11:
	mov	qword ptr [rax + 8*r12], r13
	inc	r12
	mov	qword ptr [rsp + 16], r12

.LBB0_12:
	cmp	r13, 7071
	je	.LBB0_14
	mov	r13, rbp
	cmp	rbp, 7071
	jbe	.LBB0_7

.LBB0_14:
	mov	rax, qword ptr [rsp + 16]
	mov	qword ptr [r14 + 16], rax
	movups	xmm0, xmmword ptr [rsp]
	movups	xmmword ptr [r14], xmm0
	mov	esi, 7072
	mov	edx, 1
	mov	rdi, rbx
	add	rsp, 24
	pop	rbx
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	pop	rbp
	jmp	qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]

.LBB0_19:
	mov	edi, 1
	mov	esi, 7072
	call	qword ptr [rip + alloc::raw_vec::handle_error@GOTPCREL]
	mov	r14, rax
	mov	rsi, qword ptr [rsp]
	test	rsi, rsi
	je	.LBB0_17
	mov	rdi, qword ptr [rsp + 8]
	shl	rsi, 3
	mov	edx, 8
	call	qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]

.LBB0_17:
	mov	esi, 7072
	mov	edx, 1
	mov	rdi, rbx
	call	qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]
	mov	rdi, r14
	call	_Unwind_Resume@PLT

**playground::segmented_sieve:**
	push	rbp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	push	rax
	mov	rbx, rdx
	sub	rbx, rsi
	inc	rbx
	jns	.LBB1_3
	xor	r15d, r15d

.LBB1_2:
	mov	rdi, r15
	mov	rsi, rbx
	call	qword ptr [rip + alloc::raw_vec::handle_error@GOTPCREL]

.LBB1_3:
	mov	r14, r8
	mov	r12, rcx
	mov	r13, rdx
	mov	rbp, rsi
	mov	qword ptr [rsp], rdi
	je	.LBB1_4
	call	qword ptr [rip + __rustc::__rust_no_alloc_shim_is_unstable_v2@GOTPCREL]
	mov	r15d, 1
	mov	esi, 1
	mov	rdi, rbx
	call	qword ptr [rip + __rustc::__rust_alloc@GOTPCREL]
	test	rax, rax
	je	.LBB1_2
	mov	r15, rax
	jmp	.LBB1_7

.LBB1_4:
	mov	r15d, 1

.LBB1_7:
	mov	rdi, r15
	mov	esi, 1
	mov	rdx, rbx
	call	qword ptr [rip + memset@GOTPCREL]
	test	r14, r14
	je	.LBB1_24
	lea	rcx, [r12 + 8*r14]
	mov	rsi, r15
	sub	rsi, rbp

.LBB1_10:
	mov	rdi, qword ptr [r12]
	mov	rax, rdi
	imul	rax, rdi
	cmp	rax, r13
	ja	.LBB1_24
	cmp	rax, rbp
	jae	.LBB1_17
	test	rdi, rdi
	je	.LBB1_22
	mov	rdx, rbp
	cmp	rbp, rdi
	jb	.LBB1_15
	mov	eax, ebp
	xor	edx, edx
	div	edi

.LBB1_15:
	mov	rax, rbp
	test	rdx, rdx
	je	.LBB1_17
	lea	rax, [rdi + rbp]
	sub	rax, rdx
	jmp	.LBB1_17

.LBB1_18:
	mov	byte ptr [rsi + rax], 0
	add	rax, rdi

.LBB1_17:
	cmp	rax, r13
	jbe	.LBB1_18
	add	r12, 8
	cmp	r12, rcx
	jne	.LBB1_10

.LBB1_24:
	mov	rax, qword ptr [rsp]
	mov	qword ptr [rax], rbx
	mov	qword ptr [rax + 8], r15
	mov	qword ptr [rax + 16], rbx
	add	rsp, 8
	pop	rbx
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	pop	rbp
	ret

.LBB1_22:
	lea	rdi, [rip + .Lanon.1d992ed0de9679575ee15502e1faffa0.1]
	call	qword ptr [rip + core::panicking::panic_const::panic_const_rem_by_zero@GOTPCREL]
	ud2
	mov	r14, rax
	test	rbx, rbx
	je	.LBB1_21
	mov	edx, 1
	mov	rdi, r15
	mov	rsi, rbx
	call	qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]

.LBB1_21:
	mov	rdi, r14
	call	_Unwind_Resume@PLT

.LCPI2_0:
	.long	1127219200
	.long	1160773632
	.long	0
	.long	0

.LCPI2_1:
	.quad	0x4330000000000000
	.quad	0x4530000000000000

.LCPI2_2:
	.quad	0x41cdcd6500000000

**playground::find_twin_primes:**
	push	rbp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	sub	rsp, 88
	call	qword ptr [rip + std::time::Instant::now@GOTPCREL]
	mov	qword ptr [rsp + 48], rax
	mov	dword ptr [rsp + 56], edx
	lea	rdi, [rsp + 64]
	call	playground::simple_sieve
	mov	rax, qword ptr [rsp + 72]
	mov	qword ptr [rsp + 8], rax
	mov	r14, qword ptr [rsp + 80]
	mov	r15d, 3
	xor	ebp, ebp
	mov	r12d, 3
	jmp	.LBB2_1

.LBB2_14:
	add	rbx, 1000000
	cmp	r12, 49000001
	mov	r12, rbx
	jae	.LBB2_15

.LBB2_1:
	cmp	r12, 49000001
	mov	ebx, 49000001
	cmovb	rbx, r12
	lea	rdx, [rbx + 999999]
	lea	rdi, [rsp + 16]
	mov	rsi, r12
	mov	rcx, qword ptr [rsp + 8]
	mov	r8, r14
	call	playground::segmented_sieve
	mov	rax, qword ptr [rsp + 32]
	test	rax, rax
	je	.LBB2_12
	mov	r13, qword ptr [rsp + 24]
	mov	rdx, r15
	xor	ecx, ecx
	jmp	.LBB2_4

.LBB2_5:
	mov	r15, rdx

.LBB2_23:
	inc	rcx
	mov	rdx, r15
	cmp	rax, rcx
	je	.LBB2_12

.LBB2_4:
	cmp	byte ptr [r13 + rcx], 1
	jne	.LBB2_5
	lea	r15, [r12 + rcx]
	mov	rsi, r15
	sub	rsi, rdx
	cmp	rsi, 2
	jne	.LBB2_23
	inc	rbp
	cmp	rbp, 500000
	jne	.LBB2_23
	jmp	.LBB2_25

.LBB2_12:
	mov	rsi, qword ptr [rsp + 16]
	test	rsi, rsi
	je	.LBB2_14
	mov	rdi, qword ptr [rsp + 24]
	mov	edx, 1
	call	qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]
	jmp	.LBB2_14

.LBB2_25:
	lea	rdi, [rsp + 48]
	call	qword ptr [rip + std::time::Instant::elapsed@GOTPCREL]
	mov	r14, rax
	mov	ebp, edx
	mov	rsi, qword ptr [rsp + 16]
	test	rsi, rsi
	je	.LBB2_28
	mov	edx, 1
	mov	rdi, r13
	call	qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]

.LBB2_28:
	movq	xmm2, r14
	punpckldq	xmm2, xmmword ptr [rip + .LCPI2_0]
	subpd	xmm2, xmmword ptr [rip + .LCPI2_1]
	movapd	xmm1, xmm2
	unpckhpd	xmm1, xmm2
	cvtsi2sd	xmm0, ebp
	jmp	.LBB2_17

.LBB2_15:
	lea	rdi, [rsp + 48]
	call	qword ptr [rip + std::time::Instant::elapsed@GOTPCREL]
	movq	xmm2, rax
	punpckldq	xmm2, xmmword ptr [rip + .LCPI2_0]
	subpd	xmm2, xmmword ptr [rip + .LCPI2_1]
	movapd	xmm1, xmm2
	unpckhpd	xmm1, xmm2
	cvtsi2sd	xmm0, edx

.LBB2_17:
	divsd	xmm0, qword ptr [rip + .LCPI2_2]
	addsd	xmm1, xmm2
	addsd	xmm0, xmm1
	mov	rsi, qword ptr [rsp + 64]
	test	rsi, rsi
	je	.LBB2_19
	movsd	qword ptr [rsp + 40], xmm0
	shl	rsi, 3
	mov	edx, 8
	mov	rdi, qword ptr [rsp + 8]
	call	qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]
	movsd	xmm0, qword ptr [rsp + 40]

.LBB2_19:
	add	rsp, 88
	pop	rbx
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	pop	rbp
	ret
	jmp	.LBB2_7
	mov	r14, rax
	mov	rsi, qword ptr [rsp + 16]
	test	rsi, rsi
	je	.LBB2_8
	mov	edx, 1
	mov	rdi, r13
	call	qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]
	jmp	.LBB2_8

.LBB2_7:
	mov	r14, rax

.LBB2_8:
	mov	rsi, qword ptr [rsp + 64]
	test	rsi, rsi
	je	.LBB2_10
	shl	rsi, 3
	mov	edx, 8
	mov	rdi, qword ptr [rsp + 8]
	call	qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]

.LBB2_10:
	mov	rdi, r14
	call	_Unwind_Resume@PLT

**playground::main:**
	push	rax
	call	playground::find_twin_primes
	movsd	qword ptr [rsp], xmm0
	mov	rax, rsp
	#APP
	#NO_APP
	pop	rax
	ret

Besides making the code apples to apples, an optimization I have in mind is buffer reuse, because the hot loop currently allocates a new vector every time. Additionally, I'm thinking of restructuring the overall code to be friendly to auto-vectorization and loop unrolling.
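The buffer-reuse idea could look like this (a sketch reworking the `segmented_sieve` above; the `_into` name is made up): instead of returning a fresh `Vec<u8>` per segment, the caller passes one buffer in, and `clear` + `resize` reuse its allocation across iterations.

```rust
// Reuse one buffer across segments instead of allocating per call.
fn segmented_sieve_into(start: usize, end: usize, base_primes: &[usize], segment: &mut Vec<u8>) {
    segment.clear();
    segment.resize(end - start + 1, 1); // reuses the existing allocation once it is big enough

    for &p in base_primes {
        let p2 = p * p;
        if p2 > end {
            break;
        }
        // First multiple of p inside [start, end].
        let mut j = if p2 >= start { p2 } else { ((start + p - 1) / p) * p };
        while j <= end {
            segment[j - start] = 0;
            j += p;
        }
    }
}

fn main() {
    let mut buf = Vec::new();
    // Sieve [10, 20] by the base primes 2 and 3.
    segmented_sieve_into(10, 20, &[2, 3], &mut buf);
    let survivors: Vec<usize> = buf
        .iter()
        .enumerate()
        .filter(|&(_, &b)| b == 1)
        .map(|(i, _)| 10 + i)
        .collect();
    assert_eq!(survivors, vec![11, 13, 17, 19]);
    println!("ok");
}
```

After the first segment, the only per-segment cost is the `memset` from `resize`, which the original version pays anyway inside `vec![1u8; size]`.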

A more important factor to consider is getting the best performance without spending much effort checking where the code is slow.
It clearly means that, as a beginner, it is difficult to write high-performance code in Rust; I need to learn Rust with dedication to make use of its performance.

Hmm... When I first discovered Rust, of course I just had to implement some of my old little C and C++ programs in it, just to see if the performance was acceptable. With my beginner's Rust, and without even trying, I was happy to see that Rust easily approached the speeds I was getting in C/C++.