A Friendly Challenge for Rust

OK, I figured out the vector initialization stuff; docs need to be corrected|updated.

Here's the Rust nextp_init version, with a test suite.
So now I have translated and tested modInv, sozpg, and nextp_init.

#![allow(non_upper_case_globals, dead_code)]

// Global variables
static modpg: i32 = 30;
static rescnt: i32 = 8;
static residues: [i32; 8] = [7, 11, 13, 17, 19, 23, 29, 31];
static resinvrs: [i32; 8] = [13, 11, 7, 23, 19, 17, 29, 1];
static pos: [i32; 30] = [0,0,0,0,0,0,0,0,0,1,0,2,0,0,0,3,0,4,0,0,0,5,0,0,0,0,0,6,0,7];
static primes: [i32; 6] = [7, 11, 13, 17, 19, 23];
static restwins: [i32; 3] = [13, 19, 31];

fn nextp_init(indx: i32, prims: Vec<i32>) -> Vec<u64> {
  // Init|return 'nextp' array for given twin pair at 'indx' in 'restwins'.
  // Set each row[j] w/1st prime multiple resgroup for each prime r1..sqrt(N).
  let pcnt = prims.len();                // size for each row in nextp
  let mut nextp = vec![0; pcnt * 2];     // 1st mults array for this twin pair
  let r_hi = restwins[indx as usize];    // upper twin pair residue value
  let r_lo = r_hi - 2;                   // lower twin pair residue value
  let (row_lo, row_hi) = (0, pcnt);      // nextp addr for lo|hi twin pair
  for (j, prime) in prims.iter().enumerate() { // for each prime r1..sqrt(N)
    let k = (prime - 2) / modpg;         // find the resgroup it's in
    let r = (prime - 2) % modpg + 2;     // and its residue value
    let r_inv = resinvrs[pos[(r - 2) as usize] as usize]; // and residue inverse
    let mut ri = (r_lo * r_inv - 2) % modpg + 2; // compute the ri for r
    nextp[row_lo + j] = (k * (prime + ri) + (r * ri - 2) / modpg) as u64;
    ri = (r_hi * r_inv - 2) % modpg + 2;         // compute the ri for r
    nextp[row_hi + j] = (k * (prime + ri) + (r * ri - 2) / modpg) as u64;
  }
  nextp
}

fn main() {
  println!("{:?}", nextp_init(0, primes.to_vec()) );
  println!("{:?}", nextp_init(1, primes.to_vec()) );
  println!("{:?}", nextp_init(2, primes.to_vec()) );
}

And the output:

$ ./nextp_init  
[5, 11, 7, 7, 18, 5, 4, 8, 13, 16, 4, 8]
[2, 2, 12, 17, 14, 14, 1, 10, 5, 9, 19, 17]
[3, 6, 9, 3, 6, 9, 2, 3, 2, 12, 11, 12]

I have a working version of the code: https://github.com/VladMoldoveanu/SSoZ
It's not commented or optimized at all; it's just a translation of the Nim code.

Also, I noticed that running the Nim program with input 100 gives the largest twin prime pair as (59, 61) instead of (71, 73). My program inherits that behavior too. When run on input 300 it gives the right answer, so it might be an uncovered special/limit case.

Hey great code! I am really learning a lot of Rust from it.

I caught one error (omission) in precalculate.rs.

This:

    while j < res_cnt - 1 {
        if residues[j] + 2 == residues[j + 1] {
            res_twins.push(residues[j + 1]);
        }
        j += 1;
    }

should be this:

    while j < res_cnt - 1 {
        if residues[j] + 2 == residues[j + 1] {
            res_twins.push(residues[j + 1]);
            j += 1;
        }
        j += 1;
    }

See if that fixes the error.

Thanks, I corrected that, but it didn't change anything. I ran into the same problem when running your Nim code (for input 100), so it might not be from my implementation.
I've updated the structure of the code (removed the struct and made all variables local to the function) and commented it. I hope it's easier to understand while I track down the remaining bugs.

Also, I discovered another problem. The number of twin primes counted below 10^9 is 3486938 instead of 3424506 (in my code).

As for performance, it runs 6 times slower than the Nim code. I hope that solving the bug will help. Also, I'll need to check that locks are not overused.
EDIT: I looked at the performance results from the article; after compiling and running on my laptop, it seems that my solution is only 1.5-2 times slower.

Oh man, I must have messed up the final code when cleaning it up. Older versions work correctly for those values. I know where the problem is (it's at the end of twins_sieve). I won't have time to debug it till later in the day. Thanks for the catch though.

To check the count results I use primesieve (primesieve.org).

I was wondering, though, why you used i128 for some variables. I wrote the code to cover values within 64 bits (which progressively take more time). Of course conceptually you can go as high as you want in number space, but doing that in real space-time is another discussion. :slightly_smiling_face:

I'm going to finish my version of the implementation for my own learning|curiosity. Then we'll have two versions to compare against.

Tweaking twins_sieve is the best place to achieve speedups. You may want to compare using a byte vector versus a bitvector; you can set up a separate test suite for twins_sieve just to do that.
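Something like this could serve as a starting point for that comparison; a minimal, self-contained sketch, where the segment size and prime strides are made-up stand-ins for twins_sieve's real state, and the checksum keeps the compiler from optimizing the marking loops away:

use std::time::Instant;

fn main() {
    let kn = 1_000_000;                                  // stand-in segment size
    let strides = [7usize, 11, 13, 17, 19, 23, 29, 31]; // stand-in prime strides

    // Byte vector: one byte per resgroup, marked with a plain store.
    let mut seg = vec![0u8; kn];
    let t = Instant::now();
    for &p in &strides {
        let mut k = 0;
        while k < kn { seg[k] = 1; k += p; }
    }
    let byte_time = t.elapsed();

    // Bit vector: one bit per resgroup, marked with a shift and or.
    let mut bits = vec![0u64; (kn + 63) / 64];
    let t = Instant::now();
    for &p in &strides {
        let mut k = 0;
        while k < kn { bits[k >> 6] |= 1u64 << (k & 63); k += p; }
    }
    let bit_time = t.elapsed();

    let check: u64 = seg.iter().map(|&b| b as u64).sum::<u64>()
                   + bits.iter().map(|w| w.count_ones() as u64).sum::<u64>();
    println!("bytes: {:?}, bits: {:?} (check {})", byte_time, bit_time, check);
}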

I'll get back when I find the bug(s).

I will be away for a few days, but until then:

I managed to find my bug and now my code gives the same result as the Nim source.
I also applied a first batch of optimizations. The only function that needs them is twin_sieve; the others take far less time, so optimizations there don't make any real difference.

For performance I added this flag for the compiler: RUSTFLAGS="-C target-cpu=native", and I get a time somewhere between primesieve and the Nim version.

I have no idea why I used i128. I was probably too tired and tried to be sure there would be no errors (you initialized an indexing variable with -1). Now all the data types are usize, which is the same size as u64 on all 64-bit platforms. I'm not sure how usize works internally, or whether only the variables meant to be indices should have this type.

The bitvector appears to be better only in unoptimized code, so I'm back to a byte/bool vector.

Re-included the link to the repo: https://github.com/VladMoldoveanu/SSoZ

OK, found the bug (of omission). I had forgotten a flag in the code logic. The sums were always correct, but the last twin value could be wrong; now it's fixed. I ended up coding 4 variations to see if there were any discernible speed differences, but cursory tests didn't show any. The Nim version below is the shortest, and probably clearest, version of the fix at the end of twins_sieve.

   hi_tp = hi_tp * modpg.uint + r_hi    # numerate final seg largest twin prime
                                        # see if largest tp is inside range
   if hi_tp > num:                      # if outside, find sum|val that's inside
     var prev = true                    # flag to use last tp from previous seg
     for k in 0..upk:                   # count down from upk resgroup to '0'
       if seg[upk - k].int == 0:        # if twin prime at seg resgroup address
         if hi_tp <= num.uint: prev = false; break # keep if in range, flip flag
         sum.dec                        # else reduce sum for too large twin
       hi_tp -= modpg.uint              # then check next lower twin pair hi val
                                        # keep from prev seg if none in range
     if prev: hi_tp = if r_hi > num: 0'u else: k_hi * modpg.uint + r_hi
   lastwins[indx] = hi_tp               # store un|adjusted final seg tp value
   cnts[indx] = sum                     # store correct|ed sum for twin pair

Also, I notice you use a lot of mutex|locks in your code, which I think are unnecessary.

There are no race conditions between threads, as they generate|contain their own memory and data. The only shared arrays that are written to are cnts and lastwins, but each thread writes to its own index|location in them. I also saw some Rust example code that uses spawn (as in Nim), which seemed a lot simpler for parallel processing.

After I do more tests of the corrected Nim code, I'll revise|release the updated paper, and a separate code gist.

Your code looks much cleaner|simpler and less noisy now (most of those [x as usize] are gone).

I'll fix the code as soon as I'm back.
The mutexes on cnts and lastwins are necessary because Rust cannot know what you do with them. I lock them only at the end of the function, so the delay should be as small as possible.

If you were talking about Arc, it is necessary in order to send references to the data to the threads. From what I know, the compiler removes those extra structures at compile time.

ThreadPool has an easy interface and makes it easy to join the threads. If you prefer spawn, go for it.

Are these indexes known before spawning the threads? If so, there is a way to safely share slices/indexes of an array in Rust between multiple threads without requiring a mutex: the split_mut family of slice methods, for example. Rayon's parallel iterators could also be useful here. See the sketch below.
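For instance, a minimal sketch along those lines (iter_mut is used here for brevity; split_at_mut or chunks_mut generalize the idea to larger slices, and std::thread::scope needs Rust 1.63+, with crossbeam's scoped threads as the older alternative):

use std::thread;

fn main() {
    let pairs_cnt = 4;                       // hypothetical number of twin pair indexes
    let mut cnts = vec![0usize; pairs_cnt];
    let mut lastwins = vec![0usize; pairs_cnt];

    thread::scope(|s| {
        // iter_mut hands each thread a disjoint &mut, so no Mutex is needed
        for (indx, (cnt, last)) in cnts.iter_mut().zip(lastwins.iter_mut()).enumerate() {
            s.spawn(move || {
                // stand-in for the real per-pair sieve work
                *cnt = indx + 10;            // this thread's twin count
                *last = indx + 100;          // this thread's largest twin value
            });
        }
    });

    println!("{:?} {:?}", cnts, lastwins);
}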

Yes, the number of indexes is a known immutable value determined at compile time for each prime generator, with a specific prime generator selected at run time.

In the Nim code pairscnt is the number of indexes to iterate over. I just updated the Nim code in the paper, which is below (I will release an updated revision of the paper soon too).

I tried different variations of those, but saw no relevant time changes. I ended up using standard thread::spawn with an mpsc channel for the results, in order to get rid of the Mutex, ThreadPool, and num_cpus. In outline it looks like the sketch below.
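Here is a stripped-down sketch of that arrangement, with placeholder work standing in for the real sieve calls (not the repo's actual code):

use std::sync::mpsc;
use std::thread;

fn main() {
    let pairs_cnt = 4;                        // hypothetical number of twin pair indexes
    let (tx, rx) = mpsc::channel();

    for indx in 0..pairs_cnt {
        let tx = tx.clone();
        thread::spawn(move || {
            // stand-in for twin_sieve(...) returning (hi_tp, sum)
            let result = (indx * 100, indx * 10);
            tx.send((indx, result)).unwrap();
        });
    }
    drop(tx);                                 // close the channel so the receive loop ends

    let (mut cnts, mut lastwins) = (vec![0usize; pairs_cnt], vec![0usize; pairs_cnt]);
    for (indx, (hi_tp, sum)) in rx {          // collect results as threads finish
        lastwins[indx] = hi_tp;
        cnts[indx] = sum;
    }
    println!("{:?} {:?}", cnts, lastwins);
}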

The only place performance changes can be made is inside twin_sieve, more specifically here:

for (i, prime) in primes.iter().enumerate() {
    // mark multiples of prime for the lower twin residue
    let mut k = next_p[i];
    while k < kn {
        seg[k] |= 1;
        k += prime;
    }
    next_p[i] = k - kn;          // first multiple's resgroup in the next segment
    // mark multiples of prime for the upper twin residue
    k = next_p[p_cnt + i];
    while k < kn {
        seg[k] |= 1;
        k += prime;
    }
    next_p[i + p_cnt] = k - kn;
}
// count the resgroups left unmarked: these hold the twin primes
let mut cnt = 0usize;
seg.iter().take(kn).for_each(|&x| if x == 0 {cnt += 1});

This takes most of the computation time, so any small change affects the running time for large numbers. It is similar to the usual sieve, so some of the optimizations in primesieve can theoretically be implemented here too.
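One small, behavior-preserving rewrite worth benchmarking here is the counting pass as a filter/count chain, which the compiler typically reduces to the same loop:

let cnt = seg[..kn].iter().filter(|&&x| x == 0).count();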

Also, I changed the laptop I work on to one with an older but more powerful CPU (an i7-4710HQ, from an i3-6006U), and to Windows 10 from Ubuntu 18.04. Now I get the same or a slightly smaller time than the Nim solution, without the flag I mentioned previously.

Hey welcome back Vlad.

I don't know if you noticed the other change I made, but changing these two lines in your twin_sieve may make it a smidgen faster.

let (mut sum, mut ki, kn) = (0usize, 0usize, kb);
...
if kb > (k_max - ki) {kn = k_max - ki;}

This does a little less work inside the loop, since kn only needs to change for the last slice, so we don't have to keep setting it to kb for every slice before then.

Conceptualizing the inner loop work
The ideal conceptualization of the segment array is a bitarray, but implementing that on a byte-addressable machine will always be slower than using a native byte array. But how do we do better than a straight byte array?

In Forth, I can create a chunk of memory and address it however I want. So instead of using bytes we can (again conceptually) create the seg array as a contiguous number of 64-bit memory locations (aligned on 64-bit boundaries) which can be addressed as bytes, et al.

It may then be possible to create an inner loop function that, given (k, prime, kn), constructs lego blocks of repeating bit patterns to write to each 64-bit word efficiently.

To count the [non]primes in the segment we can use something like popcount on each 64-bit word. This is the approach used in primesieve in places.
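As a rough sketch of that counting step, assuming the segment were kept as packed 64-bit words with set bits marking composites (count_ones() compiles down to a popcount instruction when the target supports it):

// Count unmarked resgroups (candidate twin primes) in a bit-packed segment.
// 'words' holds kn valid bits, one per resgroup; a 1 bit marks a composite.
fn count_unmarked(words: &[u64], kn: usize) -> usize {
    debug_assert_eq!(words.len(), (kn + 63) / 64);
    let mut zeros = 0;
    for (i, &w) in words.iter().enumerate() {
        // number of valid bits in this word (the last word may be partial)
        let bits = if (i + 1) * 64 <= kn { 64 } else { kn - i * 64 };
        let mask = if bits == 64 { !0u64 } else { (1u64 << bits) - 1 };
        zeros += bits - (w & mask).count_ones() as usize;
    }
    zeros
}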

Also, primesieve uses a much more elaborate caching and segmentation scheme. I presently don't understand it enough to assess how to possibly use it.

However, Rayon's Sieve of Eratosthenes (SoE) demo uses something like it, which I will try to learn from.

https://github.com/rayon-rs/rayon/blob/master/rayon-demo/src/sieve/mod.rs

So now that you've got a correctly working Rust reference, the rest is implementation optimizations. A next step is an OpenCL|CUDA translation, which I imagine will mostly pertain to the parallelization code(?). Of course that should fly, especially as the inputs get larger, and would take greater advantage of a GPU's cores.

So at this point, we've established the superior efficacy of prime generator math over an SoE, et al, and the work going forward is learning how to maybe rearchitect|implement it better. (A hardware version would be the ultimate implementation.)

When you get some time, can you post benchmarks between the Nim and Rust versions on the platforms you use (with language|compiler versions, et al)? I'm really interested in how the code performs on non-Intel systems (AMD, ARM, PowerPC, etc.), especially on newer systems with faster processors|memory, as I only have a circa 2016 i7 laptop with 16GB.

I also want to thank you for accepting my friendly challenge, as I (and I'm sure all the others who have been following this thread) have learned a lot of real world Rust coding from it.

Happy to be back and glad I could help!

It's possible to do something similar in unsafe Rust.
Another feature that might be really useful is inline assembly (the documentation might be outdated, but I couldn't find anything more recent). An easy way to see the assembly code after compiling is on this site.

Having seg be a Vec instead of a slice/array might bring some extra calculations too. I'll see if I can find an easy change, since arrays don't have a dynamic size. I might try a chunk implementation to see what speedup I get.

I'll check if there is an easy way to jump from this implementation to an OpenCL|CUDA one. I'm going back to uni soon and I won't have a lot of free time. Sorry I can't help more.

I'll post the benchmarks on GitHub as soon as I'm happy with the level of optimization. I'll see if I can find an ARM processor for the benchmark, but I can't guarantee it; I don't have any other type of CPU.

Hey Vlad, here's a cleanup|simplification.

I took your code for printing out times and made it a function, to reduce the source code noise. Notice I made the output seconds, to match the Nim code.

use std::time::SystemTime;

fn print_time(title: &str, time: SystemTime) {
    print!("{} = ", title);
    println!("{} secs", {
        match time.elapsed() {
            Ok(elapsed) => {
                ((elapsed.as_secs() * 1000) as f64 + elapsed.subsec_nanos() as f64 / 1_000_000f64)/1000.0},
            Err(e) => {panic!("Timer error {:?}", e)},
        }
    });
}

The source code is now shorter, and much easier to follow, because now you can do this:

  let ts = SystemTime::now();      // start timing sieve setup
  .....
  .....
  print_time("setup time", ts);    // sieve setup time

  let t1 = SystemTime::now();      // start timing ssoz sieve
  ....
  ....

  print_time("sieve time", t1);    // ssoz sieve time
  print_time("total time", ts);    // sum of sieve and ssoz time
  println!("last segment = {} resgroups; segment slices = {}", Kn, (Kmax - 1)/KB + 1);
  println!("total twins = {}; last twin = {}+/-1", twincnts, last_twin - 1,);

EDIT: A little shorter.

fn print_time(title: &str, time: SystemTime) {
    print!("{} = ", title);
    println!("{} secs", {
        match time.elapsed() {
            Ok(elapsed) => {
                elapsed.as_secs() as f64 + 
                elapsed.subsec_nanos() as f64 / 1_000_000_000f64},
            Err(e) => {panic!("Timer error {:?}", e)},
        }
    });
}
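As an aside, newer Rust (1.38+) has Duration::as_secs_f64, which collapses that arithmetic into a single call; a minimal variant:

fn print_time(title: &str, time: SystemTime) {
    match time.elapsed() {
        Ok(elapsed) => println!("{} = {} secs", title, elapsed.as_secs_f64()),
        Err(e) => panic!("Timer error {:?}", e),
    }
}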

Thanks, but I changed the way I measure the time for easier benchmarks:

fn time(f: fn(usize) -> (usize, usize), x: usize) -> u32 {
    let now = SystemTime::now();
    let _res = f(x);

    match now.elapsed() {
        Ok(e) => {e.subsec_micros() + e.as_secs() as u32 * 1_000_000},
        Err(e) => {panic!("Timer error {:?}", e)},
    }
}

let time_elapsed: u32 = time(largest_twin_prime_before, max);

I also uploaded the benchmarks from my two laptops in the README.md on GitHub.

Unfortunately, I have to start revising for uni and I'll be pretty busy from now on. I'll keep an eye on this forum and wish you the best!

After resting from coding awhile, and coming back with a fresher mind, I realized I could ensure the twin pair values are always <= num by reducing k_max for a thread if the twin pair value of its last resgroup was > num. By doing that, I could eliminate all the code at the end of twin_sieve that did the backtracking to find the largest tp val|sum <= num. This creates a nice code reduction and simplification, and probably makes it a little bit faster too.

fn twin_sieve(k_max: usize, kb: usize, r_hi: usize, modpg: usize, num: usize, p_cnt: usize,
              primes: Arc<Vec<usize>>, res_inv: Arc<Vec<usize>>, pos: Arc<Vec<usize>>) -> (usize, usize) {
    // Initialize variables
    let (mut sum, mut ki, mut kn) = (0usize, 0usize, kb);
    let (mut hi_tp, mut upk, mut k_max) = (0usize, 0usize, k_max);
    let mut seg = vec![0u8; kb];
    let mut next_p = next_p_init(r_hi, modpg, primes.clone(), p_cnt, res_inv.clone(), pos.clone());
    if (k_max - 1) * modpg + r_hi > num { k_max -= 1 };

    // Consider all resgroup size slices up to k_max
    while ki < k_max {
        if kb > (k_max - ki) {kn = k_max - ki;}
        unsafe {memzero(seg.as_mut_ptr(), kn);}

        // For each prime, mark the multiples of the twin pair
        for (i, &prime) in primes.iter().enumerate() {
            // lower twin
            let mut k = next_p[i];
            while k < kn {
                seg[k] |= 1;
                k += prime;
            }
            next_p[i] = k - kn;

            // higher twin
            k = next_p[p_cnt + i];
            while k < kn {
                seg[k] |= 1;
                k += prime;
            }
            next_p[i + p_cnt] = k - kn;
        }

        // count the number of twins found
        let mut cnt = 0usize;
        seg.iter().take(kn).for_each(|&x| if x == 0 {cnt += 1});

        // Add the number of twin primes found to the local counter 'sum'
        if cnt > 0 {
            sum += cnt;
            // Save the location of the largest prime
            for k in 1..=kn {
                if seg[kn - k] == 0 {
                    upk = kn - k;
                    break;
                }
            }
            hi_tp = ki + upk;
        }
        ki += kb;
    }
    hi_tp = if r_hi > num {0usize} else {hi_tp * modpg + r_hi};
    (hi_tp, sum)
}