Need help writing an infinite prime iterator

I'm trying to write an iterator that generates primes forever, but I'm running into stack overflow and performance issues.

Here's what I have:

fn primes() -> Sieve {
    Sieve {
        iter: Some(Box::new(2..)),
    }
}

struct Sieve {
    iter: Option<Box<dyn Iterator<Item = u32>>>,
}

impl Iterator for Sieve {
    type Item = u32;

    fn next(&mut self) -> Option<Self::Item> {
        // Option used to take ownership out of &mut Self
        let mut iter = self.iter.take()?;
        let prime = iter.next()?;
        self.iter = Some(Box::new(iter.filter(move |n| n % prime != 0)));
        Some(prime)
    }
}

#[test]
fn generate_first_10() {
    let first_10: Vec<u32> = primes().take(10).collect();
    assert_eq!(first_10, vec![2, 3, 5, 7, 11, 13, 17, 19, 23, 29]);
}

Here's the "Python pseudo code", it's essentially the Sieve of Eratosthenes, but it's powered by an infinite number generator instead of a fixed Vec, and each number gets filtered away if it's divisible by a past prime:

import itertools

def sieve():
    yield from _sieve(itertools.count(2))

def _sieve(numbers):
    p = next(numbers)
    yield p
    yield from _sieve(n for n in numbers if n % p != 0)

The Python code is obviously recursive and hits the recursion limit very quickly (right after 3557). The Rust code isn't recursive, though, and for the life of me I can't figure out why I'm getting a stack overflow panic when running for p in primes() { println!("{}", p); } (curiously, where it gives up on life varies, but it's always in the 338000s on my Linux machine).

The other issue is performance: a sloppily written Vec-based sieve implementation is over 2000x faster at generating primes up to 338000. I have a feeling it's the many Box::new() allocations, but I don't know how to start profiling this. Could those allocations be pre-allocated/amortized somehow? Is there a way to use generics/monomorphization? Is there some unsafe magic I could be using? Are the unreleased generators what I'm looking for? I don't want to give up on this idea just yet.

I'm a little stuck; any insights would be very helpful :)

FYI, I'm intending to use this for competitive programming (great for lazily finding prime factors, for example), so I can't use libraries, but I don't mind using unsafe code.

The Rust code isn't obviously recursive, but it's still recursive.

Specifically, on each iteration you're wrapping iter in another layer of filter, which means that each time you get a prime, producing the next one has to recurse through one more layer of filtering.

If you want to avoid the recursion without changing the algorithm, store a Vec of primes directly, and then check that each candidate isn't divisible by any prime in that list.

However, it's worth noting that this isn't a sieve! It's just trial division in a nice functional presentation. This is likely why your fixed sieve is doing so much better, on top of the indirection penalties. For an example unbounded lazy sieve, see this implementation in Swift, this implementation in Kotlin, or this wheeled implementation in Python.

But it's also worth noting that the biggest bottleneck in the plain iteration is not actually that it's doing trial division, but that it's doing so naively; if you only add a divisor candidate once you reach its square, that basically gets you a true unbounded sieve.
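As a rough, untested sketch of those two suggestions, it might look something like this: an iterator that keeps a Vec of the primes found so far and trial-divides each candidate only by primes whose square doesn't exceed it.

struct TrialDivision {
    primes: Vec<u32>,
    candidate: u32,
}

impl TrialDivision {
    fn new() -> Self {
        TrialDivision { primes: Vec::new(), candidate: 2 }
    }
}

impl Iterator for TrialDivision {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        loop {
            let n = self.candidate;
            self.candidate = self.candidate.checked_add(1)?;
            let is_prime = self
                .primes
                .iter()
                // widen to u64 so p * p can't overflow near u32::MAX
                .take_while(|&&p| (p as u64) * (p as u64) <= n as u64)
                .all(|&p| n % p != 0);
            if is_prime {
                self.primes.push(n);
                return Some(n);
            }
        }
    }
}

Because the state is a plain Vec rather than a tower of nested closures, there's no recursion and no stack overflow, and the square cutoff avoids most of the wasted divisions.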


An approach I've used before is to use a fixed sieve to generate a Vec of primes up to some reasonable limit (which can be computed up front quite fast), and then have an iterator which

  • runs through the cached primes, then
  • uses Miller-Rabin with some suitable set of bases after that

I suspect there's a point where extending a Vec of known primes on demand is faster with Miller-Rabin than sieving (but I wasn't extending my cache, so I never measured). You'll probably want Miller-Rabin in your bag of tricks anyway, so you can quickly test primality without generating all lesser primes when that makes sense.
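For the first bullet, the up-front cache could come from an ordinary bounded Sieve of Eratosthenes; a minimal sketch (the name primes_up_to and the details are just illustrative):

fn primes_up_to(limit: usize) -> Vec<u64> {
    // Classic bounded sieve: cross out multiples, collect what survives.
    let mut composite = vec![false; limit + 1];
    let mut primes = Vec::new();
    for i in 2..=limit {
        if !composite[i] {
            primes.push(i as u64);
            let mut j = i * i;
            while j <= limit {
                composite[j] = true;
                j += i;
            }
        }
    }
    primes
}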


Your problem is algorithmic complexity.

The Python sieve function does trial division by all smaller prime numbers. The complexity of generating up to n this way is O(n^2 / log^2 n). You would do a lot better just doing trial division by all numbers up to sqrt(n).

The best way is to run the standard sieve of Eratosthenes on a bit array, which has complexity O(n log log n). After you run out of the array, double the size and re-do it, which will still amortize to O(n log log n).
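As a rough sketch of that doubling idea (illustrative only: a Vec<bool> stands in for a real bit array, and the starting size is arbitrary):

struct GrowingSieve {
    composite: Vec<bool>, // composite[i] == true means i has been crossed out
    next: usize,          // next candidate to inspect
}

impl GrowingSieve {
    fn new() -> Self {
        let mut sieve = GrowingSieve { composite: Vec::new(), next: 2 };
        sieve.grow(1 << 10); // arbitrary starting size
        sieve
    }

    // Re-sieve everything below `limit` from scratch.
    fn grow(&mut self, limit: usize) {
        self.composite = vec![false; limit];
        let mut i = 2;
        while i * i < limit {
            if !self.composite[i] {
                let mut j = i * i;
                while j < limit {
                    self.composite[j] = true;
                    j += i;
                }
            }
            i += 1;
        }
    }
}

impl Iterator for GrowingSieve {
    type Item = usize;
    fn next(&mut self) -> Option<usize> {
        loop {
            if self.next >= self.composite.len() {
                // Ran off the end: double the array and re-sieve.
                let doubled = self.composite.len() * 2;
                self.grow(doubled);
            }
            let n = self.next;
            self.next += 1;
            if !self.composite[n] {
                return Some(n);
            }
        }
    }
}

Re-sieving from scratch after each doubling looks wasteful, but the final round dominates the total work, which is why it still amortizes to O(n log log n).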

To reduce memory use from O(n) bits to O(sqrt(n)) bits you can use the segmented version of the sieve: you only need to store O(sqrt(n)) bits at a time.

Miller-Rabin is worse for this task: not only is it probabilistic and more complicated, it is also slower asymptotically: O(n log n).


For fun, I decided to write one of these in Rust. I haven't profiled it, and I'm sure there's plenty of ways to improve its performance.

(It also might not meet the definition of a sieve, as it uses a priority queue in place of an array)

Edit: Make sure code behaves properly at the top end of the integer range

use std::cmp::Reverse;
use std::collections::BinaryHeap;

type UINT = u16;

struct Sieve {
    candidate: UINT,
    multiples: BinaryHeap<Reverse<(UINT, UINT)>>
}

impl Sieve {
    pub fn new() -> Self {
        Sieve {
            candidate: 2,
            multiples: Default::default()
        }
    }
}

impl Iterator for Sieve {
    type Item = UINT;
    fn next(&mut self) -> Option<UINT> {
        let mut candidate;
        loop {
            if self.candidate == 0 {
                // Overflow
                return None;
            }
            candidate = self.candidate;
            self.candidate = candidate.wrapping_add(1);
            let mut is_prime = true;
            loop {
                match self.multiples.peek() {
                    Some(&Reverse((c,base))) if c == candidate => {
                        self.multiples.pop();
                        is_prime = false;
                        if let Some(mult) = c.checked_add(base) {
                            self.multiples.push(Reverse((mult, base)));
                        }
                    },
                    _ => { break; }
                }
            }
            if is_prime { break; }
        }
        if let Some(square) = candidate.checked_mul(candidate) {
            self.multiples.push(Reverse((square, candidate)));
        }
        return Some(candidate);
    }
}

The Python sieve function does trial division by all smaller prime numbers. The complexity of generating up to n this way is O(n^2 / log^2 n). You would do a lot better just doing trial division by all numbers up to sqrt(n).

You can do even better than that pretty easily. Do trial division by all primes under sqrt(n), then the complexity is about O(n*sqrt(n)/ln(n)).

Of course, a segmented sieve would be much better, but it would be harder to add to a working program than simply having an if cur_prime**2 > n: break.

And although Miller-Rabin is probabilistic, there are deterministic versions which have been checked for all numbers under 2^64 (and this is for competitive programming, so there shouldn't be any numbers bigger than 2^64 that need to be primality-checked).

(The link I posted earlier is to the base sets needed for the deterministic version, for those who want them.)
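For those curious what that looks like in Rust, here's a rough sketch assuming the commonly cited deterministic base set {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37} for 64-bit inputs (not necessarily the base set from the link, which isn't reproduced here):

// Deterministic Miller-Rabin for u64, using u128 arithmetic so the modular
// multiplications can't overflow. A sketch, not a tuned implementation.
fn mul_mod(a: u64, b: u64, m: u64) -> u64 {
    ((a as u128 * b as u128) % m as u128) as u64
}

fn pow_mod(mut base: u64, mut exp: u64, m: u64) -> u64 {
    let mut result = 1 % m;
    base %= m;
    while exp > 0 {
        if exp & 1 == 1 {
            result = mul_mod(result, base, m);
        }
        base = mul_mod(base, base, m);
        exp >>= 1;
    }
    result
}

fn is_prime(n: u64) -> bool {
    const BASES: [u64; 12] = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37];
    if n < 2 {
        return false;
    }
    for &p in &BASES {
        if n % p == 0 {
            return n == p;
        }
    }
    // Write n - 1 as d * 2^s with d odd.
    let s = (n - 1).trailing_zeros();
    let d = (n - 1) >> s;
    'witness: for &a in &BASES {
        let mut x = pow_mod(a, d, n);
        if x == 1 || x == n - 1 {
            continue;
        }
        for _ in 1..s {
            x = mul_mod(x, x, n);
            if x == n - 1 {
                continue 'witness;
            }
        }
        return false; // a is a witness that n is composite
    }
    true
}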

The Genuine Sieve of Eratosthenes is an excellent paper that explains the difference between Eratosthenes' sieve and the trial division algorithm often found masquerading under his name. I don't know Haskell but I still found the analysis to be fairly approachable. It also outlines a few streaming versions of the true sieve, one of which looks rather similar to @2e71828's code.


It looks like mine is halfway between the versions described in sections 3 and 3.1: I am storing a fixed increment in the priority queue, but O'Neill jumps straight to storing an iterator. This lays the groundwork for the wheel-optimized implementation described in 3.2.

Updating my implementation to a wheel-based one could be a reasonable exercise for somebody; the rough plan would be:

  • Define a WheelIter that produces the wheel sequence multiplied by an arbitrary base value (roughly sketched after this list)
  • Define an Ord method for WheelIter that sorts by the next value to be produced (descending)
  • Redefine Sieve as struct Sieve { candidates: WheelIter, multiples: BinaryHeap<WheelIter> }
  • Add a special case to produce the small primes that were used to generate the wheel
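A rough sketch of that first step, assuming a 2-3-5 wheel (WHEEL_GAPS and the field names are illustrative; the Ord impl and the heap wiring from the later bullets aren't shown):

// Gaps between consecutive numbers coprime to 30, starting from 7:
// 7, 11, 13, 17, 19, 23, 29, 31, 37, ...
const WHEEL_GAPS: [u64; 8] = [4, 2, 4, 2, 4, 6, 2, 6];

struct WheelIter {
    base: u64,    // every produced value is base * (a number coprime to 30)
    current: u64, // current wheel position: 7, 11, 13, ...
    index: usize, // which gap to apply next
}

impl WheelIter {
    fn new(base: u64) -> Self {
        WheelIter { base, current: 7, index: 0 }
    }
}

impl Iterator for WheelIter {
    type Item = u64;
    fn next(&mut self) -> Option<u64> {
        let value = self.base.checked_mul(self.current)?;
        self.current += WHEEL_GAPS[self.index];
        self.index = (self.index + 1) % WHEEL_GAPS.len();
        Some(value)
    }
}

WheelIter::new(1) would drive the candidate stream and WheelIter::new(p) the multiples of a found prime p; a fuller version would start a prime's multiples at its square, as the paper does, though the redundant crossings here only cost time rather than correctness.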

The only reason this paper suggests using a binary heap is to avoid using arrays, because she wants a "purely functional" implementation and considers arrays non-functional (but Haskell has arrays!).

This adds an unnecessary log factor to the runtime.

Using a simple array to cross out multiples has better asymptotic complexity and is likely to perform better in practice too.

I can't find anything in the paper that supports this assertion. By my reading, she chooses a multimap (later refined to a heap) to avoid precalculating lots of multiples of the found primes: the traditional array approach requires a predefined upper bound for the search space, which doesn't easily translate into an infinite-length iterator of primes.

At least, that's the reason I chose one when writing my implementation (before I had even heard of the paper).

Just double the array size every time you reach the end and recompute. It still amortizes to the same total complexity, O(n log log n).

With a segmented sieve it's even more natural. The segments become twice as long, and you continue. You need to add more small primes periodically to make sure they go up to sqrt(n), but it's easy: double the limit for small primes when necessary and recompute them. The time spent doing this is insignificant because it is dominated by sieving the segments anyway.
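As a rough sketch of that segmented, growing version (illustrative names, an arbitrary segment size, and no tuning):

use std::collections::VecDeque;

struct SegmentedSieve {
    small_primes: Vec<u64>,  // primes up to small_limit
    small_limit: u64,
    segment_start: u64,      // first number covered by the next segment
    segment_len: u64,
    buffered: VecDeque<u64>, // primes found in the current segment
}

impl SegmentedSieve {
    fn new() -> Self {
        SegmentedSieve {
            small_primes: Vec::new(),
            small_limit: 1,
            segment_start: 2,
            segment_len: 1 << 12,
            buffered: VecDeque::new(),
        }
    }

    // Recompute the small primes up to `limit` with a plain bounded sieve.
    fn grow_small_primes(&mut self, limit: u64) {
        let limit = limit.max(2);
        let mut composite = vec![false; (limit + 1) as usize];
        self.small_primes.clear();
        for i in 2..=limit as usize {
            if !composite[i] {
                self.small_primes.push(i as u64);
                let mut j = i * i;
                while j <= limit as usize {
                    composite[j] = true;
                    j += i;
                }
            }
        }
        self.small_limit = limit;
    }

    fn sieve_next_segment(&mut self) {
        let start = self.segment_start;
        let end = start + self.segment_len; // exclusive
        // Double the small-prime limit until it covers sqrt(end).
        while self.small_limit * self.small_limit < end {
            let doubled = self.small_limit * 2;
            self.grow_small_primes(doubled);
        }
        let mut composite = vec![false; self.segment_len as usize];
        for &p in &self.small_primes {
            if p * p >= end {
                break;
            }
            // First multiple of p within [start, end), but never p itself.
            let mut m = ((start + p - 1) / p) * p;
            if m < p * p {
                m = p * p;
            }
            while m < end {
                composite[(m - start) as usize] = true;
                m += p;
            }
        }
        for (i, &crossed) in composite.iter().enumerate() {
            if !crossed {
                self.buffered.push_back(start + i as u64);
            }
        }
        self.segment_start = end;
    }
}

impl Iterator for SegmentedSieve {
    type Item = u64;
    fn next(&mut self) -> Option<u64> {
        while self.buffered.is_empty() {
            self.sieve_next_segment();
        }
        self.buffered.pop_front()
    }
}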

Here are quotes from the paper.

But while the current state of the art in prime-number sieves is interesting, current techniques are largely tangential to the purposes of this paper, which was merely to reproduce the original Sieve of Eratosthenes as faithfully as possible in a lazy functional form. Incremental algorithms exist for performing sieving (Bengelloun, 1986; Pritchard, 1994), but these algorithms are typically array-based.

The time complexity of this algorithm appears slightly worse than typical imperative renditions of the sieve because of the Θ(log n) cost of using a tree, thus the performance for finding all primes less than n is Θ(n log n log log n),
