Stack overflow when iterating over very large collection


#1

I have a friend benchmarking addition of two arrays in Ruby each with 100 million elements. I wanted to do the same in Rust and compare performance but I’m getting stack overflow errors if I go much further beyond 500 thousand items in each array.

Here are my two benchmarks that work:

#![feature(test)]
extern crate test;
use test::Bencher;

const LENGTH: usize = 500000;

#[bench]
fn add_replace(d: &mut Bencher) {
  d.iter(|| {
    let mut a = [2_u64; LENGTH];
    let b = [3_u64; LENGTH];

    for i in 0..LENGTH {
      let v = a[i];
      std::mem::replace(&mut a[i], v + b[i]);
    }
  });
}

#[bench]
fn add_push(d: &mut Bencher) {
  d.iter(|| {
    let a = [2_u64; LENGTH];
    let b = [3_u64; LENGTH];

    let mut c: Vec<u64> = Vec::with_capacity(LENGTH);

    for i in 0..LENGTH {
      c.push(a[i] + b[i]);
    }
  });
}

And the results:

running 2 tests
test add_push    ... bench:   3,238,045 ns/iter (+/- 126,598)
test add_replace ... bench:   1,523,138 ns/iter (+/- 48,715)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

But if you change the 500000 to 600000 I get this error on both benches:

running 2 tests
test add_push    ... 
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
error: An unknown error occurred

Running --verbose offers no help at all here.

Can you help me get past these limitations? And can you help clarify what limitations I’m hitting here? Thanks!


Surprising Auto-Vectorization/Optimization Behavior
#2

You are allocating arrays in-place (on the stack). You have two options, the first simple option is to allocate your arrays on the heap, using:

let a = vec![2_u64; LENGTH];

Instead of:

let a = [2_u64; LENGTH];

Your other option is to increase the stack space of your process. Unfortunately I think you can’t do this with Rustc command line arguments, so in your program you have to create a new thread with a larger stack.

Another solution to try:

const LENGTH: usize = 500_000;

#[bench]
fn add_vecs(d: &mut Bencher) {
    d.iter(|| {
        let v1 = vec![2_u64; LENGTH];
        let v2 = vec![3; LENGTH];

        let vr: Vec<_> = v1
                         .iter()
                         .zip(&v2).map(|(&x, &y)| x + y)
                         .collect();
    });
}

#3

Thanks! That helps a lot. I’ve done some experimenting and found the largest array of u64 to be a length of 522989.

Testing vec vs array with length of 522989 and adding in your example I get the following results:

running 5 tests
test add_push          ... bench:   8,434,232 ns/iter (+/- 637,408)
test add_replace       ... bench:   5,105,690 ns/iter (+/- 705,024)
test add_zip           ... bench:   7,674,469 ns/iter (+/- 794,977)
test array_add_push    ... bench:   4,709,414 ns/iter (+/- 303,907)
test array_add_replace ... bench:   1,553,370 ns/iter (+/- 403,634)

test result: ok. 0 passed; 0 failed; 0 ignored; 5 measured; 0 filtered out

It’s nice that I can now work on longer collections. I see there is a cost to doing it this way though. But I suppose processing data dynamically would require this anyways.

Thanks again!


#4

I would expect some difference (as dynamic allocation has a cost and stack-allocated data is easier for a compiler to reason about), but that seems a bit much. Can you share the code of the latest versions, just to have a look at the generated ASM out of curiosity?


#5

Sure!

#![feature(test)]
extern crate test;
use test::Bencher;

const LENGTH: usize = 522989;

#[bench]
fn array_add_replace(d: &mut Bencher) {
  d.iter(|| {
    let mut a = [2_u64; LENGTH];
    let b = [3_u64; LENGTH];

    for i in 0..LENGTH {
      let v = a[i];
      std::mem::replace(&mut a[i], v + b[i]);
    }
  });
}

#[bench]
fn array_add_push(d: &mut Bencher) {
  d.iter(|| {
    let a = [2_u64; LENGTH];
    let b = [3_u64; LENGTH];

    let mut c: Vec<u64> = Vec::with_capacity(LENGTH); 

    for i in 0..LENGTH {
      c.push(a[i] + b[i]);
    }
  });
}

#[bench]
fn add_replace(d: &mut Bencher) {
  d.iter(|| {
    let mut a = vec![2_u64; LENGTH];
    let b = vec![3_u64; LENGTH];

    for i in 0..LENGTH {
      let v = a[i];
      std::mem::replace(&mut a[i], v + b[i]);
    }
  });
}

#[bench]
fn add_push(d: &mut Bencher) {
  d.iter(|| {
    let a = vec![2_u64; LENGTH];
    let b = vec![3_u64; LENGTH];

    let mut c: Vec<u64> = Vec::with_capacity(LENGTH); 

    for i in 0..LENGTH {
      c.push(a[i] + b[i]);
    }
  });
}

#[bench]
fn add_zip(d: &mut Bencher) {
  d.iter(|| {
    let v1 = vec![2_u64; LENGTH];
    let v2 = vec![3_u64; LENGTH];

    let _vr: Vec<_> = v1
      .iter()
      .zip(&v2).map(|(&x, &y)| x + y)
      .collect();
  });
}

#6

So, after tweaking the code a bit to get the asm from the playground (which does not support #[bench]), I think I understand better what’s going on.

In add_replace, for example, the compiler manages to elide the bound checks associated with indexing in the array-based version, but not in the vec-based version. This has a number of harmful consequences for the latter:

  • On every inner loop iteration, the vec-based version must spend extra time to check if Vec indexing is in bounds. The array version performs no such check, as the compiler could prove that out-of-bounds won’t happen.
  • The inner loops of the array-based version were quite liberally unrolled and vectorized by the compiler, whereas the inner loops of the Vec-based version are very conservative and only performs one operation at a time, which greatly reduces their efficiency.

The codegen for add_zip is also pretty bad, but in a different way. Although it doesn’t have bound checks he asm is very complex for something which is ultimately just a vector addition. I’m not sure what’s going on there.

Just out of curiosity, why do you use std::mem::replace instead of a mere a[i] += b[i]?

Another thing which I noted is that your benchmarks include the initialization of the a and b arrays/vecs. This seems unrelated to what you are trying to measure here, was this intended? If not, you may want to store a and b in static variables, or lazy_statics for the vecs…

EDIT: Here’s a variant of my playground-friendly version that does all of the above. If you could benchmark this one, I’d be curious about the results :slight_smile:

https://play.rust-lang.org/?gist=262884b25551dd85c880518c749fa91a&version=nightly&mode=release


#7

When I tried my friend’s Ruby benchmark my computer wasn’t able to handle everything in memory at 100 million integers with my 8GBs of RAM. So I chose to re-use the same space in RAM rather than consuming that much more. It turns out it’s a non-issue for me in Rust though.

Oops… += does the same thing doesn’t it. ¯\_(ツ)_/¯

Sure. I substituted

#[inline(never)]
fn function_name() {

for

#[bench]
fn function_name(d: &mut Bencher) {
  d.iter(|| {

And ran your code and here are the results.

running 5 tests
test add_push          ... bench:   3,913,203 ns/iter (+/- 214,706)
test add_replace       ... bench:   3,161,929 ns/iter (+/- 146,517)
test add_zip           ... bench:   1,963,979 ns/iter (+/- 50,868)
test array_add_push    ... bench:   2,009,887 ns/iter (+/- 154,225)
test array_add_replace ... bench:   1,743,657 ns/iter (+/- 63,325)

test result: ok. 0 passed; 0 failed; 0 ignored; 5 measured; 0 filtered out

Your implementation is definitely faster overall.

Not intended… but is consistent with the Ruby code my friend implemented.


#8

Then I would propose changing it in the Ruby code as well :slight_smile: Because that is closer to what you initially meant to benchmark, and as you can see it changes the picture quite a bit for the add_zip version vs the array-based versions.


#9

You don’t even need that; you can just define them before the call to .iter() rather than inside it! :smiley:


#10

I think the issue comes down to two things:

  1. You are allocating and de-allocating memory each iteration. This is more complicated than advancing the stack frame a fixed amount. You can change this by lifting the allocations outside the iter() loop.

  2. The compiler needs a bit of help to know that the vectors have the same length, whereas it does know this for the arrays. You can add some hints (in just a moment).

When I run your initial benchmarks, I got

test add_replace       ... bench:   5,056,937 ns/iter (+/- 607,143)
test array_add_replace ... bench:   1,353,722 ns/iter (+/- 112,698)

The first thing I did was to lift the vector allocations outside the iter loop, but repopulate them each iteration (with 2 and 3). This changed the times to

test add_replace       ... bench:   2,045,118 ns/iter (+/- 351,062)
test array_add_replace ... bench:   1,371,831 ns/iter (+/- 183,852)

If you then change things again by putting just in front of your for i in 0 .. LENGTH loop (in add_replace):

    let a = &mut a[..b.len()];
    let b = &b[..a.len()];

This clues the compiler in to the fact that these will end up with the same length, and it can turn on vectorizations. I’m not entirely clear why both of these are needed, but they both seemed to be. They produce:

test add_replace       ... bench:   1,419,173 ns/iter (+/- 264,335)
test array_add_replace ... bench:   1,376,418 ns/iter (+/- 161,324)

which seems to be getting close to the margin of error. At this point, about 2/3 of what you are measuring is the data population, rather than the summation. If you remove the data population (causing a to count up from 2 by multiples of 3) you get:

test add_replace       ... bench:     695,770 ns/iter (+/- 121,895)
test array_add_replace ... bench:     586,910 ns/iter (+/- 84,320)

Anyhow, I hope this was informative; It may or may not help with whatever you need, but the two approaches should be very similar once you get the same optimizations in place. Also, I totally recommend https://rust.godbolt.org in case you want to check out what the generated code looks like, and in particular how you have to hit the code to vectorize it.


#11

Since they are always the same, I thought that factoring them out in a single place made more sense :slight_smile: