Performance: strided sum of iterator (and rayon)

Hello,

In the following I demonstrate strided summation of an iterator for improved performance. My question is whether something like this can be achieved using only built-in functionality (i.e. without the custom sum2 function defined below).

My ultimate goal is to use Rayon. The original program can be parallelized simply by injecting a call to into_par_iter. It would be nice if the benefits of strided summation and parallelization could be obtained without writing specialized code.


Here is a small program that computes the sum of the items of an iterator. (For fun the result approximates π, see the Leibniz formula for π.)

use std::time::Instant;

fn pi(n_terms: usize) -> f64 {
    // Pairs of Leibniz terms: 1/d - 1/(d + 2) = 2/(d * (d + 2)),
    // with d = 1, 5, 9, ..., so the sum below equals pi/8.
    let iter = (0..n_terms)
        .step_by(2)
        .map(|i| {
            let d = (1 + 2 * i) as f64;
            1.0 / (d * (d + 2.0))
        });
    8.0 * iter.sum::<f64>()
}

fn main() {
    const N: usize = 1_000_000_000;
    let start = Instant::now();
    let pi = pi(N);
    let duration = start.elapsed();
    println!("π ≈ {}", pi);
    println!("Time elapsed: {:.4} s", duration.as_secs_f64());
}

Now it happens that the items of the iterator can be computed independently. But the fact that they are summed sequentially prevents modern CPUs from fully utilizing their compute units.

We can speed up the above program by performing strided iteration. For example, we could sum even and odd items into separate accumulators and only sum the two accumulators at the end. This can be realized by adding the following function

fn sum2<I: Iterator<Item = f64>>(mut iter: I) -> f64 {
    const N: usize = 2;
    // N independent accumulators instead of one running sum, so the
    // additions no longer form a single dependency chain.
    let mut acc = [0.0; N];
    'outer: loop {
        for i in 0..N {
            match iter.next() {
                Some(x) => acc[i] += x,
                None => break 'outer,
            }
        }
    }
    acc.iter().sum()
}

and replacing iter.sum::<f64>() by sum2(iter).

On my old laptop the program (release build) now runs twice as fast.

This is noteworthy because iterators do not actually guarantee that their iterations are independent. In the above case I guess that the compiler deduces the independence after inlining and can therefore perform the optimization anyway.

The word you're looking for here is "vectorization".

See point 4 in the following post for why this usually doesn't happen automatically with f64:

Thanks a lot for the interesting pointers!

The nightly support for SIMD looks useful, but I think strided reductions have value beyond SIMD (I can imagine that a CPU core can pipeline multiple independent SIMD operations).
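For instance, sum2 generalizes naturally to wider strides. A minimal sketch using const generics (sum_n is a hypothetical name; the best N depends on the CPU's floating-point add latency and throughput):

fn sum_n<const N: usize, I: Iterator<Item = f64>>(mut iter: I) -> f64 {
    // N independent accumulators; items are distributed round-robin
    // and the partial sums are combined at the end.
    let mut acc = [0.0; N];
    'outer: loop {
        for i in 0..N {
            match iter.next() {
                Some(x) => acc[i] += x,
                None => break 'outer,
            }
        }
    }
    acc.iter().sum()
}

With this, sum2(iter) becomes sum_n::<2, _>(iter), and wider strides such as sum_n::<8, _>(iter) can be tried without further changes.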

It turns out that what I was looking for is the "chunks" support of the built-in iterators. However, these only work on slices (not on ranges). In most cases that will be just what is needed: in my experience, most real-world vectorizable algorithms involve arrays of data stored in memory.
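On a slice, the same trick works on stable via chunks_exact, which also hands back the leftover items. A minimal sketch (sum_slice is a hypothetical helper):

fn sum_slice(data: &[f64]) -> f64 {
    const L: usize = 4;
    let chunks = data.chunks_exact(L);
    // chunks_exact exposes the remainder up front, before iteration.
    let tail: f64 = chunks.remainder().iter().sum();
    let acc = chunks.fold([0.0; L], |mut acc, chunk| {
        for i in 0..L {
            acc[i] += chunk[i];
        }
        acc
    });
    acc.iter().sum::<f64>() + tail
}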

Still, it turns out that my artificial zero-data example can be vectorized using built-in features (although only available in nightly):

#![feature(iter_array_chunks)]

fn pi(n_terms: usize) -> f64 {
    const L: usize = 4;

    let chunks = (1..(2 * n_terms + 1))
        .step_by(4)
        .array_chunks::<L>();

    let sums = chunks.fold([0.0; L], |mut acc, chunk| {
        for i in 0..L {
            let d = chunk[i] as f64;
            acc[i] += 1.0 / (d * (d + 2.0));
        }
        acc
    });
    // MISSING: Process the remainder (see the sketch below this function).
    8.0 * sums.iter().sum::<f64>()
}
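For the record, the remainder could be handled by keeping the chunks iterator alive with by_ref and draining into_remainder afterwards; a sketch of the body of pi along those lines (untested):

    let mut chunks = (1..(2 * n_terms + 1))
        .step_by(4)
        .array_chunks::<L>();

    let sums = chunks.by_ref().fold([0.0; L], |mut acc, chunk| {
        for i in 0..L {
            let d = chunk[i] as f64;
            acc[i] += 1.0 / (d * (d + 2.0));
        }
        acc
    });

    // After the fold has exhausted the iterator, up to L - 1 items remain.
    let tail: f64 = chunks.into_remainder().map_or(0.0, |rem| {
        rem.map(|d| {
            let d = d as f64;
            1.0 / (d * (d + 2.0))
        })
        .sum()
    });

    8.0 * (sums.iter().sum::<f64>() + tail)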

This array_chunks is not supported by Rayon, however. Rayon's indexed parallel iterators do have their own support for "chunks", but it involves allocation of vectors, so it's not appropriate here. For vectorizing operations over slices of memory, Rayon has as_parallel_slice.
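To illustrate the slice route, here is a sketch of a parallel chunked sum with par_chunks (par_sum and the chunk size 1024 are assumptions to tune):

use rayon::prelude::*;

fn par_sum(data: &[f64]) -> f64 {
    // Each chunk is summed sequentially, a tight loop the optimizer can
    // vectorize; Rayon then reduces the per-chunk sums in parallel.
    data.par_chunks(1024)
        .map(|chunk| chunk.iter().sum::<f64>())
        .sum()
}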

If you're looking for this on general Iterators, see also https://docs.rs/itertools/latest/itertools/trait.Itertools.html#method.chunks and https://docs.rs/itertools/latest/itertools/trait.Itertools.html#method.tuples.

I have draft PRs -- just hoping to see std stabilize first so we can match the API.


I read the portable SIMD RFC. This is so cool! Programmers have been waiting for something comparable at least since MMX appeared.

Still, I have the impression that in their current state portable SIMD and Rayon do not combine very well: All examples in the SIMD RFC use explicit loops, while Rayon relies on implicit iteration.

I see why this is technically challenging, but given that Rayon does not provide any guarantees regarding the order of reduction (i.e. it already accepts that floating-point addition is not strictly associative), it might be possible to come up with a crate that, given an iterator, will not only parallelize (like Rayon) but also vectorize the computation.

This is generally achieved through so-called "loop unrolling".

Not sure I understand what you mean. (For sure I know about loop unrolling.)

What I tried to say is that even if one has an iterator of, say, Simd<f64, 4>, simply summing up that iterator using the iterator's sum method may not fully utilize the hardware, because the subsequent summations (unrolled or not) are not independent: they all share a single common accumulator.
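Concretely, I mean something like the following (nightly portable SIMD; a sketch only, and module paths like std::simd::num::SimdFloat reflect a recent nightly and may still change):

#![feature(portable_simd)]
use std::simd::{f64x4, num::SimdFloat};

fn sum_simd<I: Iterator<Item = f64x4>>(mut iter: I) -> f64 {
    // Two independent vector accumulators, so consecutive vector adds
    // need not wait on one another.
    let mut acc = [f64x4::splat(0.0); 2];
    'outer: loop {
        for a in acc.iter_mut() {
            match iter.next() {
                Some(v) => *a += v,
                None => break 'outer,
            }
        }
    }
    (acc[0] + acc[1]).reduce_sum()
}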

Explicit SIMD could be better, but it's not too bad on the implicit side. Rayon lowers to regular iteration most of the time after it's done with parallel splits, and I have definitely seen auto-vectorization go to work on those loops when building with sufficient features enabled, like RUSTFLAGS=-Ctarget-cpu=native. You can also use methods like with_min_len or fold_chunks to ensure that sequential runs have an appropriate size.
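For what it's worth, a sketch of the pi computation along those lines (untested; fold_chunks needs a recent Rayon, and the chunk size 65536 is an arbitrary guess):

use rayon::prelude::*;

fn pi_par(n_terms: usize) -> f64 {
    let sum: f64 = (0..n_terms)
        .into_par_iter()
        .step_by(2)
        .map(|i| {
            let d = (1 + 2 * i) as f64;
            1.0 / (d * (d + 2.0))
        })
        // Each chunk is folded sequentially, giving the optimizer a
        // sizable loop before Rayon sums the partial results.
        .fold_chunks(65536, || 0.0, |acc, x| acc + x)
        .sum();
    8.0 * sum
}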
