Out of pure curiosity and coincidence, I tried to benchmark the following simple example:
pub fn test_iter() -> Vec<usize> {
    (0..12)
        .filter(|x| x % 2 == 0)
        .map(|x| x * x)
        .collect()
}

pub fn test_for() -> Vec<usize> {
    let n = 12;
    let mut result = Vec::with_capacity(n);
    for x in 0..n {
        if x % 2 == 0 {
            result.push(x * x);
        }
    }
    result
}
To my surprise, the for loop performs about 5x faster than the iterator version on my Core i7 with Rust 1.42.
$ cargo bench
Compiling bench_iter v0.1.0 (/home/td/projects-sandbox/rust_bench_iter)
Finished bench [optimized] target(s) in 3.47s
Running target/release/deps/bench_iter-e0df5c1fe8074663
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running target/release/deps/my_benchmark-7c1f3cac9736b209
iter time: [123.41 ns 123.51 ns 123.62 ns]
change: [-28.213% -28.099% -27.980%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
4 (4.00%) high mild
1 (1.00%) high severe
for time: [23.755 ns 23.777 ns 23.799 ns]
change: [+0.4685% +0.6315% +0.8007%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) high mild
3 (3.00%) high severe
Am I doing something stupid with the benchmark, or is Rust actually performing poorly in this particular case?
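(Aside: one common pitfall with micro-benchmarks like this is that the compiler can constant-fold work when the input is a compile-time constant. A minimal sketch of how to guard against that, using a renamed, parameterized variant of the function plus `std::hint::black_box` — stabilized well after Rust 1.42; Criterion ships its own `black_box` that serves the same purpose:)

```rust
use std::hint::black_box;

// Hypothetical parameterized variant of test_iter, so a benchmark can
// hide the bound n from the optimizer.
pub fn squares_of_evens(n: usize) -> Vec<usize> {
    (0..n).filter(|x| x % 2 == 0).map(|x| x * x).collect()
}

fn main() {
    // black_box keeps the compiler from specializing the whole
    // computation for the constant 12.
    let v = squares_of_evens(black_box(12));
    assert_eq!(v, [0, 4, 16, 36, 64, 100]);
}
```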
For 1200 elements (I just added two zeroes), the results are much better:
Running target/release/deps/my_benchmark-7c1f3cac9736b209
iter time: [1.4242 us 1.4257 us 1.4272 us]
change: [-0.2425% -0.0512% +0.1491%] (p = 0.60 > 0.05)
No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low severe
2 (2.00%) low mild
4 (4.00%) high mild
1 (1.00%) high severe
for time: [1.3622 us 1.3633 us 1.3644 us]
change: [+11.248% +11.412% +11.587%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low mild
2 (2.00%) high mild
1 (1.00%) high severe
The iterator solution is now more or less on par with the for loop, so I guess it is mostly the setup that is much more expensive in the iterator version (the roughly 100 ns gap at n = 12 shrinks to about 60 ns at n = 1200, which looks like a fixed setup cost rather than a per-element one). I feel a lot better now, although it still seems there can be a major performance drop if the iterator version is used with a low iteration count inside a tight loop.
Sadly I'm on mobile right now and can't test it myself:
What happens if you put the iterator in its own type and provide a custom size_hint (or even ExactSizeIterator)?
Or the other way around: replace the with_capacity with just a Vec::new. I would imagine the speed should then be the same.
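The first suggestion might look something like this (untested sketch; EvenSquares is a made-up name, not code from the thread). A hand-written iterator over the even squares can report an exact remaining length, which lets collect preallocate and makes ExactSizeIterator implementable:

```rust
// Hypothetical iterator over x * x for even x in cur..n, with an exact size_hint.
struct EvenSquares {
    cur: usize,
    n: usize,
}

impl Iterator for EvenSquares {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        while self.cur < self.n {
            let x = self.cur;
            self.cur += 1;
            if x % 2 == 0 {
                return Some(x * x);
            }
        }
        None
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // Exact number of even values remaining in cur..n.
        let remaining = (self.n + 1) / 2 - (self.cur + 1) / 2;
        (remaining, Some(remaining))
    }
}

// With matching lower and upper bounds, the default len() just works.
impl ExactSizeIterator for EvenSquares {}

pub fn test_iter_exact() -> Vec<usize> {
    // collect() can use the exact size_hint to allocate once up front.
    EvenSquares { cur: 0, n: 12 }.collect()
}
```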
pub fn test_for() -> [usize; 6] {
    let n = 12;
    let mut index = 0;
    let mut result: [usize; 6] = [0; 6];
    for x in 0..n {
        if x % 2 == 0 {
            result[index] = x * x;
            index += 1;
        }
    }
    result
}
Oh, I'm surprised that in this version the iterator variant is still way slower:
pub fn test_iter_new() -> Vec<usize> {
    (0..1200)
        .filter(|x| x % 2 == 0)
        .map(|x| x * x)
        .collect()
}

pub fn test_iter_with_capacity() -> Vec<usize> {
    let n = 1200;
    (0..n)
        .filter(|x| x % 2 == 0)
        .map(|x| x * x)
        .fold(Vec::with_capacity(n), |mut result, x| {
            result.push(x);
            result
        })
}

pub fn test_for_with_capacity() -> Vec<usize> {
    let n = 1200;
    let mut result = Vec::with_capacity(n);
    for x in 0..n {
        if x % 2 == 0 {
            result.push(x * x);
        }
    }
    result
}
test benchs::iter_new ... bench: 2,136 ns/iter (+/- 55)
test benchs::iter_with_capacity ... bench: 7,533 ns/iter (+/- 131)
test benchs::for_with_capacity ... bench: 1,419 ns/iter (+/- 32)
I notice that so far all the with_capacity calls overallocate by a factor of two (only every other value passes the filter, so just n/2 slots are ever used), which probably does not affect speed very much, but is wasteful for large n.
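To make the overallocation concrete (a small illustration, with a made-up helper name, not benchmark code): with n = 1200 the filter keeps only 600 elements, so with_capacity(n) leaves half the buffer unused, while (n + 1) / 2 fits exactly:

```rust
// Hypothetical helper: squares of the even numbers below n, built into
// a Vec created with the given capacity.
fn squares_with_capacity(n: usize, cap: usize) -> Vec<usize> {
    let mut result = Vec::with_capacity(cap);
    for x in 0..n {
        if x % 2 == 0 {
            result.push(x * x);
        }
    }
    result
}

fn main() {
    let n = 1200;

    // Sized for all of 0..n, but only the 600 even values are pushed:
    // roughly half the buffer is never used.
    let oversized = squares_with_capacity(n, n);
    assert!(oversized.capacity() >= n);
    assert_eq!(oversized.len(), 600);

    // The number of even values in 0..n is (n + 1) / 2, so this fits exactly.
    let exact = squares_with_capacity(n, (n + 1) / 2);
    assert_eq!(exact.len(), 600);
}
```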
FWIW I would use extend rather than fold if you wanted to combine the approaches:
pub fn test_iter_with_extend() -> Vec<usize> {
    let n = 1200;
    let mut result = Vec::with_capacity((n + 1) / 2);
    result.extend((0..n).filter(|x| x % 2 == 0).map(|x| x * x));
    result
}
No idea whether this is faster; it just seems more natural to me for gathering into a collection.
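A quick sanity check of that version (self-contained copy of the function above): the pre-sized buffer of (n + 1) / 2 slots ends up exactly full, since the evens below 1200 are 0, 2, …, 1198:

```rust
pub fn test_iter_with_extend() -> Vec<usize> {
    let n = 1200;
    let mut result = Vec::with_capacity((n + 1) / 2);
    result.extend((0..n).filter(|x| x % 2 == 0).map(|x| x * x));
    result
}

fn main() {
    let v = test_iter_with_extend();
    assert_eq!(v.len(), 600); // one element per even value below 1200
    assert_eq!(v[0], 0);
    assert_eq!(v[1], 4);
    assert_eq!(v[599], 1198 * 1198); // square of the last even value
}
```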