Why does `vec![]` work faster than the alternatives?

Benchmark code:


#[macro_use]
extern crate criterion;
use criterion::Criterion;

use rand::distributions::{Distribution, Standard};
use rand::{rngs::StdRng, Rng, SeedableRng};

/// Returns fixed seedable RNG
pub fn seedable_rng() -> StdRng {
    StdRng::seed_from_u64(42)
}

fn gen_array<T>(size: usize) -> Vec<T>
where
    Standard: Distribution<T>,
{
    let mut rng = seedable_rng();
    (0..size).map(|_| rng.gen()).collect::<Vec<T>>()
}

fn add_benchmark(c: &mut Criterion) {
    let size = 1048576;
    let i32_array: Vec<i32> = gen_array(size);

    c.bench_function("vabs_d", |b| {
        b.iter(|| criterion::black_box(vabs_d(&i32_array)))
    });

    c.bench_function("vabs_v", |b| {
        b.iter(|| criterion::black_box(vabs_v(&i32_array)))
    });

    c.bench_function("vabs_e", |b| {
        b.iter(|| criterion::black_box(vabs_e(&i32_array)))
    });
}

fn vabs_d(array: &Vec<i32>) -> Vec<u32> {
    let it = array.iter().map(|x| x.abs() as u32);
    Vec::<u32>::from_iter(it)
}

fn vabs_v(array: &Vec<i32>) -> Vec<u32> {
    let mut data = vec![0u32; array.len()];
    data.iter_mut().zip(array.iter()).for_each(|(i, x)| {
        *i = x.abs() as u32;
    });
    data
}

fn vabs_e(array: &Vec<i32>) -> Vec<u32> {
    let mut data = Vec::with_capacity(array.len());
    unsafe { data.set_len(array.len()) }
    data.iter_mut().zip(array.iter()).for_each(|(i, x)| {
        *i = x.abs() as u32;
    });

    data
}

criterion_group!(benches, add_benchmark);
criterion_main!(benches);

Results:

vabs_d                  time:   [347.66 us 356.23 us 364.24 us]                   
                        change: [-5.5815% -3.2435% -0.6228%] (p = 0.01 < 0.05)


vabs_v                  time:   [247.92 us 255.90 us 264.47 us]                   
                        change: [-8.5961% -4.5200% -0.5549%] (p = 0.03 < 0.05)


vabs_e                  time:   [327.41 us 335.39 us 343.17 us]                   
                        change: [-2.5623% -0.8672% +0.8043%] (p = 0.30 > 0.05)


Benches run on my 2021 MacBook Pro (M1 Max).
Toolchain: nightly-2021-11-30-aarch64-apple-darwin

I don't see any reason why vabs_v should be ~30-40% faster than the others.
Compared with vabs_e, it initializes the whole vec up front, so I'd expect it to be slower.

I'm not familiar with asm code; the generated assembly is up on Compiler Explorer.


For the record, vabs_e is UB.

I'm on mobile right now, so someone else can explain more in depth, but you can't create (and iterate over) a Vec out of uninitialized memory like that. You need to use MaybeUninit items and convert the vector later, or use raw pointers to write the data before calling set_len.


I don't have an answer for you, but a few side notes.

An array [T; N], which is fixed-length, is distinct from a Vec<T> in Rust. So you may want to use different names.
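
For example (a trivial sketch, not from the thread):

let a: [i32; 4] = [1, -2, 3, -4];     // [T; N]: length fixed at compile time
let v: Vec<i32> = vec![1, -2, 3, -4]; // Vec<T>: heap-allocated and growable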

This is undefined behavior (UB), so it doesn't really matter how fast or slow it is; don't use it:

fn vabs_e(array: &Vec<i32>) -> Vec<u32> {
    let mut data = Vec::with_capacity(array.len());
    unsafe { data.set_len(array.len()) }
    data.iter_mut().zip(array.iter()).for_each(|(i, x)| {
        *i = x.abs() as u32;
    });

    data
}

A sound version would probably use MaybeUninit.
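
For instance, a sound sketch of that approach might look like this (not from the thread; vabs_m is a made-up name):

use std::mem::{ManuallyDrop, MaybeUninit};

fn vabs_m(array: &[i32]) -> Vec<u32> {
    let mut data: Vec<MaybeUninit<u32>> = Vec::with_capacity(array.len());
    // SAFETY: MaybeUninit<u32> may legally hold uninitialized bytes,
    // and the new length matches the capacity.
    unsafe { data.set_len(array.len()) };
    for (slot, x) in data.iter_mut().zip(array) {
        slot.write(x.abs() as u32);
    }
    // SAFETY: every element was initialized above, and MaybeUninit<u32>
    // has the same layout as u32.
    let mut data = ManuallyDrop::new(data);
    unsafe { Vec::from_raw_parts(data.as_mut_ptr().cast::<u32>(), data.len(), data.capacity()) }
}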

But incidentally, the optimizer almost certainly doesn't actually initialize the Vec to all 0s in vabs_v. (Many allocators do this by default anyway, and even if not, it probably sees you overwrite the values without reading them.)


Uninitialized data is invalid data, and when you iterate over those values, you're creating references to invalid values, which is UB. You also can't rely on the internal implementation of Vec to not create references, so as mentioned above, you need to either use a Vec of MaybeUninit or obtain a raw pointer and write all your data (using pointer writes, and never reading or creating references) and then set_len or use from_raw_parts, etc.

But again, the compiler probably does this as well as, or better than, hand-written unsafe code.


Thanks for pointing out the UB; I used C++ before and didn't realize that.

Changed to:

fn vabs_e(array: &Vec<i32>) -> Vec<u32> {
    let mut data = Vec::with_capacity(array.len());
    let data_mut = data.as_mut_ptr() as *mut u32;

    unsafe {
        array.iter().enumerate().for_each(|(i, x)| {
            std::ptr::write(data_mut.add(i), x.abs() as u32);
        });
        data.set_len(array.len())
    }
    data
}

vabs_e time: [319.82 us 326.58 us 333.52 us]

I still haven't figured out why vabs_v performs better.
Does this mean we should always use vec![] to initialize a vector, instead of from_iter or with_capacity + ptr::write?


It's not a coincidence; it's hand-written specialization code in the stdlib:

impl<T: Clone + IsZero> SpecFromElem for T {
    #[inline]
    fn from_elem<A: Allocator>(elem: T, n: usize, alloc: A) -> Vec<T, A> {
        if elem.is_zero() {
            return Vec { buf: RawVec::with_capacity_zeroed_in(n, alloc), len: n };
        }
        let mut v = Vec::with_capacity_in(n, alloc);
        v.extend_with(n, ExtendElement(elem));
        v
    }
}
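
A rough way to observe this fast path (a sketch, not from the thread; timings are illustrative and depend on the allocator and OS):

use std::time::Instant;

fn main() {
    let n = 1 << 24;
    // Zero element: eligible for with_capacity_zeroed_in, which can hand
    // back already-zeroed pages instead of writing every slot.
    let t = Instant::now();
    let zeros = vec![0u32; n];
    println!("vec![0u32; n]: {:?}", t.elapsed());
    // Non-zero element: every slot must actually be written.
    let t = Instant::now();
    let ones = vec![1u32; n];
    println!("vec![1u32; n]: {:?}", t.elapsed());
    assert_eq!(zeros.len(), ones.len());
}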

"Always" is a strong assertion. There's no guarantee things will act the same in every circumstance, nor that the performance of the different approaches won't change in the future.

So I'd go with whatever safe version is most convenient to maintain, until profiling indicates that the cost of initialization is a relevant concern at all. Even then, I'd only reach for an unsafe solution in extreme cases; it's the stdlib's job to soundly wrap up the unsafety so that I, and other maintainers of my crate, don't have to worry about it.

(N.b. I still haven't tried to pinpoint why you're seeing a difference.)

Just a figure of speech. That's part of the mechanism that allows allocators to not zero the memory if it's already known to be zeroed (e.g. by the kernel).


One potential explanation would be if Vec::from_iter were reallocating rather than preallocating. iter::Map<slice::Iter<'_, i32>> is ExactSizeIterator + TrustedLen, though, so I'd think that isn't the case; std has a specialization to allocate only once.
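
A quick sanity check of the exact-length claim (a sketch, not from the thread):

fn main() {
    let v = vec![1i32, -2, 3];
    let it = v.iter().map(|x| x.abs() as u32);
    // The exact length is known up front; std's TrustedLen specialization
    // relies on this to allocate the result Vec in a single shot.
    assert_eq!(it.size_hint(), (3, Some(3)));
}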

An interesting case to add to the comparison:

fn vabs_x(array: &[i32]) -> Vec<u32> {
    let mut data = Vec::with_capacity(array.len());
    data.extend(array.iter().map(|x| x.abs() as u32));
    data
}

The only real reason I can see _v being faster than all of the other versions is that _v manages to vectorize with simd and the others aren't seen through by the vectorization optimization pass.
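
One way to check (a sketch, not from the thread): compile a minimal version with -C opt-level=3 on Compiler Explorer and look for SIMD instructions in the output (e.g. NEON abs v*.4s on aarch64, or pabsd/vpabsd on x86_64).

// Paste into Compiler Explorer with -C opt-level=3 and inspect the
// loop body for vector instructions; vabs_v_check is a made-up name.
#[no_mangle]
pub fn vabs_v_check(array: &[i32]) -> Vec<u32> {
    let mut data = vec![0u32; array.len()];
    for (i, x) in data.iter_mut().zip(array) {
        *i = x.abs() as u32;
    }
    data
}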

I'd write it as:

fn vabs_c(data: &[i32]) -> Vec<u32> {
    data.iter().map(|x| x.abs() as u32).collect()
}

... and according to my benchmark (Rust 1.57.0 on a 2017 Linux x86_64 machine), that is actually the fastest (or it was in the first run). More surprisingly, vabs_v, which was fastest in your benchmark, seems to be consistently slowest in mine (while the differences between c, d, and e are smaller than the differences between consecutive runs, so I'm not sure there is a real difference there).

vabs_c                  time:   [271.48 us 280.62 us 290.40 us]
vabs_d                  time:   [267.45 us 277.01 us 287.13 us]
vabs_v                  time:   [341.64 us 351.19 us 360.72 us]
vabs_e                  time:   [255.40 us 263.52 us 272.73 us]

So, conclusion? Benchmarking is hard, I guess. :)

Edit: I tried with nightly (1.59.0-nightly (fcef61230 2021-12-17)) as well, and got similar results. So I guess the difference here is between x86_64 and aarch64.

vabs_c                  time:   [255.83 us 257.50 us 259.32 us]
vabs_d                  time:   [276.07 us 284.53 us 293.09 us]
vabs_v                  time:   [338.88 us 343.13 us 347.59 us]
vabs_e                  time:   [262.44 us 264.56 us 266.76 us]

OK, I think the reason is that auto-vectorization differs between the two architectures. Thanks, everyone!
