Why is moving a Arc<Vec<[u8; N]>> much faster than Vec<[u8; N]>?

Moving an Arc<Vec<_>> across threads is hundreds of times fast. Is the compiler optimizing away the results? If I change the number of vecs moved from 10k to 1k, the behaviour is normal again.

Results for N=10000:

Benchmarking move_vec: Warming up for 3.0000 s
move_vec                time:   [275.65 ms 282.99 ms 291.80 ms]

Benchmarking move_arc_vec_st: Warming up for 3.0000 s
move_arc_vec            time:   [277.06 ms 280.58 ms 284.38 ms]

Benchmarking move_arc_vec_spawn: Warming up for 3.0000 s
move_arc_vec #2         time:   [1.0580 ms 1.1929 ms 1.2892 ms]

Results for N=1000:

move_vec                time:   [36.155 ms 36.847 ms 37.509 ms]

move_arc_vec_st         time:   [35.767 ms 36.193 ms 36.636 ms]

move_arc_vec_spawn      time:   [36.266 ms 36.568 ms 36.914 ms]

Code:

use std::{sync::Arc, thread};

use criterion::{black_box, criterion_group, criterion_main, Criterion};

const VEC_SIZE: usize = 500;
const ELEMENT_SIZE: usize = 1232;
const NUM_VECS: usize = 10000;

use rand::{thread_rng, RngCore};

fn prepare_vec() -> Vec<[u8; ELEMENT_SIZE]> {
    let mut vec = vec![[0u8; ELEMENT_SIZE]; VEC_SIZE];
    let mut rng = thread_rng();

    for arr in vec.iter_mut() {
        rng.fill_bytes(arr);
    }

    vec
}
fn benchmark_move_vec(c: &mut Criterion) {
    c.bench_function("move_vec", |b| {
        b.iter_batched(
            || {
                // Setup: Create channel and pre-fill with 1000 Vecs
                let (sender, receiver) =
                    crossbeam_channel::bounded::<Vec<[u8; ELEMENT_SIZE]>>(100000);
                for _ in 0..NUM_VECS {
                    sender.send(prepare_vec()).unwrap();
                }
                receiver
            },
            |receiver| {
                // Benchmark: Receive the Vecs in a separate thread
                let handle = thread::spawn(move || {
                    for _i in 0..NUM_VECS {
                        let vec = receiver.recv().unwrap();
                        black_box(vec);
                    }
                });
                handle.join().unwrap();
            },
            criterion::BatchSize::LargeInput,
        )
    });
}
fn benchmark_move_arc_vec_slow(c: &mut Criterion) {
    c.bench_function("move_arc_vec_st", |b| {
        b.iter_batched(
            || {
                // Setup: Create channel and pre-fill with 1000 Arc<Vec>s
                let (sender, receiver) =
                    crossbeam_channel::bounded::<Arc<Vec<[u8; ELEMENT_SIZE]>>>(100000);
                for _ in 0..NUM_VECS {
                    sender.send(Arc::new(prepare_vec())).unwrap();
                }
                receiver
            },
            |receiver| {
                for _i in 0..NUM_VECS {
                    let arc_vec = receiver.recv().unwrap();
                    black_box(arc_vec);
                }
            },
            criterion::BatchSize::LargeInput,
        )
    });
}

fn benchmark_move_arc_vec_fast(c: &mut Criterion) {
    c.bench_function("move_arc_vec_spawn", |b| {
        b.iter_batched(
            || {
                // Setup: Create channel and pre-fill with 1000 Arc<Vec>s
                let (sender, receiver) =
                    crossbeam_channel::bounded::<Arc<Vec<[u8; ELEMENT_SIZE]>>>(100000);
                for _ in 0..NUM_VECS {
                    sender.send(Arc::new(prepare_vec())).unwrap();
                }
                receiver
            },
            |receiver| {
                // Benchmark: Receive the Arc<Vec>s in a separate thread
                let handle = thread::spawn(move || {
                    for _i in 0..NUM_VECS {
                        let arc_vec = receiver.recv().unwrap();
                        black_box(arc_vec);
                    }
                });
                handle.join().unwrap();
            },
            criterion::BatchSize::LargeInput,
        )
    });
}

criterion_group!(
    benches,
    benchmark_move_vec,
    benchmark_move_arc_vec_slow,
    benchmark_move_arc_vec_fast,
);
criterion_main!(benches);

“Moving a value between threads” doesn’t actually do anything to the value that moving it within one thread does, so there should be no difference. I suspect you are making some kind of measurement error. Could you rerun your benchmark and be sure it’s exactly the same code you show? In particular,

Benchmarking move_arc_vec_spawn: Warming up for 3.0000 s
move_arc_vec #2         time:   [1.0580 ms 1.1929 ms 1.2892 ms]

The name doesn’t match here, so something must have been edited. Please post un-edited results of a run of exactly the code you shared, to make sure no errors are introduced.

5 Likes

This part is speculation

I can only reproduce your findings by removing the join() in the third benchmark. :person_shrugging:

This part is wisdom

Criterion spits out a lot of sample count/target time warnings because your tests are performing a large number of iterations. Don't do that; that's criterion's job. [1]

When you isolate the code to measure, you'll notice that the benchmarks are measuring thread::spawn(). And that becomes significant at the micro level. On my system (macOS) it takes about 3x longer to spawn a thread than to send a small type to one that is already waiting.

When I fix those issues, the benchmarks run much faster and produce more accurate results. It shows that, at least on my system, the benchmark is completely dominated by thread synchronization.

move_vec                time:   [9.0705 µs 9.1496 µs 9.2271 µs]
move_arc_vec_st         time:   [167.28 ns 169.76 ns 172.62 ns]
move_arc_vec_spawn      time:   [9.0562 µs 9.1570 µs 9.2499 µs]
Fixed benchmark code
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use crossbeam_channel::Sender;
use rand::{thread_rng, RngCore};
use std::{sync::Arc, thread};

const VEC_SIZE: usize = 500;
const ELEMENT_SIZE: usize = 1232;

fn prepare_vec() -> Vec<[u8; ELEMENT_SIZE]> {
    let mut vec = vec![[0u8; ELEMENT_SIZE]; VEC_SIZE];
    let mut rng = thread_rng();

    for arr in vec.iter_mut() {
        rng.fill_bytes(arr);
    }

    vec
}

fn benchmark_move_vec(c: &mut Criterion) {
    c.bench_function("move_vec", |b| {
        b.iter_batched(
            || {
                let (sender, receiver) =
                    crossbeam_channel::bounded::<(Vec<[u8; ELEMENT_SIZE]>, Sender<()>)>(1);
                // Do not measure thread spawn
                thread::spawn(move || {
                    let (_, tx) = receiver.recv().unwrap();
                    tx.send(()).unwrap();
                });
                let (tx, rx) = crossbeam_channel::bounded(1);
                (prepare_vec(), sender, tx, rx)
            },
            |(vec, sender, tx, rx)| {
                sender.send((black_box(vec), tx)).unwrap();
                rx.recv().unwrap();
            },
            criterion::BatchSize::LargeInput,
        )
    });
}

fn benchmark_move_arc_vec_st(c: &mut Criterion) {
    c.bench_function("move_arc_vec_st", |b| {
        b.iter_batched(
            || {
                let (sender, receiver) = crossbeam_channel::bounded(1);
                (Arc::new(prepare_vec()), sender, receiver)
            },
            |(vec, sender, receiver)| {
                sender.send(black_box(vec)).unwrap();
                let arc_vec = receiver.recv().unwrap();
                black_box(arc_vec);
            },
            criterion::BatchSize::LargeInput,
        )
    });
}

fn benchmark_move_arc_vec_spawn(c: &mut Criterion) {
    c.bench_function("move_arc_vec_spawn", |b| {
        b.iter_batched(
            || {
                let (sender, receiver) =
                    crossbeam_channel::bounded::<(Arc<Vec<[u8; ELEMENT_SIZE]>>, Sender<()>)>(1);
                // Do not measure thread spawn
                thread::spawn(move || {
                    let (_, tx) = receiver.recv().unwrap();
                    tx.send(()).unwrap();
                });
                let (tx, rx) = crossbeam_channel::bounded(1);
                (Arc::new(prepare_vec()), sender, tx, rx)
            },
            |(arc_vec, sender, tx, rx)| {
                sender.send((black_box(arc_vec), tx)).unwrap();
                rx.recv().unwrap();
            },
            criterion::BatchSize::LargeInput,
        )
    });
}

criterion_group!(
    benches,
    benchmark_move_vec,
    benchmark_move_arc_vec_st,
    benchmark_move_arc_vec_spawn,
);
criterion_main!(benches);

Still, the results are not terribly interesting. Vec<T> is 24 bytes on a 64-bit system, and Arc<Vec<T>> is 8 bytes. Sending either is always inexpensive. There is no measurable difference.


  1. if you want to increase the number of iterations, use Criterion::sample_size() and Criterion::measurement_time(), as those warnings are suggesting. ↩︎

7 Likes

Here is the re-run for 10k:

Benchmarking move_vec: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 31.7s.
move_vec                time:   [268.17 ms 271.46 ms 275.19 ms]
                        change: [+620.77% +636.72% +653.20%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Benchmarking move_arc_vec_st: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 31.7s.
move_arc_vec_st         time:   [268.32 ms 270.54 ms 273.30 ms]
                        change: [+636.79% +647.50% +659.50%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Benchmarking move_arc_vec_spawn: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 19.8s.
move_arc_vec_spawn      time:   [1.0019 ms 1.1384 ms 1.2372 ms]
                        change: [-97.280% -96.887% -96.600%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 10 measurements (40.00%)
  1 (10.00%) low severe
  1 (10.00%) low mild
  1 (10.00%) high mild
  1 (10.00%) high severe

And one more time for 10k:

Gnuplot not found, using plotters backend
Benchmarking move_vec: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 31.8s.
move_vec                time:   [270.71 ms 274.46 ms 279.13 ms]
                        change: [-0.8937% +1.1058% +3.3414%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

Benchmarking move_arc_vec_st: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 31.7s.
move_arc_vec_st         time:   [268.94 ms 271.51 ms 274.57 ms]
                        change: [-1.0199% +0.3566% +1.7423%] (p = 0.65 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Benchmarking move_arc_vec_spawn: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 19.7s.
move_arc_vec_spawn      time:   [1.1255 ms 1.1702 ms 1.2027 ms]
                        change: [-6.1655% +2.7908% +17.422%] (p = 0.71 > 0.05)
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low severe
  1 (10.00%) high mild

BTW, you can also use Arc<[u8]> that's smaller than Vec<u8>, and has fewer levels of indirection than Arc<Vec<u8>>. The only downside is that converting Vec<u8> to Arc<[u8]> requires a copy, but you can also iter.collect() straight into Arc<[T]>.

7 Likes

@parasyte Why is the thread spawn example faster? It feels like there is some compiler optimization that is dramatically reducing the work

I don't think it can be faster. I tried showing earlier that I could only "reproduce" by intentionally sabotaging the benchmark.

I also provided a different benchmark methodology which shows that moving the two types are the same. And that the thread has synchronization overhead, as expected.