Why is moving an Arc<Vec<[u8; N]>> much faster than Vec<[u8; N]>?

Moving an Arc<Vec<_>> across threads is hundreds of times faster. Is the compiler optimizing away the results? If I change the number of vecs moved from 10k to 1k, the behaviour is normal again.

Results for NUM_VECS = 10000:

Benchmarking move_vec: Warming up for 3.0000 s
move_vec                time:   [275.65 ms 282.99 ms 291.80 ms]

Benchmarking move_arc_vec_st: Warming up for 3.0000 s
move_arc_vec            time:   [277.06 ms 280.58 ms 284.38 ms]

Benchmarking move_arc_vec_spawn: Warming up for 3.0000 s
move_arc_vec #2         time:   [1.0580 ms 1.1929 ms 1.2892 ms]

Results for NUM_VECS = 1000:

move_vec                time:   [36.155 ms 36.847 ms 37.509 ms]

move_arc_vec_st         time:   [35.767 ms 36.193 ms 36.636 ms]

move_arc_vec_spawn      time:   [36.266 ms 36.568 ms 36.914 ms]

Code:

use std::{sync::Arc, thread};

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::{thread_rng, RngCore};

const VEC_SIZE: usize = 500;
const ELEMENT_SIZE: usize = 1232;
const NUM_VECS: usize = 10000;

fn prepare_vec() -> Vec<[u8; ELEMENT_SIZE]> {
    let mut vec = vec![[0u8; ELEMENT_SIZE]; VEC_SIZE];
    let mut rng = thread_rng();

    for arr in vec.iter_mut() {
        rng.fill_bytes(arr);
    }

    vec
}

fn benchmark_move_vec(c: &mut Criterion) {
    c.bench_function("move_vec", |b| {
        b.iter_batched(
            || {
                // Setup: Create channel and pre-fill with NUM_VECS Vecs
                let (sender, receiver) =
                    crossbeam_channel::bounded::<Vec<[u8; ELEMENT_SIZE]>>(100000);
                for _ in 0..NUM_VECS {
                    sender.send(prepare_vec()).unwrap();
                }
                receiver
            },
            |receiver| {
                // Benchmark: Receive the Vecs in a separate thread
                let handle = thread::spawn(move || {
                    for _i in 0..NUM_VECS {
                        let vec = receiver.recv().unwrap();
                        black_box(vec);
                    }
                });
                handle.join().unwrap();
            },
            criterion::BatchSize::LargeInput,
        )
    });
}

fn benchmark_move_arc_vec_slow(c: &mut Criterion) {
    c.bench_function("move_arc_vec_st", |b| {
        b.iter_batched(
            || {
                // Setup: Create channel and pre-fill with NUM_VECS Arc<Vec>s
                let (sender, receiver) =
                    crossbeam_channel::bounded::<Arc<Vec<[u8; ELEMENT_SIZE]>>>(100000);
                for _ in 0..NUM_VECS {
                    sender.send(Arc::new(prepare_vec())).unwrap();
                }
                receiver
            },
            |receiver| {
                for _i in 0..NUM_VECS {
                    let arc_vec = receiver.recv().unwrap();
                    black_box(arc_vec);
                }
            },
            criterion::BatchSize::LargeInput,
        )
    });
}

fn benchmark_move_arc_vec_fast(c: &mut Criterion) {
    c.bench_function("move_arc_vec_spawn", |b| {
        b.iter_batched(
            || {
                // Setup: Create channel and pre-fill with NUM_VECS Arc<Vec>s
                let (sender, receiver) =
                    crossbeam_channel::bounded::<Arc<Vec<[u8; ELEMENT_SIZE]>>>(100000);
                for _ in 0..NUM_VECS {
                    sender.send(Arc::new(prepare_vec())).unwrap();
                }
                receiver
            },
            |receiver| {
                // Benchmark: Receive the Arc<Vec>s in a separate thread
                let handle = thread::spawn(move || {
                    for _i in 0..NUM_VECS {
                        let arc_vec = receiver.recv().unwrap();
                        black_box(arc_vec);
                    }
                });
                handle.join().unwrap();
            },
            criterion::BatchSize::LargeInput,
        )
    });
}

criterion_group!(
    benches,
    benchmark_move_vec,
    benchmark_move_arc_vec_slow,
    benchmark_move_arc_vec_fast,
);
criterion_main!(benches);

“Moving a value between threads” doesn’t actually do anything to the value beyond what moving it within one thread does, so there should be no difference. I suspect you are making some kind of measurement error. Could you rerun your benchmark and make sure it’s exactly the same code you show? In particular:

Benchmarking move_arc_vec_spawn: Warming up for 3.0000 s
move_arc_vec #2         time:   [1.0580 ms 1.1929 ms 1.2892 ms]

The name doesn’t match here, so something must have been edited. Please post the un-edited results of a run of exactly the code you shared, to make sure no errors were introduced.

5 Likes

This part is speculation

I can only reproduce your findings by removing the join() in the third benchmark. 🤷
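
For concreteness, here is roughly what that sabotage looks like, using the constants and imports from your code: the third benchmark with the join() removed. The timed closure then returns as soon as the thread is spawned, so the receive loop never counts toward the measurement.

fn benchmark_move_arc_vec_no_join(c: &mut Criterion) {
    c.bench_function("move_arc_vec_no_join", |b| {
        b.iter_batched(
            || {
                let (sender, receiver) =
                    crossbeam_channel::bounded::<Arc<Vec<[u8; ELEMENT_SIZE]>>>(100000);
                for _ in 0..NUM_VECS {
                    sender.send(Arc::new(prepare_vec())).unwrap();
                }
                receiver
            },
            |receiver| {
                // Spawned but never joined: only the spawn itself is timed,
                // and the receives happen after the closure has returned.
                thread::spawn(move || {
                    for _ in 0..NUM_VECS {
                        black_box(receiver.recv().unwrap());
                    }
                });
            },
            criterion::BatchSize::LargeInput,
        )
    });
}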

This part is wisdom

Criterion spits out a lot of sample count/target time warnings because your tests are performing a large number of iterations. Don't do that; that's criterion's job. [1]

When you isolate the code to measure, you'll notice that the benchmarks are measuring thread::spawn(). And that becomes significant at the micro level. On my system (macOS) it takes about 3x longer to spawn a thread than to send a small type to one that is already waiting.
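
You can get a rough feel for the spawn overhead with a crude one-off measurement (a sketch, not a proper benchmark):

use std::thread;
use std::time::Instant;

fn main() {
    let start = Instant::now();
    for _ in 0..1_000 {
        // Spawn and immediately join an empty thread.
        thread::spawn(|| {}).join().unwrap();
    }
    println!("spawn + join average: {:?}", start.elapsed() / 1_000);
}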

When I fix those issues, the benchmarks run much faster and produce more accurate results. It shows that, at least on my system, the benchmark is completely dominated by thread synchronization.

move_vec                time:   [9.0705 µs 9.1496 µs 9.2271 µs]
move_arc_vec_st         time:   [167.28 ns 169.76 ns 172.62 ns]
move_arc_vec_spawn      time:   [9.0562 µs 9.1570 µs 9.2499 µs]

Fixed benchmark code:
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use crossbeam_channel::Sender;
use rand::{thread_rng, RngCore};
use std::{sync::Arc, thread};

const VEC_SIZE: usize = 500;
const ELEMENT_SIZE: usize = 1232;

fn prepare_vec() -> Vec<[u8; ELEMENT_SIZE]> {
    let mut vec = vec![[0u8; ELEMENT_SIZE]; VEC_SIZE];
    let mut rng = thread_rng();

    for arr in vec.iter_mut() {
        rng.fill_bytes(arr);
    }

    vec
}

fn benchmark_move_vec(c: &mut Criterion) {
    c.bench_function("move_vec", |b| {
        b.iter_batched(
            || {
                let (sender, receiver) =
                    crossbeam_channel::bounded::<(Vec<[u8; ELEMENT_SIZE]>, Sender<()>)>(1);
                // Do not measure thread spawn
                thread::spawn(move || {
                    let (_, tx) = receiver.recv().unwrap();
                    tx.send(()).unwrap();
                });
                let (tx, rx) = crossbeam_channel::bounded(1);
                (prepare_vec(), sender, tx, rx)
            },
            |(vec, sender, tx, rx)| {
                sender.send((black_box(vec), tx)).unwrap();
                rx.recv().unwrap();
            },
            criterion::BatchSize::LargeInput,
        )
    });
}

fn benchmark_move_arc_vec_st(c: &mut Criterion) {
    c.bench_function("move_arc_vec_st", |b| {
        b.iter_batched(
            || {
                let (sender, receiver) = crossbeam_channel::bounded(1);
                (Arc::new(prepare_vec()), sender, receiver)
            },
            |(vec, sender, receiver)| {
                sender.send(black_box(vec)).unwrap();
                let arc_vec = receiver.recv().unwrap();
                black_box(arc_vec);
            },
            criterion::BatchSize::LargeInput,
        )
    });
}

fn benchmark_move_arc_vec_spawn(c: &mut Criterion) {
    c.bench_function("move_arc_vec_spawn", |b| {
        b.iter_batched(
            || {
                let (sender, receiver) =
                    crossbeam_channel::bounded::<(Arc<Vec<[u8; ELEMENT_SIZE]>>, Sender<()>)>(1);
                // Do not measure thread spawn
                thread::spawn(move || {
                    let (_, tx) = receiver.recv().unwrap();
                    tx.send(()).unwrap();
                });
                let (tx, rx) = crossbeam_channel::bounded(1);
                (Arc::new(prepare_vec()), sender, tx, rx)
            },
            |(arc_vec, sender, tx, rx)| {
                sender.send((black_box(arc_vec), tx)).unwrap();
                rx.recv().unwrap();
            },
            criterion::BatchSize::LargeInput,
        )
    });
}

criterion_group!(
    benches,
    benchmark_move_vec,
    benchmark_move_arc_vec_st,
    benchmark_move_arc_vec_spawn,
);
criterion_main!(benches);

Still, the results are not terribly interesting. Vec<T> is 24 bytes on a 64-bit system, and Arc<Vec<T>> is 8 bytes. Sending either is always inexpensive. There is no measurable difference.
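
You can check those sizes directly (assuming a typical 64-bit target):

use std::sync::Arc;

fn main() {
    // Vec<T> is (pointer, capacity, length): three words = 24 bytes.
    assert_eq!(std::mem::size_of::<Vec<[u8; 1232]>>(), 24);
    // Arc<Vec<T>> is a single pointer to a shared allocation: 8 bytes.
    assert_eq!(std::mem::size_of::<Arc<Vec<[u8; 1232]>>>(), 8);
}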


  1. If you want to increase the number of iterations, use Criterion::sample_size() and Criterion::measurement_time(), as those warnings suggest.
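
For example, a sketch of how those knobs are set when building the group (target names taken from the fixed code above):

criterion_group! {
    name = benches;
    // sample_size has a minimum of 10; measurement_time extends the
    // per-benchmark target time instead of looping inside the closure.
    config = Criterion::default()
        .sample_size(10)
        .measurement_time(std::time::Duration::from_secs(32));
    targets = benchmark_move_vec, benchmark_move_arc_vec_st, benchmark_move_arc_vec_spawn
}
criterion_main!(benches);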

7 Likes

Here is the re-run for 10k:

Benchmarking move_vec: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 31.7s.
move_vec                time:   [268.17 ms 271.46 ms 275.19 ms]
                        change: [+620.77% +636.72% +653.20%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Benchmarking move_arc_vec_st: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 31.7s.
move_arc_vec_st         time:   [268.32 ms 270.54 ms 273.30 ms]
                        change: [+636.79% +647.50% +659.50%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Benchmarking move_arc_vec_spawn: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 19.8s.
move_arc_vec_spawn      time:   [1.0019 ms 1.1384 ms 1.2372 ms]
                        change: [-97.280% -96.887% -96.600%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 10 measurements (40.00%)
  1 (10.00%) low severe
  1 (10.00%) low mild
  1 (10.00%) high mild
  1 (10.00%) high severe

And one more time for 10k:

Gnuplot not found, using plotters backend
Benchmarking move_vec: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 31.8s.
move_vec                time:   [270.71 ms 274.46 ms 279.13 ms]
                        change: [-0.8937% +1.1058% +3.3414%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

Benchmarking move_arc_vec_st: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 31.7s.
move_arc_vec_st         time:   [268.94 ms 271.51 ms 274.57 ms]
                        change: [-1.0199% +0.3566% +1.7423%] (p = 0.65 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Benchmarking move_arc_vec_spawn: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 19.7s.
move_arc_vec_spawn      time:   [1.1255 ms 1.1702 ms 1.2027 ms]
                        change: [-6.1655% +2.7908% +17.422%] (p = 0.71 > 0.05)
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low severe
  1 (10.00%) high mild

BTW, you can also use Arc<[u8]>, which is smaller than Vec<u8> and has fewer levels of indirection than Arc<Vec<u8>>. The only downside is that converting a Vec<u8> to Arc<[u8]> requires a copy, but you can also .collect() an iterator straight into Arc<[T]>.
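
A minimal sketch of both conversions (sizes assume a 64-bit target):

use std::sync::Arc;

fn main() {
    // From an existing Vec: the bytes are copied into the Arc's allocation.
    let v: Vec<u8> = vec![1, 2, 3];
    let a: Arc<[u8]> = Arc::from(v);

    // Or collect an iterator straight into Arc<[T]>.
    let b: Arc<[u8]> = (0u8..10).collect();

    // Arc<[u8]> is a fat pointer (16 bytes); Vec<u8> is 24 bytes.
    assert_eq!(std::mem::size_of::<Arc<[u8]>>(), 16);
    assert_eq!(std::mem::size_of::<Vec<u8>>(), 24);

    assert_eq!(&a[..], &[1, 2, 3]);
    assert_eq!(b.len(), 10);
}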

7 Likes

@parasyte Why is the thread spawn example faster? It feels like there is some compiler optimization that is dramatically reducing the work.

I don't think it actually is faster. As I tried to show earlier, I could only "reproduce" your result by intentionally sabotaging the benchmark.

I also provided a different benchmark methodology, which shows that moving the two types costs the same, and that the spawned thread has synchronization overhead, as expected.
