Super Slow Concurrent Vector Implementation

Hi All,

I'm implementing the lock-free vector from the paper Lock-free Dynamically Resizable Arrays by Dechev et al. (2006), and my implementation is super slow. I followed the algorithm from the paper pretty closely.

Most of the implementation is in src/leaky.rs (about 500 LOC excluding tests/benches), except for a few convenience functions defined in src/lib.rs and src/alloc_error.rs.

From the Miri tests I've run, the vector doesn't appear to trigger any UB. However, it does leak memory (that's OK for now; I haven't implemented a reclamation strategy yet). Note: at the moment, the vector can only be used with 64-bit types.

My main question is: why is my implementation so slow? :cry: There are some mini-benchmarks at the bottom of leaky.rs that show this. Here's some example output:

test leaky::bench::mutex1 ... bench: 49,746 ns/iter (+/- 1,849)
test leaky::bench::mutex2 ... bench: 117,679 ns/iter (+/- 2,044)
test leaky::bench::mutex3 ... bench: 201,409 ns/iter (+/- 5,257)
test leaky::bench::unlocked1 ... bench: 200,582 ns/iter (+/- 62,360)
test leaky::bench::unlocked2 ... bench: 425,359 ns/iter (+/- 52,239)
test leaky::bench::unlocked3 ... bench: 684,456 ns/iter (+/- 120,988)

The benchmark just consists of several threads each pushing to the vector 1000 times. These are the results for 1, 2, and 3 threads.
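For context, the benchmarks are roughly of the following shape (a simplified sketch, not the exact code from leaky.rs; `bench_mutex_push` is just an illustrative name). Each thread pushes 1000 elements; the mutex variant uses a Mutex<Vec<u64>>, and the unlocked variant swaps that out for the lock-free vector.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// N threads each push 1000 elements into a shared, Mutex-protected vector.
fn bench_mutex_push(n_threads: usize) {
    let vec = Arc::new(Mutex::new(Vec::new()));
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let vec = Arc::clone(&vec);
            thread::spawn(move || {
                for i in 0..1000u64 {
                    vec.lock().unwrap().push(i);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    assert_eq!(vec.lock().unwrap().len(), n_threads * 1000);
}

fn main() {
    bench_mutex_push(3);
}
```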

Thank you for your help! Any feedback not related to performance is of course welcome too :slight_smile:

(I haven't read the code yet, but) have you compiled in release mode?

Also, is there some reference implementation or something you can compare the performance of your implementation to?

Of course, remember that clean Miri runs don't necessarily mean UB-free code: Miri only reports errors for code paths that are actually executed.

I haven't specifically compiled in release mode, but it says here that cargo bench does that automatically.

Sadly, I haven't found a reference implementation so far in any language (the paper was geared towards C++). The paper is fairly old (2006), though, so maybe the algorithm is simply outdated.

Thanks for the reminder about Miri; one can never be too safe :four_leaf_clover:

I'd suggest benchmarking against a Mutex<Vec<&T>>. If you can't beat that with multiple competing threads, then you know your implementation is slow.

Note that, generally speaking, lock-free algorithms have lower average-case throughput but better worst-case latency compared to algorithms with fine-grained locks. It's possible that your benchmark results are accurate.

Will do! How would one construct a vector of references though? Would a reference to a static/const variable work?

Yes, the easy approach would be to create a Vec of &'static references. You can get them (for benchmarking purposes) by just leaking a Box (e.g. with Box::leak).
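For example, a minimal sketch of getting &'static references this way (`make_static` is just an illustrative helper, not part of any crate):

```rust
// Leak a heap allocation to obtain a &'static reference.
// This deliberately leaks memory, which is fine for a benchmark.
fn make_static(value: u64) -> &'static u64 {
    Box::leak(Box::new(value))
}

fn main() {
    // Build a Vec of &'static references suitable for pushing into
    // either vector under test.
    let refs: Vec<&'static u64> = (0..10).map(make_static).collect();
    assert_eq!(*refs[5], 5);
}
```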

OK, I made some simple benchmarks, and the performance doesn't look so bad compared to Mutex<Vec<&T>> after all. Both are linear in the number of threads, but the lock-free vector has a steeper slope. Thanks for the suggestion!
