TL;DR:
I found what looks like room to improve the performance of RwLock under concurrent readers. Is the effect real? Could we achieve it?
Sorry for my poor English.
TOC:
- Strange time consumption of RwLock
- Original post and its potential speed
- Speed up by a placeholder?
Strange time consumption of RwLock

A year ago I found someone complaining about the performance of Rust, and it finally turned out that std::sync::RwLock::read
was what made their program slow.
use std::sync::Arc;
use std::sync::RwLock;
use std::thread;
use std::time;

fn main() {
    for i in 1..=8 {
        workload(i);
    }
}
fn workload(concurrency: usize) {
    let total = 1000 * 1000;
    let mut m = (); // Let us pretend it is a type that could be modified.
    let m = Arc::new(RwLock::new(m));
    let now = time::Instant::now();
    let threads: Vec<_> = (0..concurrency)
        .map(|_| {
            let m = m.clone();
            thread::spawn(move || {
                for _ in 0..total {
                    // Only reads happen here, so they should proceed in parallel
                    // rather than serialize; yet the timings show it gets slower
                    // as the thread count increases.
                    let _x = m.read();
                }
            })
        })
        .collect();
    for t in threads {
        t.join().unwrap();
    }
    let t = now.elapsed();
    println!(
        "threads: {}; time used: {:?}; ips: {}",
        concurrency,
        t,
        (total * concurrency) as f64 / t.as_secs_f64()
    );
}
Results on an 8-core / 16-thread laptop:
rustc --edition 2021 test.rs -C opt-level=3 -C target-cpu=native -C codegen-units=1 -C lto -o test && ./test
warning: variable does not need to be mutable
--> test.rs:13:9
|
13 | let mut m = (); // Let us pretend it is a type that could be modified.
| ----^
| |
| help: remove this `mut`
|
= note: `#[warn(unused_mut)]` on by default
warning: 1 warning emitted
threads: 1; time used: 25.808535ms; ips: 38746871.916596584
threads: 2; time used: 152.874456ms; ips: 13082630.364355966
threads: 3; time used: 297.572933ms; ips: 10081562.088847645
threads: 4; time used: 436.665354ms; ips: 9160332.880451055
threads: 5; time used: 585.917125ms; ips: 8533630.076096512
threads: 6; time used: 722.995444ms; ips: 8298807.481835252
threads: 7; time used: 821.585243ms; ips: 8520114.083889406
threads: 8; time used: 923.248645ms; ips: 8665054.688490769
If only ONE reader at a time can effectively take the RwLock, why do we end up with what amounts to an even slower version of Mutex?
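My guess at the mechanism: an RwLock has to track the number of active readers in shared state, so every read()/guard-drop pair performs atomic read-modify-writes on one shared word. The following is a minimal sketch of just that RMW traffic (the counter is my own stand-in, not std's actual internals); timing it with increasing thread counts should show the same falling curve as the benchmark above:

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let iters = 100_000;
    let threads = 4;
    let shared = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                for _ in 0..iters {
                    // The same kind of RMW a reader-count update performs:
                    // each core must pull the cache line into exclusive state,
                    // so the "read-only" threads serialize on this one line.
                    shared.fetch_add(1, Ordering::SeqCst);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("total: {}", shared.load(Ordering::SeqCst));
}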
Original post and its potential speed

The original script came from a Chinese forum; swapping its RwLock for ArcSwap shows that faster code is possible:
/*
$ cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/arcswap-test`
threads: 1; time used: 116.382621ms; ips: 8592348.165109634
threads: 2; time used: 130.590879ms; ips: 15315005.269242426
threads: 3; time used: 111.199132ms; ips: 26978627.854756996
threads: 4; time used: 112.026484ms; ips: 35705842.557729475
threads: 5; time used: 110.481194ms; ips: 45256570.99614619
threads: 6; time used: 109.655608ms; ips: 54716763.77919495
threads: 7; time used: 116.505931ms; ips: 60082778.10337398
threads: 8; time used: 111.219234ms; ips: 71930004.48105945
*/
use std::collections::HashMap;
use std::sync::Arc;
use arc_swap::ArcSwap; // first change; the original was: use std::sync::RwLock;
use std::thread;
use std::time;

fn main() {
    for i in 1..=8 {
        workload(i);
    }
}
fn workload(concurrency: usize) {
    let total = 1000 * 1000;
    let mut m = HashMap::new();
    for i in 0..total {
        m.insert(i, i);
    }
    let m = Arc::new(ArcSwap::from_pointee(m)); // second change; the original was: let m = Arc::new(RwLock::new(m));
    let now = time::Instant::now();
    let threads: Vec<_> = (0..concurrency)
        .map(|_| {
            let m = m.clone();
            thread::spawn(move || {
                for i in 0..total {
                    let _x = m.load().get(&i).unwrap(); // third change; the original was: let _x = m.read().unwrap().get(&i);
                }
            })
        })
        .collect();
    for t in threads {
        t.join().unwrap();
    }
    let t = now.elapsed();
    println!(
        "threads: {}; time used: {:?}; ips: {}",
        concurrency,
        t,
        (total * concurrency) as f64 / t.as_secs_f64()
    );
}
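What ArcSwap buys here, roughly: load() avoids a contended RMW on the readers' hot path (it is mostly plain loads plus per-thread bookkeeping), while writers publish a whole new value atomically. A small sketch of the writer side, which the benchmark above never exercises (same arc_swap crate):

use std::sync::Arc;
use arc_swap::ArcSwap;

fn main() {
    let m = ArcSwap::from_pointee(vec![1, 2, 3]);

    // Readers take a cheap guard; no writer can invalidate it mid-use.
    let snapshot = m.load();
    assert_eq!(snapshot.len(), 3);

    // A writer publishes a whole new value atomically.
    m.store(Arc::new(vec![4, 5]));

    // A reader that already holds a guard keeps seeing the old Vec...
    assert_eq!(snapshot.len(), 3);
    drop(snapshot);

    // ...while fresh loads observe the new one.
    assert_eq!(**m.load(), vec![4, 5]);
    println!("ok");
}

The trade-off versus RwLock is that writers must rebuild (or clone-and-modify) the whole value instead of mutating it in place, which is why this pattern fits read-mostly data.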
Speed up by a placeholder?

Furthermore, a strange thing showed up when I tried to figure out what makes RwLock slow:
use std::sync::Arc;
use std::thread;
use std::time;
use std::sync::atomic::{AtomicBool, AtomicIsize};

fn main() {
    for i in 1..=8 {
        workload(i);
    }
}
fn workload(concurrency: usize) {
    let total = 1000 * 1000;
    let m = Arc::new((AtomicIsize::new(0), AtomicBool::new(false)));
    let now = time::Instant::now();
    let threads: Vec<_> = (0..concurrency)
        .map(|_| {
            let m = m.clone();
            thread::spawn(move || {
                for _ in 0..total {
                    if !m.1.load(std::sync::atomic::Ordering::Relaxed) { // check whether a writer exists
                        m.0.fetch_add(1, std::sync::atomic::Ordering::SeqCst); // take the read lock
                        m.0.fetch_sub(1, std::sync::atomic::Ordering::Relaxed); // release the read lock
                    }
                }
            })
        })
        .collect();
    for t in threads {
        t.join().unwrap();
    }
    let t = now.elapsed();
    println!(
        "threads: {}; time used: {:?}; ips: {}",
        concurrency,
        t,
        (total * concurrency) as f64 / t.as_secs_f64()
    );
}
/*
rustc --edition 2021 testatomic.rs -C opt-level=3 -C target-cpu=native -C codegen-units=1 -C lto -o testatomic && ./testatomic
threads: 1; time used: 17.719521ms; ips: 56434934.10459572
threads: 2; time used: 88.080032ms; ips: 22706622.086604144
threads: 3; time used: 193.446272ms; ips: 15508182.034130903
threads: 4; time used: 289.305262ms; ips: 13826226.223289363
threads: 5; time used: 410.854842ms; ips: 12169748.263548516
threads: 6; time used: 506.291875ms; ips: 11850871.594571808
threads: 7; time used: 586.894164ms; ips: 11927193.060314704
threads: 8; time used: 671.969719ms; ips: 11905298.369553465
*/
This seems to have the same performance profile as RwLock,
even if we change the ordering to Relaxed.
But things get strange when we add a placeholder: change

let m = Arc::new((AtomicIsize::new(0), AtomicBool::new(false)));

to

let m = Arc::new((AtomicIsize::new(0), AtomicIsize::new(0) /* placeholder */, AtomicBool::new(false)));

and change

m.1.load(std::sync::atomic::Ordering::Relaxed);

to

m.2.load(std::sync::atomic::Ordering::SeqCst); // even SeqCst here is faster than the two-field version

and we get roughly a 2x performance gain:
rustc --edition 2021 testatomic.rs -C opt-level=3 -C target-cpu=native -C codegen-units=1 -C lto -o testatomic && ./testatomic
threads: 1; time used: 17.803156ms; ips: 56169816.183153145
threads: 2; time used: 77.515234ms; ips: 25801379.89391866
threads: 3; time used: 119.743647ms; ips: 25053521.211025085
threads: 4; time used: 155.41322ms; ips: 25737836.202094007
threads: 5; time used: 202.043937ms; ips: 24747092.509883136
threads: 6; time used: 232.283738ms; ips: 25830478.068163343
threads: 7; time used: 277.488703ms; ips: 25226252.18367899
threads: 8; time used: 323.873355ms; ips: 24701013.147561956
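I don't know the real cause, but one hypothesis is false sharing: the reader counter and the writer flag sit close together in memory, so the readers' fetch_add traffic keeps invalidating the cache line the flag is loaded from. A way to test this without the mystery placeholder is to force each atomic onto its own cache line explicitly. The Line wrapper and the 64-byte line size below are my assumptions, not anything from std:

use std::sync::atomic::{AtomicBool, AtomicIsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time;

// Hypothetical wrapper: 64 bytes is a common cache-line size.
#[repr(align(64))]
struct Line<T>(T);

struct State {
    readers: Line<AtomicIsize>, // hot: every reader RMWs this
    writer: Line<AtomicBool>,   // only loaded on this benchmark's hot path
}

fn main() {
    for i in 1..=8 {
        workload(i);
    }
}

fn workload(concurrency: usize) {
    let total = 1000 * 1000;
    let m = Arc::new(State {
        readers: Line(AtomicIsize::new(0)),
        writer: Line(AtomicBool::new(false)),
    });
    let now = time::Instant::now();
    let threads: Vec<_> = (0..concurrency)
        .map(|_| {
            let m = Arc::clone(&m);
            thread::spawn(move || {
                for _ in 0..total {
                    if !m.writer.0.load(Ordering::SeqCst) {
                        m.readers.0.fetch_add(1, Ordering::SeqCst);
                        m.readers.0.fetch_sub(1, Ordering::Relaxed);
                    }
                }
            })
        })
        .collect();
    for t in threads {
        t.join().unwrap();
    }
    // Sanity check: every add is paired with a sub.
    assert_eq!(m.readers.0.load(Ordering::SeqCst), 0);
    let t = now.elapsed();
    println!(
        "threads: {}; time used: {:?}; ips: {}",
        concurrency,
        t,
        (total * concurrency) as f64 / t.as_secs_f64()
    );
}

If this padded version matches the placeholder numbers, false sharing is the likely culprit. One caveat: the placeholder tuple is still only 24 bytes, so it should normally fit in a single 64-byte line anyway; the alignment the allocator happens to give the Arc allocation may also play a role, which is why I am not sure this is the full explanation.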