Could we make `std::sync::RwLock::read` faster?

TL;DR:

I found what looks like some headroom to improve the performance of an RwLock. Is the observation real, and could we achieve the speedup?

Sorry for my poor English.

TOC:

  1. Strange time consumption of RwLock
  2. Original post and its potential speed
  3. Speed up by a Placeholder?

Strange time consumption of RwLock

A year ago I found someone complaining about the performance of a Rust program, and I eventually tracked it down: it is std::sync::RwLock::read that makes their program slow.

use std::sync::Arc;
use std::sync::RwLock;
use std::thread;
use std::time;
fn main() {
    for i in 1..=8 {
        workload(i);
    }
}

fn workload(concurrency: usize) {
    let total = 1000 * 1000;
    let mut m = (); // Let us pretend it is a type that could be modified.
    let m = Arc::new(RwLock::new(m));

    let now = time::Instant::now();
    let threads: Vec<_> = (0..concurrency)
        .map(|_| {
            let m = m.clone();
            thread::spawn(move || {
                for _ in 0..total {
                    let _x = m.read(); // only reads occur here, so the threads should run in parallel rather than serialize.
                    // But the timings show that, as the thread count increases, it gets slower.
                }
            })
        })
        .collect();

    for t in threads {
        t.join().unwrap();
    }
    let t = now.elapsed();
    println!(
        "threads: {}; time used: {:?}; ips: {}",
        concurrency,
        t,
        (total * concurrency) as f64 / t.as_secs_f64()
    );
}

Results on an 8-core, 16-thread laptop:

rustc --edition 2021 test.rs -C opt-level=3 -C target-cpu=native -C codegen-units=1 -C lto -o test && ./test
warning: variable does not need to be mutable
  --> test.rs:13:9
   |
13 |     let mut m = (); // Let us pretend it is a type that could be modified.
   |         ----^
   |         |
   |         help: remove this `mut`
   |
   = note: `#[warn(unused_mut)]` on by default

warning: 1 warning emitted

threads: 1; time used: 25.808535ms; ips: 38746871.916596584
threads: 2; time used: 152.874456ms; ips: 13082630.364355966
threads: 3; time used: 297.572933ms; ips: 10081562.088847645
threads: 4; time used: 436.665354ms; ips: 9160332.880451055
threads: 5; time used: 585.917125ms; ips: 8533630.076096512
threads: 6; time used: 722.995444ms; ips: 8298807.481835252
threads: 7; time used: 821.585243ms; ips: 8520114.083889406
threads: 8; time used: 923.248645ms; ips: 8665054.688490769

If only ONE reader at a time can actually make progress through the RwLock, haven't we just got an even slower version of a Mutex?
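For a baseline, the same hot loop can be run against a Mutex. This is a sketch I would use to reproduce the comparison; the shared counter and the thread/iteration parameters are my own illustrative choices, not from the post above:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Count lock acquisitions under a Mutex so the run is checkable.
fn contended_mutex_reads(threads: usize, iters: usize) -> u64 {
    let m = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let m = Arc::clone(&m);
            thread::spawn(move || {
                for _ in 0..iters {
                    // lock, bump, unlock: the Mutex analogue of the read() loop
                    *m.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let n = *m.lock().unwrap();
    n
}

fn main() {
    // 4 threads x 1_000_000 acquisitions, mirroring the benchmark shape above
    println!("{}", contended_mutex_reads(4, 1_000_000));
}
```

Timing this alongside the RwLock version shows whether the RwLock is really behaving like "an even slower Mutex" on your machine.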


Original post and its potential speed

The original benchmark (from a Chinese forum) with ArcSwap swapped in for RwLock, which shows that a considerably faster implementation is possible:

/*
$ cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/arcswap-test`
threads: 1; time used: 116.382621ms; ips: 8592348.165109634
threads: 2; time used: 130.590879ms; ips: 15315005.269242426
threads: 3; time used: 111.199132ms; ips: 26978627.854756996
threads: 4; time used: 112.026484ms; ips: 35705842.557729475
threads: 5; time used: 110.481194ms; ips: 45256570.99614619
threads: 6; time used: 109.655608ms; ips: 54716763.77919495
threads: 7; time used: 116.505931ms; ips: 60082778.10337398
threads: 8; time used: 111.219234ms; ips: 71930004.48105945
*/
use std::collections::HashMap;
use std::sync::Arc;
use arc_swap::ArcSwap; // change 1: the original was use std::sync::RwLock;
use std::thread;
use std::time;

fn main() {
    for i in 1..=8 {
        workload(i);
    }
}

fn workload(concurrency: usize) {
    let total = 1000 * 1000;
    let mut m = HashMap::new();
    for i in 0..total {
        m.insert(i, i);
    }
    let m = Arc::new(ArcSwap::from_pointee(m)); // change 2: the original was let m = Arc::new(RwLock::new(m));

    let now = time::Instant::now();
    let threads: Vec<_> = (0..concurrency)
        .map(|_| {
            let m = m.clone();
            thread::spawn(move || {
                for i in 0..total {
                    let _x = m.load().get(&i).unwrap(); // change 3: the original was let _x = m.read().unwrap().get(&i);
                }
            })
        })
        .collect();

    for t in threads {
        t.join().unwrap();
    }
    let t = now.elapsed();
    println!(
        "threads: {}; time used: {:?}; ips: {}",
        concurrency,
        t,
        (total * concurrency) as f64 / t.as_secs_f64()
    );
}

Speed up by a Placeholder?

Furthermore, something strange happened when I tried to figure out what makes the RwLock slow:

use std::sync::Arc;
use std::thread;
use std::time;

use std::sync::atomic::{AtomicBool, AtomicIsize};

fn main() {
    for i in 1..=8 {
        workload(i);
    }
}

fn workload(concurrency: usize) {
    let total = 1000 * 1000;
    let m = Arc::new((AtomicIsize::new(0),AtomicBool::new(false)));

    let now = time::Instant::now();
    let threads: Vec<_> = (0..concurrency)
        .map(|_| {
            let m = m.clone();
            thread::spawn(move || {
                for _ in 0..total {
                    if !m.1.load(std::sync::atomic::Ordering::Relaxed){ // check whether writer exists
                        m.0.fetch_add(1,std::sync::atomic::Ordering::SeqCst); // add read lock
                        m.0.fetch_sub(1,std::sync::atomic::Ordering::Relaxed); // remove read lock
                    }
                }
            })
        })
        .collect();

    for t in threads {
        t.join().unwrap();
    }
    let t = now.elapsed();
    println!(
        "threads: {}; time used: {:?}; ips: {}",
        concurrency,
        t,
        (total * concurrency) as f64 / t.as_secs_f64()
    );
}
/* 
rustc --edition 2021 testatomic.rs -C opt-level=3 -C target-cpu=native -C codegen-units=1 -C lto -o testatomic && ./testatomic
threads: 1; time used: 17.719521ms; ips: 56434934.10459572
threads: 2; time used: 88.080032ms; ips: 22706622.086604144
threads: 3; time used: 193.446272ms; ips: 15508182.034130903
threads: 4; time used: 289.305262ms; ips: 13826226.223289363
threads: 5; time used: 410.854842ms; ips: 12169748.263548516
threads: 6; time used: 506.291875ms; ips: 11850871.594571808
threads: 7; time used: 586.894164ms; ips: 11927193.060314704
threads: 8; time used: 671.969719ms; ips: 11905298.369553465
*/

This seems to have about the same performance as the RwLock, even if we relax the ordering to Relaxed.
But things get strange once we add a placeholder: change

let m = Arc::new((AtomicIsize::new(0),AtomicBool::new(false)));

to

let m = Arc::new((AtomicIsize::new(0),AtomicIsize::new(0)/*placeholder*/,AtomicBool::new(false)));

and change

m.1.load(std::sync::atomic::Ordering::Relaxed);

to

m.2.load(std::sync::atomic::Ordering::SeqCst); // even with SeqCst here, it is faster.

and we get a 2x performance gain:

rustc --edition 2021 testatomic.rs -C opt-level=3 -C target-cpu=native -C codegen-units=1 -C lto -o testatomic && ./testatomic
threads: 1; time used: 17.803156ms; ips: 56169816.183153145
threads: 2; time used: 77.515234ms; ips: 25801379.89391866
threads: 3; time used: 119.743647ms; ips: 25053521.211025085
threads: 4; time used: 155.41322ms; ips: 25737836.202094007
threads: 5; time used: 202.043937ms; ips: 24747092.509883136
threads: 6; time used: 232.283738ms; ips: 25830478.068163343
threads: 7; time used: 277.488703ms; ips: 25226252.18367899
threads: 8; time used: 323.873355ms; ips: 24701013.147561956
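A hypothetical explanation for the placeholder speedup (my reading, not stated in the thread) is false sharing: the spacer pushes the writer flag onto a different cache line from the hot reader counter, so loads of the flag stop contending with the fetch_add/fetch_sub traffic. The same effect can be stated explicitly with an alignment attribute; 64 bytes is an assumed (typical x86-64) cache-line size:

```rust
use std::sync::atomic::{AtomicBool, AtomicIsize};

// Force each field onto its own cache line instead of relying on a spacer.
#[repr(align(64))]
struct CacheAligned<T>(T);

struct LockState {
    readers: CacheAligned<AtomicIsize>, // hammered by fetch_add/fetch_sub
    writer: CacheAligned<AtomicBool>,   // only ever loaded on the read path
}

fn main() {
    // Each field occupies its own 64-byte-aligned slot, so the struct is two
    // cache lines wide and the two atomics can no longer share a line.
    println!("{}", std::mem::size_of::<LockState>());
}
```

The crossbeam-utils crate provides a `CachePadded` wrapper for exactly this purpose, with per-architecture padding sizes.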

Nice analysis!
Although I couldn't follow it all, I can tell you that it is a well-known problem that the std library's synchronization primitives aren't the fastest. parking_lot provides considerably faster versions of them. I'm not sure whether it does the same thing you're suggesting or something else, but I think it's worth looking at.

2 Likes

Thanks for your insight, but IMHO parking_lot's implementation isn't faster than std's here.

It seems that parking_lot mainly focuses on fairness rather than raw performance.

use std::collections::HashMap;
use std::sync::Arc;
//use arc_swap::ArcSwap; // change 1: the original used
//use std::sync::RwLock;
use parking_lot::RwLock;
use std::thread;
use std::time;

fn main() {
    for i in 1..=8 {
        workload(i);
    }
}

fn workload(concurrency: usize) {
    let total = 1000 * 1000;
    let mut m = HashMap::new();
    for i in 0..total {
        m.insert(i, i);
    }
//    let m = Arc::new(ArcSwap::from_pointee(m)); // change 2: the original was
    let m = Arc::new(RwLock::new(m));

    let now = time::Instant::now();
    let threads: Vec<_> = (0..concurrency)
        .map(|_| {
            let m = m.clone();
            thread::spawn(move || {
                for i in 0..total {
//                    let _x = m.load().get(&i).unwrap(); // change 3: the original was
                    let _x = m.read().get(&i).unwrap();
                }
            })
        })
        .collect();

    for t in threads {
        t.join().unwrap();
    }
    let t = now.elapsed();
    println!(
        "threads: {}; time used: {:?}; ips: {}",
        concurrency,
        t,
        (total * concurrency) as f64 / t.as_secs_f64()
    );
}
/*
$ cargo run --release
   Compiling arcswap-test v0.1.0 (/me/arcswap-test)
    Finished release [optimized] target(s) in 1.10s
     Running `target/release/arcswap-test`
threads: 1; time used: 79.88677ms; ips: 12517717.264072638
threads: 2; time used: 294.210531ms; ips: 6797853.201250638
threads: 3; time used: 444.605855ms; ips: 6747549.467156703
threads: 4; time used: 593.299307ms; ips: 6741959.669944465
threads: 5; time used: 714.134203ms; ips: 7001485.125618609
threads: 6; time used: 877.458533ms; ips: 6837929.969734536
threads: 7; time used: 1.004104751s; ips: 6971384.203718402
threads: 8; time used: 1.140719659s; ips: 7013116.62061923
*/

Which platform are you testing on? Note that until recently, most unix-like targets used the pthread implementation for Mutex and RwLock, but this has been rewritten to use futex APIs, coming soon in Rust 1.62. See the following issue for details:

4 Likes

I'm using Linux (Manjaro).

This is the version I tested the program with before:

$ rustc --version
rustc 1.63.0-nightly (4c5f6e627 2022-05-17)

After a rustup update, I tested them again:

$ uname -a
Linux DR722 5.17.9-1-MANJARO #1 SMP PREEMPT Wed May 18 09:20:53 UTC 2022 x86_64 GNU/Linux
$ rustc --version
rustc 1.63.0-nightly (ca122c7eb 2022-06-13)
$ rc testrw.rs 
rustc --edition 2021 testrw.rs -C opt-level=3 -C target-cpu=native -C codegen-units=1 -C lto -o testrw && ./testrw
warning: variable does not need to be mutable
  --> testrw.rs:13:9
   |
13 |     let mut m = (); // Let us pretend it is a type that could be modified.
   |         ----^
   |         |
   |         help: remove this `mut`
   |
   = note: `#[warn(unused_mut)]` on by default

warning: 1 warning emitted

threads: 1; time used: 25.800564ms; ips: 38758842.63615322
threads: 2; time used: 192.598684ms; ips: 10384286.945595121
threads: 3; time used: 338.35728ms; ips: 8866367.527248122
threads: 4; time used: 415.785155ms; ips: 9620353.088363629
threads: 5; time used: 529.165653ms; ips: 9448836.99774067
threads: 6; time used: 685.666015ms; ips: 8750615.997790119
threads: 7; time used: 828.375365ms; ips: 8450275.437633276
threads: 8; time used: 904.825809ms; ips: 8841480.780528886

It seems that either the futex implementation is not enabled, or it is not significantly faster than the old std RwLock.

I'm not sure about this, but I think your use case is not what an RwLock is designed for. What it does "better" than other locks is that multiple readers can hold the lock simultaneously, but in this case every lock is released immediately, so a plain Mutex may actually perform better, as it has less "housekeeping" to do.

To let the RwLock do its thing, try to hold a lock across a few operations: instead of 1000 * 1000 * (lock, do operation, release), do 1000 * (lock, 1000 * operation, release).
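That batching suggestion can be sketched as follows; the u64 sum and the 4-thread / 1000-batch split are my illustrative choices, not from the thread:

```rust
use std::sync::{Arc, RwLock};
use std::thread;

// Amortize lock traffic by acquiring the read lock once per batch.
fn batched_sum(threads: usize, batches: usize, per_batch: usize) -> u64 {
    let m = Arc::new(RwLock::new(1u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let m = Arc::clone(&m);
            thread::spawn(move || {
                let mut local = 0u64;
                for _ in 0..batches {
                    let guard = m.read().unwrap(); // one lock acquisition...
                    for _ in 0..per_batch {
                        local += *guard; // ...amortized over many reads
                    }
                }
                local
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // 4 * 1000 * 1000 reads, but only 4 * 1000 lock acquisitions.
    println!("{}", batched_sum(4, 1000, 1000));
}
```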

Or, in case holding a lock is not actually required, try a plain Mutex or, if possible, an AtomicX.

arc_swap is also a good candidate for some use cases that look a bit like this.

2 Likes

These two versions should both have the futex implementation. You could try stable 1.61 to see how it behaved under the former pthread implementation.

1.61 (or earlier) is what was tried a year ago.

My question is: if we check an atomic bool, take a read lock (simulated by a fetch_add), confirm the lock (check the write flag), and release it (simulated by a fetch_sub), all three atomic ops can be SeqCst and the whole test procedure still finishes in about 300ms (threads: 8; time used: 307.750834ms; ips: 25995055.467502).

use std::sync::Arc;
use std::thread;
use std::time;

use std::sync::atomic::{AtomicBool, AtomicIsize};

fn main() {
    for i in 1..=8 {
        workload(i);
    }
}

fn workload(concurrency: usize) {
    let total = 1000 * 1000;
    let m = Arc::new((AtomicIsize::new(0),AtomicIsize::new(0),AtomicBool::new(false)));

    let now = time::Instant::now();
    let threads: Vec<_> = (0..concurrency)
        .map(|_| {
            let m = m.clone();
            thread::spawn(move || {
                for _ in 0..total {
                    m.0.fetch_add(1,std::sync::atomic::Ordering::SeqCst); // take the read lock; since we read first, this is safe
                    if !m.2.load(std::sync::atomic::Ordering::SeqCst){ // check whether a writer exists
                        // the lock is acquired
                        m.0.fetch_sub(1,std::sync::atomic::Ordering::SeqCst); // release the read lock
                    }else{
                        // dead code for now, since we never take the write lock
                    }
                }
            })
        })
        .collect();

    for t in threads {
        t.join().unwrap();
    }
    let t = now.elapsed();
    println!(
        "threads: {}; time used: {:?}; ips: {}",
        concurrency,
        t,
        (total * concurrency) as f64 / t.as_secs_f64()
    );
}

What does the RwLock do that makes it 3x slower?

I invite you to write a complete implementation, rather than trying to simulate it. For one, your code doesn't handle the possibility that a writer appeared between that m.2.load and m.0.fetch_sub. A real RwLock needs to handle that atomically and wake the waiting writer.
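A minimal sketch of what "handle that atomically" means, assuming the single-word encoding used later in this thread (sign bit = writer lock): the writer check and the reader-count update happen in one atomic read-modify-write, so no writer can slip in between two separate atomic operations. A real blocking lock would additionally futex-wake a waiting writer in `read_unlock`; that part is omitted here.

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

// Refuse to enter while the writer bit (MSB) is set; the check and the
// increment are a single atomic RMW, so no writer can appear "in between".
fn try_read_lock(state: &AtomicIsize) -> bool {
    state
        .fetch_update(Ordering::Acquire, Ordering::Relaxed, |s| {
            if s >= 0 { Some(s + 1) } else { None }
        })
        .is_ok()
}

fn read_unlock(state: &AtomicIsize) {
    // The decrement is one atomic op; a blocking lock would wake a queued
    // writer here when the count reaches zero.
    let prev = state.fetch_sub(1, Ordering::Release);
    debug_assert!(prev > 0);
}

fn main() {
    let state = AtomicIsize::new(0);
    assert!(try_read_lock(&state));
    assert!(try_read_lock(&state)); // readers may share
    read_unlock(&state);
    read_unlock(&state);
    assert_eq!(state.load(Ordering::Relaxed), 0);
    state.store(isize::MIN, Ordering::Relaxed); // simulate a writer holding the lock
    assert!(!try_read_lock(&state)); // readers are refused
    println!("ok");
}
```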

Here it is.

I didn't write the code that waits out a read-write conflict, and this simple lock only allows one writer (since I reserve only one writer bit, multiple writers might enter together).

#![feature(negative_impls)]
use std::sync::atomic::{AtomicIsize, Ordering};
use std::cell::UnsafeCell;
use std::ops::{Deref,DerefMut};


unsafe impl<T: ?Sized + Send> Send for RwLock<T> {}
unsafe impl<T: ?Sized + Send + Sync> Sync for RwLock<T> {}

#[derive(std::fmt::Debug)]
struct RwLock<T: ?Sized>(
    AtomicIsize,// highest flag is the writer lock.
    UnsafeCell<T>//data
);

impl<'a,T:?Sized> RwLock<T>{
    fn get(&self)->*mut T{self.1.get()}
    pub fn slow_read(&'a self)->RwLockReadGuard<'a,T>{
        while let Err(_)=self.0.fetch_update(Ordering::SeqCst,Ordering::SeqCst,|x|if x>=0{Some(x+1)}else{None}){
            /*wait*/
        }
        RwLockReadGuard{lock:self}
    }
    pub fn read(&'a self)->RwLockReadGuard<'a,T>{
        if self.0.fetch_add(1,Ordering::SeqCst)<0 {
            /*in queue*/
            while self.0.load(Ordering::SeqCst)<0{
                /*wait*/
            }
        }
        RwLockReadGuard{lock:self}
    }
    pub fn write(&'a self)->RwLockWriteGuard<'a,T>{
        while self.0.compare_exchange(0,isize::MIN,Ordering::SeqCst,Ordering::SeqCst).is_err(){
            /*wait*/
        }
        RwLockWriteGuard{lock:self}
    }
    fn read_unlock(&'a self){
        self.0.fetch_sub(1,Ordering::SeqCst);
    }
    fn write_unlock(&'a self){
        self.0.fetch_xor(isize::MIN,Ordering::SeqCst);
    }
}



struct RwLockReadGuard<'a, T: ?Sized + 'a> {
    lock:&'a RwLock<T>
}
struct RwLockWriteGuard<'a, T: ?Sized + 'a> {
    lock:&'a RwLock<T>
}

impl<T: ?Sized> !Send for RwLockReadGuard<'_, T> {}

unsafe impl<T: ?Sized + Sync> Sync for RwLockReadGuard<'_, T> {}


impl<T: ?Sized> !Send for RwLockWriteGuard<'_, T> {}

unsafe impl<T: ?Sized + Sync> Sync for RwLockWriteGuard<'_, T> {}


impl<T: ?Sized> Deref for RwLockReadGuard<'_, T> {
    type Target = T;

    fn deref(&self) -> &T {
        unsafe { &*self.lock.get() }
    }
}

impl<T: ?Sized> Deref for RwLockWriteGuard<'_, T> {
    type Target = T;

    fn deref(&self) -> &T {
        unsafe { &*self.lock.get() }
    }
}

impl<T: ?Sized> DerefMut for RwLockWriteGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        unsafe { &mut *self.lock.get() }
    }
}

impl<T: ?Sized> Drop for RwLockReadGuard<'_, T> {
    fn drop(&mut self) {
        self.lock.read_unlock();
    }
}

impl<T: ?Sized> Drop for RwLockWriteGuard<'_, T> {
    fn drop(&mut self) {
        self.lock.write_unlock();
    }
}

impl<T> RwLock<T>{
    fn new(t:T)->Self{Self(AtomicIsize::new(0),UnsafeCell::new(t))}
}

use std::sync::Arc;
use std::thread;
use std::time;
fn main() {
    for i in 1..=8 {
        workload(i);
    }
}

fn workload(concurrency: usize) {
    let total = 1000 * 1000;
    let mut m = (); // Let us pretend it is a type that could be modified.
    let m = Arc::new(RwLock::new(m));

    let now = time::Instant::now();
    let threads: Vec<_> = (0..concurrency)
        .map(|x| {
            let m = m.clone();
            thread::spawn(move || {
                for _ in 0..total {
                    let _x = m.read(); // only reads occur here, so the threads should run in parallel rather than serialize.
                    // But the timings show that, as the thread count increases, it gets slower.
                }
                println!("thread {x} stop at {:?}",m)
            })
        })
        .collect();

    for t in threads {
        t.join().unwrap();
    }
    let t = now.elapsed();
    println!(
        "threads: {}; time used: {:?}; ips: {}",
        concurrency,
        t,
        (total * concurrency) as f64 / t.as_secs_f64()
    );
}

Test result (using the fast read version: a reader needs only one atomic op to acquire the lock, but will try to take it even if a writer exists):

Result: fast (~300ms rather than ~1s) and correct (no deadlock occurs, and the counter returns to 0 after all threads finish their work, which suggests my guard impls work).

thread 0 stop at RwLock(0, UnsafeCell { .. })
threads: 1; time used: 18.057426ms; ips: 55378878.473598614
thread 0 stop at RwLock(0, UnsafeCell { .. })
thread 1 stop at RwLock(0, UnsafeCell { .. })
threads: 2; time used: 74.064488ms; ips: 27003494.5762401
thread 2 stop at RwLock(1, UnsafeCell { .. })
thread 0 stop at RwLock(0, UnsafeCell { .. })
thread 1 stop at RwLock(0, UnsafeCell { .. })
threads: 3; time used: 108.150251ms; ips: 27739186.661711954
thread 0 stop at RwLock(3, UnsafeCell { .. })
thread 1 stop at RwLock(2, UnsafeCell { .. })
thread 3 stop at RwLock(0, UnsafeCell { .. })
thread 2 stop at RwLock(0, UnsafeCell { .. })
threads: 4; time used: 151.612219ms; ips: 26383097.78976324
thread 2 stop at RwLock(3, UnsafeCell { .. })
thread 0 stop at RwLock(2, UnsafeCell { .. })
thread 1 stop at RwLock(1, UnsafeCell { .. })
thread 4 stop at RwLock(1, UnsafeCell { .. })
thread 3 stop at RwLock(0, UnsafeCell { .. })
threads: 5; time used: 194.625087ms; ips: 25690418.830744054
thread 0 stop at RwLock(4, UnsafeCell { .. })
thread 2 stop at RwLock(2, UnsafeCell { .. })
thread 3 stop at RwLock(1, UnsafeCell { .. })
thread 1 stop at RwLock(2, UnsafeCell { .. })
thread 4 stop at RwLock(1, UnsafeCell { .. })
thread 5 stop at RwLock(0, UnsafeCell { .. })
threads: 6; time used: 237.383013ms; ips: 25275608.073944196
thread 1 stop at RwLock(2, UnsafeCell { .. })
thread 6 stop at RwLock(3, UnsafeCell { .. })
thread 5 stop at RwLock(2, UnsafeCell { .. })
thread 4 stop at RwLock(2, UnsafeCell { .. })
thread 0 stop at RwLock(1, UnsafeCell { .. })
thread 2 stop at RwLock(0, UnsafeCell { .. })
thread 3 stop at RwLock(0, UnsafeCell { .. })
threads: 7; time used: 277.655132ms; ips: 25211131.339722544
thread 7 stop at RwLock(2, UnsafeCell { .. })
thread 3 stop at RwLock(5, UnsafeCell { .. })
thread 2 stop at RwLock(2, UnsafeCell { .. })
thread 1 stop at RwLock(2, UnsafeCell { .. })
thread 5 stop at RwLock(0, UnsafeCell { .. })
thread 4 stop at RwLock(2, UnsafeCell { .. })
thread 0 stop at RwLock(0, UnsafeCell { .. })
thread 6 stop at RwLock(0, UnsafeCell { .. })
threads: 8; time used: 309.183294ms; ips: 25874619.215357736

Even if we change thread 0 from a reader to a writer, the run is still very fast (although this is mostly because the writer waits for all readers to finish reading):

// this RwLock, we always have 0 lock after all threads finishes their work.
thread 0 stop at RwLock(0, UnsafeCell { .. })
threads: 1; time used: 36.031348ms; ips: 27753610.550457343
thread 1 stop at RwLock(-9223372036854775808, UnsafeCell { .. })
thread 0 stop at RwLock(0, UnsafeCell { .. })
threads: 2; time used: 137.578549ms; ips: 14537149.97386693
thread 2 stop at RwLock(1, UnsafeCell { .. })
thread 1 stop at RwLock(1, UnsafeCell { .. })
thread 0 stop at RwLock(0, UnsafeCell { .. })
threads: 3; time used: 248.20223ms; ips: 12086917.994250093
thread 1 stop at RwLock(0, UnsafeCell { .. })
thread 3 stop at RwLock(1, UnsafeCell { .. })
thread 2 stop at RwLock(1, UnsafeCell { .. })
thread 0 stop at RwLock(0, UnsafeCell { .. })
threads: 4; time used: 279.040162ms; ips: 14334854.06305061
thread 3 stop at RwLock(3, UnsafeCell { .. })
thread 1 stop at RwLock(2, UnsafeCell { .. })
thread 2 stop at RwLock(1, UnsafeCell { .. })
thread 4 stop at RwLock(-9223372036854775808, UnsafeCell { .. })
thread 0 stop at RwLock(0, UnsafeCell { .. })
threads: 5; time used: 318.851729ms; ips: 15681269.835610647
thread 4 stop at RwLock(3, UnsafeCell { .. })
thread 5 stop at RwLock(2, UnsafeCell { .. })
thread 1 stop at RwLock(1, UnsafeCell { .. })
thread 3 stop at RwLock(1, UnsafeCell { .. })
thread 2 stop at RwLock(0, UnsafeCell { .. })
thread 0 stop at RwLock(0, UnsafeCell { .. })
threads: 6; time used: 369.740098ms; ips: 16227615.1070853
thread 4 stop at RwLock(4, UnsafeCell { .. })
thread 6 stop at RwLock(2, UnsafeCell { .. })
thread 3 stop at RwLock(2, UnsafeCell { .. })
thread 1 stop at RwLock(1, UnsafeCell { .. })
thread 5 stop at RwLock(1, UnsafeCell { .. })
thread 2 stop at RwLock(-9223372036854775808, UnsafeCell { .. })
thread 0 stop at RwLock(0, UnsafeCell { .. })
threads: 7; time used: 409.843766ms; ips: 17079679.089226406
thread 6 stop at RwLock(3, UnsafeCell { .. })
thread 1 stop at RwLock(2, UnsafeCell { .. })
thread 2 stop at RwLock(1, UnsafeCell { .. })
thread 4 stop at RwLock(3, UnsafeCell { .. })
thread 5 stop at RwLock(1, UnsafeCell { .. })
thread 3 stop at RwLock(1, UnsafeCell { .. })
thread 7 stop at RwLock(0, UnsafeCell { .. })
thread 0 stop at RwLock(0, UnsafeCell { .. })
threads: 8; time used: 452.241421ms; ips: 17689666.688005567
// std::sync::RwLock
thread 0 stop at RwLock { data: (), poisoned: false, .. }
threads: 1; time used: 43.693939ms; ips: 22886469.4483141
thread 1 stop at RwLock { data: <locked>, poisoned: false, .. }
thread 0 stop at RwLock { data: (), poisoned: false, .. }
threads: 2; time used: 249.640683ms; ips: 8011514.693700786
thread 2 stop at RwLock { data: (), poisoned: false, .. }
thread 1 stop at RwLock { data: (), poisoned: false, .. }
thread 0 stop at RwLock { data: (), poisoned: false, .. }
threads: 3; time used: 422.014603ms; ips: 7108758.745962163
thread 2 stop at RwLock { data: (), poisoned: false, .. }
thread 1 stop at RwLock { data: (), poisoned: false, .. }
thread 3 stop at RwLock { data: <locked>, poisoned: false, .. }
thread 0 stop at RwLock { data: (), poisoned: false, .. }
threads: 4; time used: 696.15142ms; ips: 5745876.378446517
thread 2 stop at RwLock { data: (), poisoned: false, .. }
thread 4 stop at RwLock { data: (), poisoned: false, .. }
thread 3 stop at RwLock { data: (), poisoned: false, .. }
thread 1 stop at RwLock { data: (), poisoned: false, .. }
thread 0 stop at RwLock { data: (), poisoned: false, .. }
threads: 5; time used: 637.371658ms; ips: 7844716.559392417
thread 4 stop at RwLock { data: (), poisoned: false, .. }
thread 3 stop at RwLock { data: (), poisoned: false, .. }
thread 2 stop at RwLock { data: (), poisoned: false, .. }
thread 5 stop at RwLock { data: (), poisoned: false, .. }
thread 1 stop at RwLock { data: (), poisoned: false, .. }
thread 0 stop at RwLock { data: (), poisoned: false, .. }
threads: 6; time used: 833.199777ms; ips: 7201154.11168671
thread 2 stop at RwLock { data: (), poisoned: false, .. }
thread 4 stop at RwLock { data: (), poisoned: false, .. }
thread 6 stop at RwLock { data: <locked>, poisoned: false, .. }
thread 3 stop at RwLock { data: <locked>, poisoned: false, .. }
thread 5 stop at RwLock { data: <locked>, poisoned: false, .. }
thread 1 stop at RwLock { data: <locked>, poisoned: false, .. }
thread 0 stop at RwLock { data: (), poisoned: false, .. }
threads: 7; time used: 897.798439ms; ips: 7796850.268303931
thread 1 stop at RwLock { data: (), poisoned: false, .. }
thread 4 stop at RwLock { data: (), poisoned: false, .. }
thread 3 stop at RwLock { data: <locked>, poisoned: false, .. }
thread 5 stop at RwLock { data: (), poisoned: false, .. }
thread 6 stop at RwLock { data: (), poisoned: false, .. }
thread 2 stop at RwLock { data: (), poisoned: false, .. }
thread 7 stop at RwLock { data: (), poisoned: false, .. }
thread 0 stop at RwLock { data: (), poisoned: false, .. }
threads: 8; time used: 1.03370633s; ips: 7739141.928249584

This is essentially a spin lock, which has different performance characteristics than one that blocks and sleeps. Your unlock methods don't have to worry about waking other threads, and as you noted you're also starving any writers. If you proceed to implement a real /*wait*/, then your performance will change.

I don't want to discourage you -- it may well be that you can come up with a better overall design, but you do need to implement the entire feature before you can compare it. The little things that you're skipping do matter. Maybe instead of writing it from scratch, you could try modifying the actual futex_rwlock.rs, and if you do find an improvement then you can send a pull request!

1 Like

Well, submitting a PR is quite a good suggestion, but how do I modify the std crate?

Compiling the std crate is not a pleasant experience.

What's more, I'm writing a spin-lock-like RwLock just for convenience, not to cheat.

For the basic comparison, no write ever occurs, thus no sleep/wake should occur either.
As for starving writers, isn't that the default behavior of the default RwLock?

It is not only my version: with the std version as well, the writer ends after all the readers.

Update: my fault. The write thread has more work to do, so it finishes last no matter what the setting is. The slow_read function won't starve the writer (although the fast read lock does starve writers).

/* results and the code that was tested.
write done
thread 0 stop at RwLock(0, UnsafeCell { .. })
threads: 1; time used: 33.984693ms; ips: 29425012.01938178
write done
thread 1 stop at RwLock(0, UnsafeCell { .. })
thread 0 stop at RwLock(0, UnsafeCell { .. })
threads: 2; time used: 127.366221ms; ips: 15702750.574659823
write done
thread 0 stop at RwLock(1, UnsafeCell { .. })
thread 1 stop at RwLock(1, UnsafeCell { .. })
thread 2 stop at RwLock(0, UnsafeCell { .. })
threads: 3; time used: 212.245925ms; ips: 14134546.988357963
write done
thread 2 stop at RwLock(0, UnsafeCell { .. })
thread 0 stop at RwLock(1, UnsafeCell { .. })
thread 1 stop at RwLock(0, UnsafeCell { .. })
thread 3 stop at RwLock(0, UnsafeCell { .. })
threads: 4; time used: 284.918678ms; ips: 14039093.639203254
write done
thread 4 stop at RwLock(0, UnsafeCell { .. })
thread 1 stop at RwLock(1, UnsafeCell { .. })
thread 3 stop at RwLock(1, UnsafeCell { .. })
thread 2 stop at RwLock(1, UnsafeCell { .. })
thread 0 stop at RwLock(0, UnsafeCell { .. })
threads: 5; time used: 352.860565ms; ips: 14169903.060717482
write done
thread 4 stop at RwLock(1, UnsafeCell { .. })
thread 1 stop at RwLock(1, UnsafeCell { .. })
thread 0 stop at RwLock(2, UnsafeCell { .. })
thread 3 stop at RwLock(1, UnsafeCell { .. })
thread 5 stop at RwLock(1, UnsafeCell { .. })
thread 2 stop at RwLock(0, UnsafeCell { .. })
threads: 6; time used: 438.080775ms; ips: 13696104.331444357
thread 0 stop at RwLock(2, UnsafeCell { .. })
thread 5 stop at RwLock(2, UnsafeCell { .. })
thread 4 stop at RwLock(0, UnsafeCell { .. })
thread 2 stop at RwLock(1, UnsafeCell { .. })
write done
thread 1 stop at RwLock(0, UnsafeCell { .. })
thread 3 stop at RwLock(0, UnsafeCell { .. })
thread 6 stop at RwLock(0, UnsafeCell { .. })
threads: 7; time used: 490.712981ms; ips: 14264957.869537182
thread 2 stop at RwLock(1, UnsafeCell { .. })
write done
thread 5 stop at RwLock(3, UnsafeCell { .. })
thread 6 stop at RwLock(2, UnsafeCell { .. })
thread 7 stop at RwLock(1, UnsafeCell { .. })
thread 1 stop at RwLock(0, UnsafeCell { .. })
thread 4 stop at RwLock(0, UnsafeCell { .. })
thread 0 stop at RwLock(1, UnsafeCell { .. })
thread 3 stop at RwLock(0, UnsafeCell { .. })
threads: 8; time used: 592.902396ms; ips: 13492945.978919605
*/
#![feature(negative_impls)]
use std::sync::atomic::{AtomicIsize, Ordering};
use std::cell::UnsafeCell;
use std::ops::{Deref,DerefMut};
use std::thread::sleep;

unsafe impl<T: ?Sized + Send> Send for RwLock<T> {}
unsafe impl<T: ?Sized + Send + Sync> Sync for RwLock<T> {}

#[derive(std::fmt::Debug)]
struct RwLock<T: ?Sized>(
    AtomicIsize,// highest flag is the writer lock.
    UnsafeCell<T>//data
);

impl<'a,T:?Sized> RwLock<T>{
    fn get(&self)->*mut T{self.1.get()}
    pub fn read(&'a self)->RwLockReadGuard<'a,T>{
        while let Err(_)=self.0.fetch_update(Ordering::SeqCst,Ordering::SeqCst,|x|if x>=0{Some(x+1)}else{None}){
            /*wait*/
            sleep(std::time::Duration::from_nanos(1));
        }
        RwLockReadGuard{lock:self}
    }
    pub fn fast_read(&'a self)->RwLockReadGuard<'a,T>{
        if self.0.fetch_add(1,Ordering::SeqCst)<0 {
            /*in queue*/
            while self.0.load(Ordering::SeqCst)<0{
                /*wait*/
                sleep(std::time::Duration::from_nanos(100));
            }
        }
        RwLockReadGuard{lock:self}
    }
    pub fn write(&'a self)->RwLockWriteGuard<'a,T>{
        while self.0.compare_exchange(0,isize::MIN,Ordering::SeqCst,Ordering::SeqCst).is_err(){
            /*wait*/
            sleep(std::time::Duration::from_nanos(10));
        }
        RwLockWriteGuard{lock:self}
    }
    fn read_unlock(&'a self){
        self.0.fetch_sub(1,Ordering::SeqCst);
    }
    fn write_unlock(&'a self){
        self.0.fetch_xor(isize::MIN,Ordering::SeqCst);
    }
}



struct RwLockReadGuard<'a, T: ?Sized + 'a> {
    lock:&'a RwLock<T>
}
struct RwLockWriteGuard<'a, T: ?Sized + 'a> {
    lock:&'a RwLock<T>
}

impl<T: ?Sized> !Send for RwLockReadGuard<'_, T> {}

unsafe impl<T: ?Sized + Sync> Sync for RwLockReadGuard<'_, T> {}


impl<T: ?Sized> !Send for RwLockWriteGuard<'_, T> {}

unsafe impl<T: ?Sized + Sync> Sync for RwLockWriteGuard<'_, T> {}


impl<T: ?Sized> Deref for RwLockReadGuard<'_, T> {
    type Target = T;

    fn deref(&self) -> &T {
        unsafe { &*self.lock.get() }
    }
}

impl<T: ?Sized> Deref for RwLockWriteGuard<'_, T> {
    type Target = T;

    fn deref(&self) -> &T {
        unsafe { &*self.lock.get() }
    }
}

impl<T: ?Sized> DerefMut for RwLockWriteGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        unsafe { &mut *self.lock.get() }
    }
}

impl<T: ?Sized> Drop for RwLockReadGuard<'_, T> {
    fn drop(&mut self) {
        self.lock.read_unlock();
    }
}

impl<T: ?Sized> Drop for RwLockWriteGuard<'_, T> {
    fn drop(&mut self) {
        self.lock.write_unlock();
    }
}

impl<T> RwLock<T>{
    fn new(t:T)->Self{Self(AtomicIsize::new(0),UnsafeCell::new(t))}
}

//use std::sync::RwLock;
use std::sync::Arc;
use std::thread;
use std::time;
fn main() {
    for i in 1..=8 {
        workload(i);
    }
}

fn workload(concurrency: usize) {
    let total = 1000 * 1000;
    let mut m = (); // Let us pretend it is a type that could be modified.
    let m = Arc::new(RwLock::new(m));

    let now = time::Instant::now();
    let mut threads: Vec<_> = (0..concurrency)
        .map(|x| {
            let m = m.clone();
            thread::spawn(move || {
                for _ in 0..total {
                    let _x = m.read(); // only reads occur here, so the threads should run in parallel rather than serialize.
                    // But the timings show that, as the thread count increases, it gets slower.
                }
                println!("thread {x} stop at {:?}",m)
            })
        })
        .collect();
        threads.push(thread::spawn(move||{
            for _ in 0..total/2{
                let _=m.write();
            }
            println!("write done");
        }));

    for t in threads {
        t.join().unwrap();
    }
    let t = now.elapsed();
    println!(
        "threads: {}; time used: {:?}; ips: {}",
        concurrency,
        t,
        (total * concurrency) as f64 / t.as_secs_f64()
    );
}

There are people in t-compiler/help on Zulip who can help you with that.

Or you could copy that code into a standalone crate. You'll also need to copy the futex module for those syscall wrappers, but otherwise I think it should work fine on its own.

I was not trying to accuse you of anything, but I'm saying it's not sufficient for a performance comparison, unless your goal is to compare spin locks vs. blocking locks. If you want to demonstrate a better blocking RwLock, you need to make yours block as well.

Yes, all current readers need to unlock, but the current std version does not let any new readers lock when there's a writer waiting. Writers get priority access in that sense. Your version always admits new readers when it's already read locked, even in slow_read.
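That writer-preference behavior can be sketched in a few lines. The two-bit encoding below is my own assumption for illustration, not std's actual state layout: a new reader backs off not only when a writer holds the lock, but also when one is merely queued, so a steady stream of readers cannot starve the writer.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const WRITER_HELD: u64 = 1 << 63;    // a writer currently holds the lock
const WRITER_WAITING: u64 = 1 << 62; // a writer is queued, waiting

// Readers are refused while a writer holds *or* waits for the lock.
fn reader_may_enter(state: &AtomicU64) -> bool {
    state.load(Ordering::Acquire) & (WRITER_HELD | WRITER_WAITING) == 0
}

fn main() {
    let state = AtomicU64::new(0);
    assert!(reader_may_enter(&state));
    state.fetch_or(WRITER_WAITING, Ordering::Release); // a writer announces itself
    assert!(!reader_may_enter(&state)); // new readers now hold back
    println!("ok");
}
```

A lock that only checks `WRITER_HELD` (as the custom `read`/`slow_read` above do, via the sign bit) keeps admitting readers forever, which is exactly the starvation being described.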

1 Like